Near Duplicate Pivot Documents to Reduce Review Volume

This use case describes how to use H5MA's near-duplicate identification output to focus review on pivot documents, reducing review volume to one document per near-duplicate group.

What is Near Duplicate Identification?

"Near duplicates" are documents that have been identified and grouped based on their text similarity. Each document belongs to only one near-duplicate set, and each near-duplicate set contains at least one document. Similarity parameters can be configured before running a Near Duplication analytics set.

Two features of near-duplicate identification matter when reducing review volume to near-duplicate pivot documents only:

  • Near Duplicate Pivot: The pivot is the median-sized document in a near-duplicate group. The median-sized document tends to be the one that shares the most similarities with all other documents in the group. Note: once assigned, a pivot does not change, for example after an incremental run.
  • Near Duplicate Group: A near-duplicate group contains the full set of documents identified as containing similar text, based on the similarity thresholds that have been set.
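
To make the "median-sized document" idea concrete, here is a minimal Python sketch of picking one pivot per group. It is an illustration only, not the analytics engine's actual implementation: the function name `pick_pivots`, the tuple layout, and the rule of taking the lower middle document in even-sized groups are all assumptions.

```python
from collections import defaultdict

def pick_pivots(docs):
    """Pick one pivot per group: the document of median text size.

    docs is an iterable of (doc_id, group_id, text_size) tuples.
    Hypothetical helper for illustration; tie-breaking is an assumption.
    """
    groups = defaultdict(list)
    for doc_id, group_id, text_size in docs:
        groups[group_id].append((text_size, doc_id))

    pivots = {}
    for group_id, members in groups.items():
        members.sort()  # order by text size, then by doc_id
        # Middle element; for even-sized groups take the lower middle (assumption).
        pivots[group_id] = members[(len(members) - 1) // 2][1]
    return pivots

docs = [
    ("A", "G1", 100),
    ("B", "G1", 120),
    ("C", "G1", 300),
    ("D", "G2", 50),   # a group may contain a single document
]
pivots = pick_pivots(docs)
```

In this sample, group G1 has three documents and the middle one by size is chosen, while the single-document group G2 is its own pivot.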

Saved Search Rules for Returning Pivot Near Duplicate Documents for Reviewers

We recommend the following settings for a source saved search that returns only a single pivot document per near-duplicate group:

Saved Search Criteria:

  • NDGroup is set
  • ::Is Pivot = Y

Figure: Relativity search criteria for near-duplicate pivot documents

All other settings can be configured according to the needs of the project.
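
The two saved search criteria above amount to a simple filter. The sketch below expresses that logic in Python over plain dictionaries, which can help when checking an exported document list outside the platform. It does not use any Relativity API; the dictionary layout, the `id` key, and the function name `pivot_review_set` are assumptions, and the keys `NDGroup` and `Is Pivot` simply mirror the field names in the criteria.

```python
def pivot_review_set(documents):
    """Keep documents that have a near-duplicate group and are flagged as pivot.

    Mirrors the two criteria: NDGroup is set AND Is Pivot = Y.
    Hypothetical helper operating on exported records, not a platform API.
    """
    return [
        doc for doc in documents
        if doc.get("NDGroup") and doc.get("Is Pivot") == "Y"
    ]

documents = [
    {"id": "DOC-1", "NDGroup": "G1", "Is Pivot": "Y"},
    {"id": "DOC-2", "NDGroup": "G1", "Is Pivot": "N"},
    {"id": "DOC-3", "NDGroup": "G2", "Is Pivot": "Y"},
    {"id": "DOC-4", "NDGroup": None, "Is Pivot": "N"},  # no group, so excluded
]
review_set = pivot_review_set(documents)
review_ids = [doc["id"] for doc in review_set]
```

With this sample data, four documents reduce to two for review, one per near-duplicate group.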