Batch Set Review of Similar Documents

This use case document explains how to use H5MA's near-duplicate identification output to group and batch similar documents, which helps expedite review and improves the consistency of review assessments.

What is Near Duplicate Identification?

"Near duplicates" are documents that have been identified and grouped based on their text similarity.  Each document belongs to only one near-duplicate set, and each near-duplicate set contains at least one document.  Similarity parameters are configurable before running a Near Duplicate analytics set. 

Only one near-duplicate identification feature matters when batching documents so that near-duplicate groups stay together:

  • Near Duplicate Group:  A Near Duplicate group contains the full set of documents identified as having similar text, based on the similarity thresholds that have been set.
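
To make the grouping idea concrete, here is a minimal sketch of threshold-based grouping in which every document lands in exactly one group. It is not H5MA's algorithm: it assumes a simple greedy comparison against the first document of each group, using Python's standard `difflib.SequenceMatcher` as a stand-in similarity measure. The function name, the `threshold` parameter, and the sample texts are all hypothetical.

```python
from difflib import SequenceMatcher


def group_near_duplicates(texts, threshold=0.9):
    """Greedily assign each document to exactly one group.

    texts: dict mapping a document ID to its extracted text.
    threshold: minimum similarity ratio (0.0 to 1.0) needed to join a group.
    Illustrative only; the platform's real similarity logic is not shown here.
    """
    groups = []  # each group is a list of document IDs; the first ID is its anchor
    for doc_id, text in texts.items():
        for group in groups:
            anchor_text = texts[group[0]]
            # Join the first group whose anchor is similar enough.
            if SequenceMatcher(None, anchor_text, text).ratio() >= threshold:
                group.append(doc_id)
                break
        else:
            # No similar group found: the document starts its own group,
            # so every group contains at least one document.
            groups.append([doc_id])
    return groups


sample_texts = {
    "A": "The quarterly report is attached for your review.",
    "B": "The quarterly report is attached for your review!",
    "C": "Lunch on Friday?",
}
sample_groups = group_near_duplicates(sample_texts, threshold=0.9)
```

In this sample, documents A and B differ by a single character and end up in the same group, while C forms a group of its own. Raising or lowering `threshold` changes how aggressively documents are grouped, which mirrors the configurable similarity parameters described above.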

Batching Rules for Grouping Similar Content (Near Duplicates) for Reviewers

We recommend the following settings for the source saved search of a near-duplicate batch set intended to keep similar documents together during review:

Saved Search Criteria:

  • NDGroup is set

Sort Criteria:

  • NDGroup (asc)
  • Control Number (asc)

NOTE:  Consider sorting by your "Last Modified Date" field instead of your Control Number to put your near-duplicate documents in chronological or "version" order.

Batch Set Grouping, to keep the near-duplicate documents together:

  • Family Field: NDGroup

All other settings can be configured according to the needs of the project.
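
The combined effect of these settings can be sketched in a few lines of Python. This is only an illustration of the intended outcome, not the platform's batching logic: it assumes documents are plain dictionaries, that `ControlNumber` is a hypothetical key name for the Control Number field, and that a group which does not fit in the current batch starts a new one. A group larger than the maximum batch size is kept whole in its own batch here, which is also an assumption.

```python
from itertools import groupby
from operator import itemgetter


def batch_by_nd_group(docs, max_batch_size, sort_field="ControlNumber"):
    """Build batches that never split a near-duplicate group.

    docs: list of dicts with an "NDGroup" key and a sort_field key (hypothetical names).
    max_batch_size: target number of documents per batch.
    """
    # Saved search criterion: NDGroup is set.
    eligible = [d for d in docs if d.get("NDGroup")]
    # Sort criteria: NDGroup (asc), then the secondary field (asc).
    eligible.sort(key=itemgetter("NDGroup", sort_field))

    batches, current = [], []
    for _, members in groupby(eligible, key=itemgetter("NDGroup")):
        group = list(members)
        # Close the current batch if this whole group would not fit in it.
        if current and len(current) + len(group) > max_batch_size:
            batches.append(current)
            current = []
        current.extend(group)
    if current:
        batches.append(current)
    return batches


sample_docs = [
    {"ControlNumber": "DOC005", "NDGroup": "ND002"},
    {"ControlNumber": "DOC002", "NDGroup": "ND001"},
    {"ControlNumber": "DOC001", "NDGroup": "ND001"},
    {"ControlNumber": "DOC004", "NDGroup": "ND002"},
    {"ControlNumber": "DOC003", "NDGroup": None},  # excluded: NDGroup is not set
    {"ControlNumber": "DOC006", "NDGroup": "ND003"},
]
sample_batches = batch_by_nd_group(sample_docs, max_batch_size=3)
```

With a maximum batch size of 3, the sample yields two batches: one holding both ND001 documents, and one holding the two ND002 documents plus the single ND003 document. Passing a different `sort_field` (for example, a key holding the last modified date) orders documents within each group chronologically instead of by Control Number.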