Batch Set Review of Similar Documents
This use case describes how to use H5MA's near-duplicate identification output to group and batch similar documents, which helps expedite review and improves the consistency of document review assessments.
What is Near Duplicate Identification?
"Near duplicates" are documents that have been identified and grouped based on their text similarity. Each document belongs to only one near-duplicate set, and each near-duplicate set contains at least one document. Similarity parameters are configurable before running a Near Duplicate analytics set.
Only one near-duplicate identification field matters when batching documents so that near-duplicate groups stay together:
- Near Duplicate Group: A Near Duplicate group contains the full set of documents that have been identified as containing similar text, based on the similarity thresholds that have been set
Batching Rules for Grouping Similar Content (Near Duplicates) for Reviewers
We recommend the following settings for the source saved search of a near-duplicate batch set intended to group similar documents during review:
Saved Search Criteria:
- NDGroup is set
Sort Criteria:
- NDGroup (asc)
- Control Number (asc)
NOTE: Consider sorting by "Last Modified Date" instead of Control Number to put the documents in each near-duplicate group in chronological or "version" order.
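The effect of the saved search criteria and sort order above can be illustrated with a minimal Python sketch. This is not H5MA code: the record layout, the sample values, and the function name `near_duplicate_review_order` are assumptions made only for illustration, with dictionary keys chosen to mirror the NDGroup and Control Number fields.

```python
# Hypothetical document records (illustration only, not exported H5MA data).
documents = [
    {"ControlNumber": "DOC0004", "NDGroup": "ND002"},
    {"ControlNumber": "DOC0001", "NDGroup": "ND001"},
    {"ControlNumber": "DOC0003", "NDGroup": None},  # NDGroup is not set
    {"ControlNumber": "DOC0002", "NDGroup": "ND002"},
    {"ControlNumber": "DOC0005", "NDGroup": "ND001"},
]


def near_duplicate_review_order(docs):
    """Keep documents where NDGroup is set, then sort by NDGroup and Control Number."""
    selected = [d for d in docs if d.get("NDGroup")]
    # Both keys ascending, matching the recommended sort criteria.
    return sorted(selected, key=lambda d: (d["NDGroup"], d["ControlNumber"]))


ordered = near_duplicate_review_order(documents)
for doc in ordered:
    print(doc["NDGroup"], doc["ControlNumber"])
```

With this sample data, the document without an NDGroup value is excluded and the remaining documents appear group by group. To model the "version" ordering from the note, the second element of the sort key could be replaced with a last-modified timestamp, assuming the records carry one.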
Batch Set Grouping (keeps the documents in each near-duplicate group together in the same batch):
- Family Field: NDGroup
All other settings can be configured according to the needs of the project.
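To show what grouping on NDGroup means for batch contents, the following Python sketch fills batches from an already sorted list without splitting a group. It is a simplified model under stated assumptions, not a description of how H5MA builds batches internally: the function name `build_batches`, the sample records, and the choice to let a group larger than the batch size form its own oversized batch are all hypothetical.

```python
from itertools import groupby

# Hypothetical records, already sorted by NDGroup and then Control Number.
sorted_docs = [
    {"ControlNumber": "DOC0001", "NDGroup": "ND001"},
    {"ControlNumber": "DOC0005", "NDGroup": "ND001"},
    {"ControlNumber": "DOC0002", "NDGroup": "ND002"},
    {"ControlNumber": "DOC0004", "NDGroup": "ND002"},
    {"ControlNumber": "DOC0006", "NDGroup": "ND002"},
    {"ControlNumber": "DOC0007", "NDGroup": "ND003"},
]


def build_batches(docs, batch_size):
    """Fill batches up to batch_size without ever splitting an NDGroup."""
    batches = []
    current = []
    # groupby relies on the input being sorted by NDGroup.
    for _, members in groupby(docs, key=lambda d: d["NDGroup"]):
        family = list(members)
        # Start a new batch if adding this whole group would exceed the size.
        if current and len(current) + len(family) > batch_size:
            batches.append(current)
            current = []
        # Assumption: a group larger than batch_size is kept whole anyway.
        current.extend(family)
    if current:
        batches.append(current)
    return batches


batches = build_batches(sorted_docs, batch_size=3)
for number, batch in enumerate(batches, start=1):
    print(number, [d["ControlNumber"] for d in batch])
```

In this model every near-duplicate group lands in exactly one batch, so a single reviewer sees all similar documents together; the trade-off is that batch sizes can vary around the configured size.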