Near and Text Duplicate: Settings

Near Duplicate (ND) identifies and groups near-duplicate documents based on extracted text similarity. Each document belongs to only one ND set, and each ND set contains at least one document. Similarity parameters are configurable before running. By default, the document with the median similarity to the rest of the set is assigned as the pivot. Text Duplicate (TD) identifies and groups documents with identical extracted text, even documents that would never hash together.

What is Near Duplicate Identification?

Near duplicates are documents that have been identified and grouped based on their text similarity. Each document belongs to only one near-duplicate set and each near-duplicate set contains at least one document. Similarity parameters are configurable before running a Near Duplicate analytics set. 

Two primary features of near-duplicate identification are important when working with near-duplicate documents:

  • Near Duplicate Group
  • Near Duplicate Pivot

Near Duplicate Group

MA uses a median similarity threshold that adjusts the comparison percentage when two very large documents or two very small documents are compared, which improves the accuracy of the grouping results at those extremes. For example, if your median similarity is 80% and the two documents being compared are very small, the comparison threshold could be increased to 85%. If the two documents are very large, the comparison threshold could be decreased to 75%. Most documents are therefore compared at the standard fixed similarity percentage, while very small and very large documents get a more balanced and nuanced comparison.
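The adjustment described above can be sketched as follows. This is a minimal illustration, not MA's actual logic: the function name, the size cutoffs, and the fixed 5-point adjustment are all hypothetical, chosen only to reproduce the 80% / 85% / 75% example.

```python
def effective_threshold(base_pct: int, size_a: int, size_b: int,
                        small_cutoff: int = 2_000,
                        large_cutoff: int = 200_000,
                        adjustment_pct: int = 5) -> int:
    """Return the similarity percentage to apply to one pair of documents.

    Sizes are character counts of extracted text. small_cutoff, large_cutoff
    and adjustment_pct are hypothetical values used for illustration only.
    """
    if max(size_a, size_b) < small_cutoff:
        # Both documents are very small: require a closer match.
        return min(100, base_pct + adjustment_pct)
    if min(size_a, size_b) > large_cutoff:
        # Both documents are very large: tolerate more differences.
        return max(0, base_pct - adjustment_pct)
    # Any other pair is compared at the configured percentage.
    return base_pct
```

With a base of 80, two 500-character documents would be compared at 85, two 300,000-character documents at 75, and a mixed pair at 80.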

A document can be compared to any document in a near-duplicate group for similarity, not just to the largest document or the pivot. This makes near-duplicate grouping more comprehensive, particularly during incremental runs, because a new document is more likely to be grouped with other documents that share close text similarity.
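A rough sketch of this "compare against any member" behavior is shown below. The similarity measure here (Jaccard overlap of word pairs) is an assumption made for illustration; the article does not state which measure MA uses, and the function names are hypothetical.

```python
def shingles(text: str, k: int = 2) -> set:
    """Split text into overlapping k-word tuples (an assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def assign_to_group(new_text: str, groups: list, threshold: float = 0.9) -> int:
    """Add new_text to the first group with ANY sufficiently similar member.

    Returns the index of the group the document joined. If no group matches,
    the document starts a new group of its own.
    """
    for index, members in enumerate(groups):
        if any(similarity(new_text, member) >= threshold for member in members):
            members.append(new_text)
            return index
    groups.append([new_text])
    return len(groups) - 1
```

In this sketch, a document added during an incremental run joins an existing group as soon as one member is similar enough, even if it falls below the threshold against the group's other members.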

Near Duplicate Pivot

Once an ND group is established, MA chooses the document with the median size as the pivot, as this tends to be the document that shares the most similarities with all other documents in the group. This pivot assignment does not change once assigned, even in the case of an incremental run on a Near Duplicate set.
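As an illustration of median-size pivot selection, consider the sketch below. It assumes document size is known per document ID and, for groups with an even number of documents, picks the lower of the two middle documents; that tie-breaking rule and the function name are assumptions, not documented behavior.

```python
def choose_pivot(doc_sizes: dict) -> str:
    """Return the ID of the document whose size is the median of the group.

    doc_sizes maps document ID -> size. For an even-sized group the lower
    middle document is used (an assumption made for this sketch).
    """
    # Sort by size, using the ID as a tie-breaker so the result is stable.
    ordered = sorted(doc_sizes, key=lambda doc_id: (doc_sizes[doc_id], doc_id))
    return ordered[(len(ordered) - 1) // 2]
```

For a group with sizes A=100, B=300, C=200, this selects C. Because the pivot does not change once assigned, a real implementation would call such a function only when the group is first established.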

What is Text Duplicate Identification?

Text duplicates are documents that have been identified and grouped because their text matches, even if the native documents themselves would not hash identically due to native file formatting differences or white space. This makes text duplicates more reliable for coding propagation and for isolating pivots, so that reviewing the pivots covers all of the unique information a text duplicate group has to offer. Each document belongs to only one text duplicate set, and each text duplicate set contains at least one document. Unlike the settings for near-duplicate identification, Text Duplicate settings do not require any decisions regarding similarity or file comparison.

There are two primary features in text duplicate identification that are important when reducing review volumes to focus only on text duplicate pivot documents:

  • Text Duplicate Pivot
  • Text Duplicate Group

Text Duplicate Group

MA uses a text comparison methodology that identifies and groups documents whose text matches 100% once formatting and white space are excluded. This enables the grouping of documents that are otherwise exact duplicates, even if the native files' extracted text would not match under a hash algorithm such as MD5.
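One way to picture this kind of grouping is to normalize the extracted text before hashing it, as in the sketch below. The normalization rule (removing all white space) and the use of SHA-256 are assumptions for illustration; the article does not describe MA's internal comparison, and the function names are hypothetical.

```python
import hashlib
import re
from collections import defaultdict


def text_key(extracted_text: str) -> str:
    """Hash the text after removing all white space (assumed normalization)."""
    normalized = re.sub(r"\s+", "", extracted_text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_text_duplicates(docs: dict) -> list:
    """Group document IDs whose normalized text is identical.

    docs maps document ID -> extracted text. Groups and their members keep
    the order in which documents were encountered.
    """
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[text_key(text)].append(doc_id)
    return list(groups.values())
```

Two documents whose extracted text differs only in spacing or line breaks would produce different hashes of the raw text, but they share the same key here and land in the same group.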

Text Duplicate Pivot

Once a TD group is established, the first document encountered from the TD group is assigned the role of the pivot. This pivot assignment does not change once assigned, even in the case of an incremental run on a Text Duplicate set.
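The pivot rule above can be sketched as a small function that assigns pivots only to groups that do not have one yet. The data shapes and the function name are hypothetical; the sketch only illustrates "first document encountered, never reassigned".

```python
def assign_pivots(groups: dict, existing_pivots: dict) -> dict:
    """Return pivots for all groups, keeping any pivot that already exists.

    groups maps group ID -> document IDs in the order they were encountered.
    existing_pivots maps group ID -> pivot document ID from earlier runs.
    """
    pivots = dict(existing_pivots)  # copy so the caller's dict is untouched
    for group_id, doc_ids in groups.items():
        if group_id not in pivots and doc_ids:
            # New group: the first document encountered becomes the pivot.
            pivots[group_id] = doc_ids[0]
    return pivots
```

Calling this again after an incremental run, with the earlier result passed as existing_pivots, leaves established pivots unchanged and assigns pivots only to newly created groups.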

WHO CAN PERFORM:

You must have the Create and manage sets and Overlay results permissions to create and run sets and to overlay the results into Relativity.

Near Duplicate (ND) Settings

Automatically Overlay Results

When the set completes, the overlay Near Duplicate step is automatically added to the queue. Choose to auto-overlay only when you already have confirmation or permission from the client to use the ND results. This can save time when a job finishes after hours.

Restore Defaults icon

Rolls back any setting changes made for Near Duplicate.

Median Similarity Threshold

Specify the similarity threshold for near-duplicate analysis. Documents whose similarity falls within this percentage are grouped together as near duplicates. The default is 90%.

Exclude smaller than

Excludes small documents. After you turn on Exclude smaller than, enter the size of the documents to be excluded. The default is 0 (zero). Select whether the size is in megabytes (MB) or kilobytes (KB).

Exclude larger than

Excludes very large documents. After you turn on Exclude larger than, enter the size of the documents to be excluded. Select whether the size is in megabytes (MB) or kilobytes (KB). The default is 16 MB.

Exclude emails

When toggled on, excludes emails from duplicate analysis. This setting is disabled by default.

Exclude Attachments

When toggled on, excludes attachments from duplicate analysis. This setting is disabled by default.
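Taken together, the four exclusion settings act as a filter on which documents enter the analysis. The sketch below models that filter under stated assumptions: the class and function names are hypothetical, a limit of None stands for a toggle that is turned off, KB and MB are treated as binary units, and the size comparisons are strict. None of these details are confirmed by this article.

```python
from dataclasses import dataclass
from typing import Optional

KB = 1024        # assumption: binary units
MB = 1024 * KB


@dataclass
class Exclusions:
    """Hypothetical model of the exclusion settings; None means toggled off."""
    smaller_than_bytes: Optional[int] = None
    larger_than_bytes: Optional[int] = None
    exclude_emails: bool = False
    exclude_attachments: bool = False


def is_excluded(size_bytes: int, is_email: bool, is_attachment: bool,
                settings: Exclusions) -> bool:
    """Return True if the document should be left out of duplicate analysis."""
    if (settings.smaller_than_bytes is not None
            and size_bytes < settings.smaller_than_bytes):
        return True
    if (settings.larger_than_bytes is not None
            and size_bytes > settings.larger_than_bytes):
        return True
    if settings.exclude_emails and is_email:
        return True
    if settings.exclude_attachments and is_attachment:
        return True
    return False
```

For example, with Exclusions(larger_than_bytes=16 * MB), a 17 MB document would be skipped while a 5 MB email attachment would still be analyzed.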

Text Duplicate (TD) Settings

Automatically Overlay Results

When the set completes, the overlay Text Duplicate step is automatically added to the queue. Choose to auto-overlay only when you already have confirmation or permission from the client to use the TD results. This can save time when a job finishes after hours.

Restore Defaults

Rolls back any setting changes made for Text Duplicate.
