Configuring Near and Text Duplicate Settings

IN THIS ARTICLE:

Near Duplicate (ND) identifies and groups near-duplicate documents based on the similarity of their extracted text. Each document belongs to only one ND set, and each ND set contains at least one document. Similarity parameters are configurable before the set is run. By default, the document with the median size in the set is assigned as the pivot. Text Duplicate (TD) identifies and groups documents with identical extracted text, even when the native files would not produce matching hash values.

What is Near Duplicate Identification?

Near duplicates are documents that have been identified and grouped based on their text similarity. Each document belongs to only one near-duplicate set and each near-duplicate set contains at least one document. Similarity parameters are configurable before running a Near Duplicate analytics set. 

There are two primary features to understand when working with near-duplicate documents:

  • Near Duplicate Group
  • Near Duplicate Pivot

Near Duplicate Group

MA uses a median similarity threshold. The configured percentage applies to most comparisons, but MA adjusts it when the two documents being compared are both very large or both very small, which improves grouping accuracy at those extremes. For example, if your median similarity is 80% and you are comparing two very small documents, the comparison threshold could be increased to 85%. If you are comparing two very large documents, the comparison threshold could be decreased to 75%. Most documents are therefore compared at the standard fixed percentage, with a more balanced comparison applied where needed.
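
The size-based adjustment described above can be sketched as follows. This is an illustration only: the function name, the size cutoffs (`small_cutoff_kb`, `large_cutoff_kb`), and the 5-point adjustment are assumptions taken from the 80%/85%/75% example, not documented MA values.

```python
def effective_threshold(base_pct, size_a_kb, size_b_kb,
                        small_cutoff_kb=2, large_cutoff_kb=1024,
                        adjustment_pct=5):
    """Return the similarity threshold to apply to one pair of documents.

    The cutoffs and the adjustment size are hypothetical placeholders.
    """
    if max(size_a_kb, size_b_kb) <= small_cutoff_kb:
        # Both documents are very small: require a closer match.
        return min(base_pct + adjustment_pct, 100)
    if min(size_a_kb, size_b_kb) >= large_cutoff_kb:
        # Both documents are very large: allow a slightly looser match.
        return max(base_pct - adjustment_pct, 0)
    # Typical documents use the configured threshold unchanged.
    return base_pct


print(effective_threshold(80, 1, 2))        # two very small documents -> 85
print(effective_threshold(80, 2048, 4096))  # two very large documents -> 75
print(effective_threshold(80, 50, 300))     # typical documents -> 80
```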

A document can be compared to any document in a near-duplicate group for similarity, not just to the largest document or the pivot. This makes near-duplicate grouping more comprehensive, particularly during incremental runs, because it increases the chance that a new document is grouped with other documents that share close text similarity.
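
A minimal sketch of the "compare against any member" idea is shown below. The similarity measure here (word-set Jaccard similarity) is a stand-in chosen for readability, and `find_group` is a hypothetical helper; the source does not describe how MA actually scores similarity.

```python
def jaccard(text_a, text_b):
    """Word-set Jaccard similarity as a percentage (a stand-in metric)."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 100.0
    return 100.0 * len(a & b) / len(a | b)


def find_group(new_text, groups, threshold_pct):
    """Return the index of the first group in which ANY member is similar
    enough to the new document, or None if no group qualifies."""
    for index, members in enumerate(groups):
        if any(jaccard(new_text, member) >= threshold_pct for member in members):
            return index
    return None


# One existing group; treat the first member as its pivot.
groups = [["a b c d e", "a b c d f"]]
new_doc = "a b c d f g"

print(round(jaccard(new_doc, groups[0][0]), 1))  # vs. the pivot: 57.1
print(round(jaccard(new_doc, groups[0][1]), 1))  # vs. another member: 83.3
print(find_group(new_doc, groups, 80))           # 0
```

In this example the new document falls below 80% against the first member but still joins the group because it clears the threshold against the second member.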

Near Duplicate Pivot

Once an ND group is established, MA chooses the document with the median size as the pivot, as this tends to be the document that shares the most similarities with all other documents in the group. This pivot assignment does not change once assigned, even in the case of an incremental run on a Near Duplicate set.
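
The pivot rule above can be illustrated with a short sketch. The function name, the `(doc_id, size_bytes)` tuple shape, and the tie-breaking for even-sized groups are assumptions made for the example, not documented MA behavior.

```python
def choose_pivot(group, existing_pivot=None):
    """Pick the pivot for a near-duplicate group.

    group is a list of (doc_id, size_bytes) tuples.
    """
    if existing_pivot is not None:
        # A pivot, once assigned, is kept across incremental runs.
        return existing_pivot
    ordered = sorted(group, key=lambda doc: (doc[1], doc[0]))
    # For an even-sized group, take the lower of the two middle
    # documents (an assumption; the source does not specify this).
    return ordered[(len(ordered) - 1) // 2][0]


group = [("A", 100), ("B", 500), ("C", 300)]
pivot = choose_pivot(group)
print(pivot)  # C

# An incremental run adds two larger documents; the pivot stays the same.
group += [("D", 900), ("E", 950)]
print(choose_pivot(group, existing_pivot=pivot))  # C
```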

What is Text Duplicate Identification?

Text duplicates are documents that have been identified and grouped based on matching text, even if the native documents themselves would not hash identically because of differences in native file formatting or white space. This makes text duplicates more reliable for coding propagation and for isolating pivots, so that reviewing the pivots covers all of the unique information the text duplicate groups have to offer. Each document belongs to only one text duplicate set, and each text duplicate set contains at least one document. Unlike the settings for near-duplicate identification, Text Duplicate settings do not require any decisions regarding similarity or file comparison.

There are two primary features in text duplicate identification that are important when reducing review volumes to focus only on text duplicate pivot documents:

  • Text Duplicate Pivot
  • Text Duplicate Group

Text Duplicate Group

MA uses a text comparison methodology that identifies and groups documents with 100% matching text once formatting and white space are excluded. This enables the grouping of documents that are otherwise exact duplicates, even if the native files' extracted text would not match based on a hash algorithm, such as MD5.
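
One way to picture this grouping is to normalize the text before comparing it. The sketch below is an assumption-laden illustration, not MA's implementation: it treats "excluding formatting and white space" as simply removing all white space, and the helper names are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalized_key(text):
    """Hash the text with all white space removed (a simplification)."""
    stripped = "".join(text.split())
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def group_text_duplicates(docs):
    """Group document IDs by normalized text.

    docs is an iterable of (doc_id, extracted_text) tuples.
    """
    groups = defaultdict(list)
    for doc_id, text in docs:
        groups[normalized_key(text)].append(doc_id)
    return list(groups.values())


docs = [
    ("D1", "Quarterly  report\n\nDraft"),
    ("D2", "Quarterly report Draft"),
    ("D3", "Quarterly report Final"),
]

# A plain MD5 over the raw text treats D1 and D2 as different...
raw_md5 = [hashlib.md5(text.encode("utf-8")).hexdigest() for _, text in docs]
print(raw_md5[0] == raw_md5[1])  # False

# ...but they group together once white space is ignored.
print(group_text_duplicates(docs))  # [['D1', 'D2'], ['D3']]
```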

Text Duplicate Pivot

Once a TD group is established, the first document encountered from the TD group is assigned the role of the pivot. This pivot assignment does not change once assigned, even in the case of an incremental run on a Text Duplicate set.
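
The "first document encountered" rule, and the fact that the pivot does not change on incremental runs, can be sketched as below. The `assign_pivots` helper and the group keys (`g1`, `g2`, `g3`) are hypothetical names used only for illustration.

```python
def assign_pivots(pivots, docs_with_group):
    """Record a pivot for each text duplicate group.

    pivots maps a group key to its pivot document ID and is updated in place.
    docs_with_group is an iterable of (doc_id, group_key) in processing order.
    """
    for doc_id, group_key in docs_with_group:
        # setdefault only writes when the group has no pivot yet, so the
        # first document encountered wins and is never replaced.
        pivots.setdefault(group_key, doc_id)
    return pivots


# Initial run.
pivots = assign_pivots({}, [("D1", "g1"), ("D2", "g1"), ("D3", "g2")])
print(pivots)  # {'g1': 'D1', 'g2': 'D3'}

# Incremental run: g1 keeps its pivot, the new group g3 gets one.
assign_pivots(pivots, [("D0", "g1"), ("D4", "g3")])
print(pivots)  # {'g1': 'D1', 'g2': 'D3', 'g3': 'D4'}
```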

WHO CAN PERFORM:

You must have the Create and manage sets and Overlay results permissions to create and run sets and to overlay the results into Relativity.

Near Duplicate (ND) Settings

Automatically Overlay Results

When the set completes, the overlay Near Duplicate step is automatically added to the queue. Users should choose to auto-overlay only when they already have confirmation/permission from a client to use the ND results. This can save time for cases when a job finishes after hours.

Restore Defaults icon

Rolls back any setting changes made for Near Duplicate.

Median Similarity Threshold

Specify the similarity threshold for near-duplicate analysis. Documents are placed in the same near-duplicate group when their text similarity meets this threshold. The default is 90%.

Exclude smaller than

Excludes small documents. After you turn on Exclude smaller than, enter the size of the documents to be excluded. The default is 0 (zero). Select whether the size is in megabytes (MB) or kilobytes (KB).

Exclude larger than

Excludes very large documents. After you turn on Exclude larger than, enter the size of the documents to be excluded. Select whether the size is in megabytes (MB) or kilobytes (KB). The default is 16 MB.

Exclude emails

When toggled on, excludes emails from duplicate analysis. The setting is disabled by default.

Exclude Attachments

When toggled on, excludes attachments from duplicate analysis. This setting is disabled by default.
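
Taken together, the Near Duplicate settings above can be pictured as a small configuration object plus a filter that decides which documents enter the analysis. This is a hypothetical model for illustration: the field names, the document dictionary shape, the default toggle states for the size filters, and the 1 KB = 1024 bytes conversion are all assumptions. Only the 90%, 0, and 16 MB values and the disabled email and attachment toggles come from the descriptions above.

```python
from dataclasses import dataclass

# Assumed conversion factors; the source does not define them.
UNIT_BYTES = {"KB": 1024, "MB": 1024 * 1024}


@dataclass
class NearDuplicateSettings:
    median_similarity_threshold: int = 90  # documented default
    exclude_smaller_than: bool = False     # toggle state is an assumption
    smaller_than_size: float = 0           # documented default
    smaller_than_unit: str = "KB"          # default unit is an assumption
    exclude_larger_than: bool = False      # toggle state is an assumption
    larger_than_size: float = 16           # documented default
    larger_than_unit: str = "MB"
    exclude_emails: bool = False           # documented default
    exclude_attachments: bool = False      # documented default


def is_included(doc, settings):
    """Return True if the document should be analyzed.

    doc is a dict with 'size_bytes' and 'kind'
    ('email', 'attachment', or 'other').
    """
    size = doc["size_bytes"]
    if settings.exclude_smaller_than:
        if size < settings.smaller_than_size * UNIT_BYTES[settings.smaller_than_unit]:
            return False
    if settings.exclude_larger_than:
        if size > settings.larger_than_size * UNIT_BYTES[settings.larger_than_unit]:
            return False
    if settings.exclude_emails and doc["kind"] == "email":
        return False
    if settings.exclude_attachments and doc["kind"] == "attachment":
        return False
    return True


settings = NearDuplicateSettings(exclude_larger_than=True, exclude_emails=True)
print(is_included({"size_bytes": 20 * 1024 * 1024, "kind": "other"}, settings))  # False
print(is_included({"size_bytes": 1000, "kind": "email"}, settings))              # False
print(is_included({"size_bytes": 1000, "kind": "other"}, settings))              # True
```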

Text Duplicate (TD) Settings

Automatically Overlay Results

When the set completes, the overlay Text Duplicate step is automatically added to the queue. You should choose to auto-overlay only when you already have confirmation or permission from a client to use the TD results. This can save time when a job finishes after hours.

Restore Defaults

Rolls back any setting changes made for Text Duplicate.


IN THIS ARTICLE:

The Near Duplicate (ND) and Text Duplicate (TD) algorithms identify and group documents based on text similarity. Near Duplicate identifies and groups near-duplicate documents, with similarity parameters configurable before running. Text Duplicate identifies and groups documents with identical extracted text, even when the documents differ in file format or other non-textual elements. This article details the settings available for configuring both algorithms to optimize document analysis.

WHO CAN PERFORM:

Users with the Create and manage sets permission can create and edit analytics sets. The Overlay results permission allows users to overlay results into Relativity. When the Overlay results permission is disabled, the auto-overlay toggle and the manual overlay buttons are disabled as well.

Near Duplicate Identification Overview

Near duplicates are documents that have been grouped based on their text similarity. Each document belongs to only one near-duplicate set, and each set contains at least one document. There are two primary features in near-duplicate identification:

  • Near Duplicate Group: Documents are grouped based on a median similarity threshold, which determines whether a document is included in a near-duplicate group. The user configures this threshold before running the analysis.
    • Dynamic Adjustment: While the similarity threshold is set as a fixed percentage, the system dynamically adjusts this percentage when comparing documents of extreme sizes. This adjustment enhances the accuracy of grouping by increasing the threshold for small documents and decreasing it for large documents.
    • Comprehensive Comparison: A document within a near-duplicate group can be compared to any other document in the group, not just the largest document or the pivot. This approach ensures a more thorough comparison, particularly during incremental runs, increasing the likelihood that new documents will be accurately grouped with similar documents.
  • Near Duplicate Pivot: The document with the median size in the ND group is assigned as the pivot. This document typically shares the most similarities with the other documents in the group.
    • Pivot Stability: The pivot assignment remains consistent, even when new documents are added to the set or the set is re-run as part of an incremental analysis. This stability helps maintain the integrity of the groupings and ensures that comparisons remain consistent over time.

Near Duplicate (ND) Settings

Here’s a breakdown of the settings available when configuring the Near Duplicate algorithm:

  • Automatically Overlay Results: When the set completes, the overlay Near Duplicate step is automatically added to the queue. This can save time, especially when a job finishes after hours. If users are concerned about the number of fields being added to the Relativity workspace, they may want to turn this option off.
  • Reset: Rolls back any changes made to the Near Duplicate settings, restoring them to their original default values.
  • Median Similarity Threshold: Specify the similarity threshold for near-duplicate analysis. Documents are placed in the same near-duplicate group when their text similarity meets this threshold. The default is 90%.
  • Exclude Smaller Than: Excludes small documents from the analysis. After enabling this option, enter the size of the documents to be excluded. The default is 0 (zero). Select whether the size should be in megabytes (MB) or kilobytes (KB).
  • Exclude Larger Than: Excludes very large documents from the analysis. After enabling this option, enter the size of the documents to be excluded. Select whether the size should be in megabytes (MB) or kilobytes (KB). The default is 16 MB.
  • Exclude Emails: When toggled on, excludes emails from duplicate analysis. This setting is disabled by default.
  • Exclude Attachments: When toggled on, excludes attachments from duplicate analysis. This setting is disabled by default.

Text Duplicate Identification Overview

Text duplicates are documents grouped based on 100% matching text, excluding formatting and white space. This ensures grouping even when native file formats differ. Text duplicates are useful for coding propagation and isolating pivot documents for review. Each document belongs to only one text duplicate set.

  • Text Duplicate Group: Documents are grouped based on identical text after excluding non-textual elements, such as white space and formatting. This enables the grouping of documents that are otherwise exact duplicates, even if the native files' extracted text would not match based on a hash algorithm, such as MD5.
  • Text Duplicate Pivot: The first document encountered in a TD group is assigned as the pivot, remaining unchanged even in incremental runs.

Text Duplicate (TD) Settings

Here’s a breakdown of the settings available when configuring the Text Duplicate algorithm:

  • Automatically Overlay Results: When the set completes, the overlay Text Duplicate step is automatically added to the queue. This can save time, especially when a job finishes after hours. If users are concerned about the number of fields being added to the Relativity workspace, they may want to turn this option off.
  • Reset: Rolls back any changes made to the Text Duplicate settings, restoring them to their original default values.
  • Exclude Smaller Than: Excludes small documents from the analysis. After enabling this option, enter the size of the documents to be excluded. The default is 0 (zero). Select whether the size should be in megabytes (MB) or kilobytes (KB).
  • Exclude Larger Than: Excludes very large documents from the analysis. After enabling this option, enter the size of the documents to be excluded. Select whether the size should be in megabytes (MB) or kilobytes (KB). The default is 16 MB.