Inclusive and Duplicate Spares Explained

Email Threading Details explains how email threading works and includes definitions that help you decide on email threading settings. An “email thread” is a group of emails and attachments that all belong to the same “email chain” or “email conversation.” MA’s email threading algorithm groups documents into email threads and identifies which documents contain unique content. Documents that do not contain unique content can be removed or excluded.

MA first identifies email documents in the threading set by scanning for email headers in the extracted text. To detect an email and submit it to the threading process, threading requires a valid From value in the top header of the document text. If a valid header is not detected, MA can optionally rely on From and Sent date values in mapped document metadata fields to qualify the document as an email message and submit it to the threading process. See your MA Administrator about using the Configuration Setup feature to set up metadata fields.

Three key concepts are central to email threading: Inclusive Documents, Duplicate Spares, and Unique Content.
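The email detection step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not MA's actual implementation: the function name `qualifies_as_email`, the regular expression, the 10-line header window, and the metadata keys `from` and `sent_date` are all hypothetical.

```python
import re
from typing import Mapping, Optional

# Hypothetical pattern: a "From:" line that carries a non-empty value.
FROM_HEADER = re.compile(r"^[ \t]*From:[ \t]*\S", re.IGNORECASE | re.MULTILINE)


def qualifies_as_email(extracted_text: str,
                       metadata: Optional[Mapping[str, str]] = None,
                       header_window: int = 10) -> bool:
    """Return True if the document would be submitted to threading in this sketch."""
    # Step 1: look for a From value in the top header of the extracted text.
    top = "\n".join(extracted_text.splitlines()[:header_window])
    if FROM_HEADER.search(top):
        return True
    # Step 2 (optional fallback): rely on mapped From and Sent date metadata.
    if metadata:
        return bool(metadata.get("from")) and bool(metadata.get("sent_date"))
    return False
```

In this sketch, a document with no header and no mapped metadata is not treated as an email, which matches the fallback order described above.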

Unique content

Unique content refers to content in a document, email, or attachment that would be lost if that document were excluded from a document set. Documents that are inclusive for one or more of the six reasons listed below and that are not duplicate spares are treated as documents with unique content. Documents where the Unique Content value is ‘Yes’ should not be removed from a reduced set. Documents where the Unique Content value is ‘No’ are safe to remove from a reduced set. The following table shows the relationships between inclusive documents, duplicate spares, unique content, and removing a document from a reduced set.

Inclusive | Yes, a Duplicate Spare | No, Not a Duplicate Spare
Yes | Unique Content = No. Safe to remove or exclude from a reduced set. | Unique Content = Yes. Should not be removed or excluded from a reduced set.
No | Unique Content = No. Safe to remove or exclude from a reduced set. | Unique Content = No. Safe to remove or exclude from a reduced set.
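The table reduces to a single rule: a document has unique content only when it is inclusive and is not a duplicate spare. A minimal Python sketch of that rule follows. The function names and the dictionary keys `inclusive` and `duplicate_spare` are hypothetical and are not MA field names.

```python
from typing import Dict, List


def has_unique_content(is_inclusive: bool, is_duplicate_spare: bool) -> bool:
    """Unique Content = Yes only for inclusive documents that are not duplicate spares."""
    return is_inclusive and not is_duplicate_spare


def reduced_set(docs: List[Dict]) -> List[Dict]:
    """Keep only the documents that should not be removed from a reduced set."""
    return [d for d in docs
            if has_unique_content(d["inclusive"], d["duplicate_spare"])]
```

Applied to the four cells of the table, only the combination Inclusive = Yes and Duplicate Spare = No survives in the reduced set.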

Inclusive and non-inclusive documents

Inclusive relates to a document’s location in an email thread in relation to other documents elsewhere in the same thread. An inclusive document, either email or attachment, contains unique content not present in documents located elsewhere in the same thread.

Non-inclusive documents do not contain unique content, so they do not need to be included in a reduced set. Removing non-inclusive documents from a document set reduces the volume without losing unique content.

Inclusive reasons

There are six reasons a document can be identified as inclusive by MA’s email threading. These reasons are not mutually exclusive, and a single document may be inclusive for more than one of the reasons listed below:

  1. Last In Time: Each email thread contains at least one “branch,” and the last email in each branch (the email at the “end” or “top” of a branch) is marked with the inclusive reason 'Last In Time.' Last-in-time emails contain the content of the last email in the branch and the content of all the emails below in the same branch. This allows many (but not all) of the emails “down the branch” to be marked as non-inclusive and excluded from a reduced set. Every email thread group has at least one document marked as a last-in-time inclusive email.
  2. Has Attachment: Email threads are analyzed to identify unique attachments based on a file’s mapped MD5 hash. Emails containing a unique attachment are marked with the inclusive reason 'Has Attachment.' In most cases, emails with attachments are marked as inclusive, regardless of their location in the branch. Although the content of an email that is not last-in-time is usually present in the last-in-time email, the content of this email’s attachment is usually not present in the last-in-time email. Marking emails with attachments as inclusive helps ensure the unique content in the attachments is not lost in a reduced set. Sometimes, different emails in the same thread contain the same attachment. In these cases, only the latest instance of the email with this attachment is marked inclusive for ‘Has Attachment.’ Example 1: Attachment A is forwarded. Attachment A now exists twice in the same email branch: first attached to Email 1, and then again attached to Email 2 after it is forwarded. Email 1 will not be marked inclusive for ‘Has Attachment.’ However, Email 2 will be marked inclusive for ‘Has Attachment.’ Example 2: An email thread contains two branches. Two emails in this thread are marked inclusive for ‘Last In Time’: Email 3 and Email 4. Although Emails 3 and 4 occur at the end of two separate branches, they both contain the exact same attachment: Attachment B. Email 3 will be marked inclusive for ‘Has Attachment’ in addition to ‘Last In Time,’ but Email 4 will only be marked inclusive for ‘Last In Time.’ Because Attachment B exists twice in the same email thread, it will only be marked inclusive in one instance.
  3. Attachment: If an email is marked inclusive for ‘Has Attachment’ (see above), the unique attachments that triggered this inclusive reason are marked with the inclusive reason ‘Attachment.’ In most cases, if an email is inclusive for ‘Has Attachment,’ all of its attachments will also be inclusive for ‘Attachment.’ There are some cases where an email can be inclusive for ‘Has Attachment,’ but not all of its attachments are inclusive for ‘Attachment.’ In other cases, an email can be inclusive for reasons other than ‘Has Attachment,’ and its attachments are not inclusive. Example 1: Email 1 contains Attachments A and B. When Email 1 is forwarded, Attachment A is removed, so Email 2 contains only Attachment B. Email 1 is inclusive for ‘Has Attachment,’ but only one of its attachments (Attachment A) is inclusive for ‘Attachment’; Attachment B is non-inclusive on Email 1 because a later instance exists on Email 2. Email 2 is inclusive for ‘Last In Time’ and ‘Has Attachment,’ and its attachment (Attachment B) is inclusive for ‘Attachment.’ Example 2: An email thread contains two branches, and the two last-in-time emails contain the exact same attachment (see Example 2 above for ‘Has Attachment’). Email 3 in this thread is marked inclusive for both ‘Last In Time’ and ‘Has Attachment,’ and Email 3’s attachment is marked inclusive for ‘Attachment.’ Email 4 in this thread is only marked inclusive for ‘Last In Time,’ and Email 4’s attachment is marked non-inclusive.
  4. Inline Changes: In some situations, an email earlier in the branch contains content that no longer exists in the last-in-time email in the branch. These emails are marked with the inclusive reason 'Inline Changes.' This helps ensure their unique content is not lost in a reduced set. Example 1: James sends an email to Carla. In the body of the email, James includes language for a press release, and above this language, he writes: “Can you review the draft press release below and approve it before we send it out?” Carla reads the email and replies: “Overall the press release language looks good, but I made some edits to the product names and release dates. Please see my edits below in red.” Carla then makes changes to the press release text in the original email sent by James, and she sends this reply back to James. Although Carla’s reply is the last-in-time email and is marked inclusive, this reply does not contain the original product names and the original release dates in the first-in-time email. Therefore, the first-in-time email is also marked as inclusive because it contains content not present in the last-in-time email due to “inline changes.” Example 2: An email is sent with four attachments. In the first-in-time email, the four attachment names are in the extracted text at the bottom of the email. Robert replies to this email. Robert’s reply is the last-in-time email and is marked inclusive. However, due to formatting differences in the extracted text, the four attachment names at the bottom of the branch are no longer present in the extracted text of the last-in-time reply. Therefore, the first-in-time email is marked as inclusive because it contains content not present in the last-in-time email due to “inline changes.”
  5. Inferred: In some situations, it may be difficult for MA email threading to determine whether or not an email belongs to a thread. The email format may not match the expected format of the thread, but other aspects of the email may strongly suggest it belongs to the thread. In these situations, the email is included as part of the thread, even though the inclusion may be somewhat of a “best guess.” These emails are marked with the inclusive reason 'Inferred' so that they remain in a reduced set, and a reviewer can assess whether or not the inferred match is accurate. Example: Carla forwards an email conversation to Robert. This email conversation is a long branch with eight different email segments/replies. Carla wants Robert to see the last three email replies but not the first five emails below. So she deletes the first five emails in the branch before forwarding the email conversation to Robert. This email no longer conforms to expected threading formats, but due to similarities in the top three email replies, it is matched with the thread that contains eight different email segments/replies. This email is marked with the inclusive reason 'Inferred.'
  6. Recipient Difference: When the “Analyze recipient differences” setting is set to On in the email threading settings, the algorithm compares recipient differences in the To, CC, and BCC fields across a duplicate spare grouping. If differences are identified across the duplicate spare set that would cause recipient information to be obscured in the selected Duplicate Spare pivot, the pivot and the Duplicate Spare containing the otherwise obscured information are both marked with the inclusive reason 'Recipient Difference.' Example: George sends an email to two recipients: Marianne in the To field and John in the CC field. A second version of this email is identified as a duplicate by the email threading algorithm. The second version is nearly identical, except that Adie is in the CC field instead of John. Both emails will be marked inclusive for 'Recipient Difference,' and both will be marked No for 'Is Duplicate.'
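The “latest instance” rule behind ‘Has Attachment’ and ‘Attachment’ (reasons 2 and 3 above) can be sketched in Python. This is an illustrative sketch, not MA's implementation: the function name, the tuple layout, and the integer `sent_order` are hypothetical. The sketch also ignores how MA chooses between equally late emails on different branches, which this article does not specify.

```python
from typing import Dict, List, Set, Tuple

# Each email is (email_id, sent_order, attachment_md5_hashes); layout is hypothetical.
ThreadEmail = Tuple[str, int, List[str]]


def mark_attachment_inclusives(emails: List[ThreadEmail]) -> Dict[str, Set[str]]:
    """Map each email to the attachment hashes that are inclusive on that email.

    An email with a non-empty set would be inclusive for 'Has Attachment';
    the hashes in the set would be inclusive for 'Attachment'.
    """
    latest: Dict[str, Tuple[int, str]] = {}  # hash -> (sent_order, email_id)
    for email_id, sent_order, hashes in emails:
        for h in hashes:
            # Keep only the latest email that carries this attachment hash.
            if h not in latest or sent_order > latest[h][0]:
                latest[h] = (sent_order, email_id)

    result: Dict[str, Set[str]] = {email_id: set() for email_id, _, _ in emails}
    for h, (_, email_id) in latest.items():
        result[email_id].add(h)
    return result


# Attachment A is attached to Email 1 and again to the forwarded Email 2.
thread = [("Email 1", 1, ["hashA"]), ("Email 2", 2, ["hashA"])]
marks = mark_attachment_inclusives(thread)
```

Here `marks["Email 1"]` is empty and `marks["Email 2"]` contains `hashA`, so only Email 2 would be inclusive for ‘Has Attachment.’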

Duplicate Spares

“Duplicate spares” are emails that contain the same content and are located in the same thread location as another email. They are not necessarily exact duplicates (i.e., MD5-hash matches) because they may contain slight differences in formatting and white space. However, for the most part, their text is duplicative. When two or more emails are identified as belonging to the same duplicate grouping, one email in the group is identified as the pivot (i.e., Is Duplicate=No), and in most cases, all other emails in the group are marked as duplicate spares (Is Duplicate=Yes). Duplicate spares can be removed or “suppressed” without losing any unique content. The following properties are used to determine whether an email is marked as a Duplicate Spare:

  • Email From: The sender in the duplicate spare must match the sender in the pivot email. Senders are matched based on MA’s name normalization output. This allows for variations in the same sender’s name and/or email address across different documents.
  • Email Recipients: When the “Analyze recipient differences” setting is set to On in the email threading settings, all recipients in the To, CC, and BCC fields in the duplicate spare must be present in the To, CC, and BCC fields in the pivot email. If “Analyze recipient differences” is set to Off, recipients in the To and CC fields are ignored during duplicate spare analysis. Recipients are matched based on MA’s name normalization output. This allows for variations in the same recipient’s name and/or email address across different documents.
  • Sent Date: The sent date in the duplicate spare must match the sent date in the pivot email, allowing for a time variance of 24 hours to account for different time zones.
  • Attachments: All attachments in the duplicate spare must be present and match the attachments in the pivot email. Attachment comparison and matching is based on the MD5-Hash field mapped in the Configuration Setup screen.
  • Email Body: The duplicate spare's email body must match the pivot email body. This includes the content in the top-level message and the content in any message segments down the branch.
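The five properties above can be read as a single check of a candidate email against the pivot. The following Python sketch illustrates that reading under several assumptions: the `Email` class and its field names are hypothetical, sender and recipient values are assumed to be already name-normalized, To, CC, and BCC are combined into one set, and the body comparison simply collapses white space. With the recipient setting off, the sketch skips the recipient check entirely; the article states only that To and CC are ignored in that case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet


@dataclass(frozen=True)
class Email:
    sender: str                        # assumed already name-normalized
    recipients: FrozenSet[str]         # To, CC, and BCC combined (simplification)
    sent: datetime
    attachment_hashes: FrozenSet[str]  # MD5 hashes of the attachments
    body: str


def _normalize_body(text: str) -> str:
    # Assumption: tolerate white-space differences by collapsing them.
    return " ".join(text.split())


def is_duplicate_spare(candidate: Email, pivot: Email,
                       analyze_recipient_differences: bool = True) -> bool:
    if candidate.sender != pivot.sender:
        return False
    if analyze_recipient_differences and not candidate.recipients <= pivot.recipients:
        return False
    # Sent dates may differ by up to 24 hours to allow for time zones.
    if abs(candidate.sent - pivot.sent) > timedelta(hours=24):
        return False
    # Every attachment on the candidate must also be on the pivot.
    if not candidate.attachment_hashes <= pivot.attachment_hashes:
        return False
    return _normalize_body(candidate.body) == _normalize_body(pivot.body)
```

The subset checks (`<=`) reflect the requirement that everything on the spare must also be present on the pivot, so removing the spare loses nothing.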

Pivot selection for duplicate groupings

To determine which email in a duplicate grouping should be the pivot, the email threading algorithm looks for the email with the most comprehensive information. In some cases, a duplicate spare email may contain some but not all of the information present in the pivot. In most situations, this is due to differences in the BCC field.

BCC Consideration: If an email in a duplicate grouping contains BCC information, this email will be selected as the pivot. The other emails in the duplicate grouping without BCC information (or with the same BCC information as the pivot) will be marked as duplicate spares (assuming they meet all the other duplicate spare criteria).

In other situations, the algorithm may detect differences in the To or CC fields (when the “Analyze recipient differences” setting is set to On) or differences in the attachments. If a duplicate email contains a subset of the recipients or attachments present in the pivot, it will be marked as a duplicate spare. The pivot will always contain all the recipient information and all the attachments present in all the duplicate spares, ensuring that no unique content is lost if duplicate spares are removed or suppressed. If all information is identical across all emails in a duplicate group, the email with the lowest Document Identifier/Control Number will be selected as the pivot.
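One way to picture pivot selection is as a ranking over the duplicate group. The Python sketch below is an assumption-laden illustration, not MA's algorithm: the `DupEmail` type and its fields are hypothetical, counts of recipients and attachments stand in for “most comprehensive information,” and control numbers are compared as plain strings.

```python
from typing import FrozenSet, List, NamedTuple


class DupEmail(NamedTuple):
    control_number: str
    to_cc: FrozenSet[str]
    bcc: FrozenSet[str]
    attachment_hashes: FrozenSet[str]


def select_pivot(group: List[DupEmail]) -> DupEmail:
    """Pick a pivot: BCC first, then most recipients and attachments, then lowest control number."""
    return min(
        group,
        key=lambda e: (
            not e.bcc,                  # emails with BCC information rank first
            -len(e.to_cc | e.bcc),      # more recipient information ranks higher
            -len(e.attachment_hashes),  # more attachments rank higher
            e.control_number,           # tie-break: lowest control number
        ),
    )
```

A real implementation would presumably verify that the pivot's recipients and attachments are a superset of every spare's, as described above; counting is only a proxy used to keep the sketch short.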

Duplicates vs. duplicate spares

Not all email duplicates are “spares.” A duplicate spare is always marked as “Is Duplicate=Yes,” and the pivot in the duplicate group is marked as “Is Duplicate=No.” However, in some situations, more than one email may be marked as “Is Duplicate=No” in the same email duplicate group. If duplicate emails in a grouping contain unique content not present in the pivot (unique recipients, unique attachments, or minor text differences due to “inline changes”), these emails will not be marked as spares. Instead, they will be marked as inclusive for one or more of the following reasons: Recipient Difference, Has Attachment, or Inline Changes.

Email Thread Viewer

After emails are threaded and the results are overlaid into Relativity, use the MA Thread Viewer to review the results.

Foreign language support

MA Email Threading relies on the email headers to determine which emails belong to the same thread. If the email headers are in a foreign language, this may hinder the algorithm's ability to create accurate results. MA email threading currently supports English, German, Spanish, and French email headers.