Understanding Email Threading

This article explains how the Email Threading algorithm works within Lighthouse Analytics and defines the key terms needed to make informed decisions when configuring Email Threading settings.

What is Email Threading?

An "email thread" refers to a group of emails and attachments that belong to the same conversation. The Email Threading algorithm groups these documents into threads and identifies which documents contain unique content. Documents that do not contain unique content can be removed or excluded to reduce the review set.

There are three key concepts to understand in relation to email threading:

  • Inclusive Documents
  • Duplicate Spares
  • Unique Content

Email Detection

Lighthouse Analytics identifies emails by scanning for email headers in the extracted text of documents. For an email to be detected and included in the threading process, a valid "From" value must be present in the top header of the document text. If a valid header is not detected, Lighthouse Analytics can optionally rely on the presence of "From" and "Sent Date" values in mapped document metadata fields to qualify the document as an email and include it in the threading process.
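
The detection rule above can be pictured with a short sketch. This is only an illustration under stated assumptions, not the Lighthouse Analytics implementation: the "top header" is approximated as the first ten non-empty lines, a "valid" From value is taken to mean any non-empty value, and the metadata keys "from" and "sent_date" are hypothetical stand-ins for the mapped fields.

```python
import re
from typing import Mapping, Optional

# Assumption: the "top header" is approximated as the first few non-empty lines.
TOP_HEADER_LINES = 10
# Assumption: a "valid" From value is any non-empty text after "From:".
FROM_LINE = re.compile(r"^\s*From:\s*(\S.*)$", re.IGNORECASE)


def has_from_header(extracted_text: str) -> bool:
    """Return True if a non-empty 'From:' value appears in the top lines."""
    non_empty = [ln for ln in extracted_text.splitlines() if ln.strip()]
    return any(FROM_LINE.match(ln) for ln in non_empty[:TOP_HEADER_LINES])


def is_email(extracted_text: str,
             metadata: Optional[Mapping[str, str]] = None,
             use_metadata_fallback: bool = False) -> bool:
    """Qualify a document as an email for the threading process."""
    if has_from_header(extracted_text):
        return True
    if use_metadata_fallback and metadata:
        # Hypothetical keys standing in for the mapped metadata fields.
        return bool(metadata.get("from")) and bool(metadata.get("sent_date"))
    return False
```

The fallback is off by default in this sketch to mirror the article's description of it as optional.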

Unique Content

Unique Content identifies the emails and attachments that should not be excluded from a review set because they contain content that is not present elsewhere. Documents marked as having Unique Content (i.e., Unique Content = Yes) are retained in the review set, while those without unique content (i.e., Unique Content = No) can be safely removed.

Relationship Between Inclusive Documents, Duplicate Spares, and Unique Content:

Inclusive | Duplicate Spare | Unique Content | Remove from Reduced Set
Yes       | Yes             | No             | Safe to remove
Yes       | No              | Yes            | Should not be removed
No        | Yes             | No             | Safe to remove
No        | No              | No             | Safe to remove
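
Read as a rule, the four rows above reduce to a single condition: a document has Unique Content only when it is inclusive and is not a duplicate spare. A minimal sketch of that reading follows; the function names are illustrative and not part of the product.

```python
def has_unique_content(inclusive: bool, duplicate_spare: bool) -> bool:
    """Unique Content = Yes only for inclusive documents that are not spares."""
    return inclusive and not duplicate_spare


def safe_to_remove(inclusive: bool, duplicate_spare: bool) -> bool:
    """A document can leave the reduced set when it has no unique content."""
    return not has_unique_content(inclusive, duplicate_spare)
```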

Inclusive and Non-Inclusive Documents

Inclusive documents contain content not found elsewhere in the thread, making them essential to retain in a reduced review set unless they are duplicate spares of another inclusive document. Non-inclusive documents lack unique content and can be excluded, reducing the volume of documents without losing any necessary information.

Inclusive Reasons

There are six reasons why a document might be marked as inclusive by the Email Threading algorithm:

  1. Last In Time: The last email in each thread branch is marked as inclusive, as it contains the latest content of the entire branch.
  2. Has Attachment: Emails with unique attachments are marked as inclusive, ensuring that unique content within the attachments is not lost.
  3. Attachment: If an email is inclusive due to having an attachment, the attachment itself is also marked as inclusive.
  4. Inline Changes: Emails with content changes not present in the last-in-time email are marked inclusive to retain unique content.
    1. Example: An original email includes specific names and dates. A reply email edits those details. Both emails are marked as inclusive because each contains unique content not present in the other.
  5. Inferred: When the algorithm suspects an email belongs to a thread but cannot confirm it, the email is marked as inclusive for further review.
    1. Example: An email thread is forwarded, and parts of the original conversation are deleted. The algorithm infers the remaining parts belong to the original thread and includes them for review.
  6. Recipient Difference: When recipient differences exist across duplicate spares, both the pivot and the duplicate spare containing unique recipient information are marked as inclusive.
    1. Example: Two versions of the same email are sent to different CC recipients. Both versions are retained in the review set to ensure all recipient information is available.
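
As an illustration of the first reason, Last In Time can be thought of as finding the ends of the branches in a reply tree. The sketch below assumes the thread structure is already known and is given as a mapping from each email ID to the ID of the email it replies to; that representation is hypothetical, and the actual algorithm derives thread structure from the email text.

```python
from typing import Dict, Optional, Set


def last_in_time(parents: Dict[str, Optional[str]]) -> Set[str]:
    """Return the IDs of emails that end a branch.

    `parents` maps each email ID to the ID it replies to or forwards,
    or to None for the email that starts the thread.
    """
    replied_to = {parent for parent in parents.values() if parent is not None}
    # An email that nothing replies to is the last email in its branch.
    return set(parents) - replied_to


# A starts the thread; B and D reply to A; C replies to B.
thread = {"A": None, "B": "A", "C": "B", "D": "A"}
branch_ends = last_in_time(thread)  # the two branch ends are C and D
```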

Duplicate Spares

Duplicate spares are emails that contain the same content as another email in the same thread but may differ slightly in formatting or other minor aspects. One email in each duplicate grouping is marked as the pivot (i.e., Is Duplicate = No), and the rest are marked as duplicate spares (i.e., Is Duplicate = Yes). Duplicate spares can be removed without losing any unique content.

Criteria for Identifying Duplicate Spares:

  • Email From: Must match the sender in the pivot email.
  • Email Recipients: Recipients in the To, CC, and BCC fields must match those in the pivot email (if "Analyze recipient differences" is turned on).
  • Sent Date: Must match the sent date of the pivot email, with a variance allowance of 24 hours.
  • Attachments: Must match the attachments in the pivot email, compared by MD5 hash.
  • Email Body: Must match the content of the pivot email, including all message segments in the thread.
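
The criteria above can be sketched as a single comparison function. This is a simplified illustration, not the product's implementation: the Email record and its field names are hypothetical, the body is assumed to be normalized already so that formatting differences do not matter, and treating a gap of exactly 24 hours as a match is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import FrozenSet


@dataclass(frozen=True)
class Email:
    """Hypothetical record holding the fields the criteria compare."""
    sender: str
    recipients: FrozenSet[str]       # To, CC, and BCC combined
    sent: datetime
    attachment_md5s: FrozenSet[str]  # MD5 hashes of the attachments
    body: str                        # assumed already normalized


SENT_DATE_TOLERANCE = timedelta(hours=24)


def is_duplicate_of(candidate: Email, pivot: Email,
                    analyze_recipient_differences: bool = True) -> bool:
    """Return True if `candidate` meets every criterion against `pivot`."""
    if candidate.sender != pivot.sender:
        return False
    # Recipients are only compared when the setting is turned on.
    if analyze_recipient_differences and candidate.recipients != pivot.recipients:
        return False
    # Assumption: a difference of exactly 24 hours still counts as a match.
    if abs(candidate.sent - pivot.sent) > SENT_DATE_TOLERANCE:
        return False
    if candidate.attachment_md5s != pivot.attachment_md5s:
        return False
    return candidate.body == pivot.body
```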

Pivot Selection for Duplicate Groupings

The algorithm selects the email with the most comprehensive information as the pivot in a duplicate grouping. If BCC information is present in one email but not others, the email with the BCC information is selected as the pivot. If all emails in a group are identical, the email with the lowest Document Identifier/Control Number is selected as the pivot.
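
The two tie-breaking rules stated above can be sketched as a sort key. The Candidate record is hypothetical, "most comprehensive information" is reduced here to the presence of BCC data because that is the only case the article spells out, and control numbers are assumed to be fixed-width strings so that plain string comparison orders them correctly.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one email in a duplicate grouping."""
    control_number: str              # assumed fixed-width, e.g. "DOC0001"
    bcc: FrozenSet[str] = frozenset()


def select_pivot(group: List[Candidate]) -> Candidate:
    """Prefer an email that carries BCC data, then the lowest control number."""
    # `not c.bcc` is False for emails with BCC data, so they sort first.
    return min(group, key=lambda c: (not c.bcc, c.control_number))
```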

Duplicates vs. Duplicate Spares

Not all email duplicates are marked as "spares." If an email contains unique content (e.g., unique recipients, attachments, or inline changes), it is not marked as a spare and will be retained in the review set.

Email Thread Viewer

After emails are threaded and the results are imported into Relativity, the Email Thread Viewer can be used to review the threading results.

Foreign Language Support

The Email Threading algorithm relies on email headers to determine which emails belong to the same thread. The algorithm currently supports English, German, Spanish, and French email headers; emails with headers in other languages may not thread accurately.