General FAQs

The how-to articles that users request most often are listed below.

General Information and Definitions:

Is there an LHA glossary?
- Yes! One of the quickest ways to acquaint yourself with new software is to review an available glossary. The glossary article contains a table of Terms followed by a table of Acronyms that describe potentially unfamiliar terms.
Is there an icon legend for LHA?
- Lighthouse Analytics uses a plethora of icons so we created an Icon Legend to explain each icon's purpose. Become a power user in no time by consulting the Icon Legend article in the Knowledge Base.

LHA Analytic Sets and Outputs:

What are LHA ID numbers?
- System-generated ID numbers are assigned after a set has been run. There are two ID types:
  - Set ID: Every full or partial set run is given a unique ID by the system.
  - Algorithm ID: A set's algorithm identification number, which is used in the results and reports. The ID format for sub-algorithms is <SUB-ALGORITHM CODE>-<Set ID#>.
- IDs are unique to the set and unique within the Relativity workspace. Here is the list of sub-algorithm codes:
  - ET for Email Threading
  - TD for Text Duplicate
  - ND for Near Duplicate
  - NN for Name Normalization and Entity Analysis
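As a rough illustration of the ID format described above, the sketch below builds and splits an ID of the form <SUB-ALGORITHM CODE>-<Set ID#>. The function names are hypothetical and are not part of LHA, and the sketch assumes the Set ID# is a plain integer, which the article does not state.

```python
# Hypothetical helpers for illustration only; LHA does not expose these functions.
SUB_ALGORITHM_CODES = {
    "ET": "Email Threading",
    "TD": "Text Duplicate",
    "ND": "Near Duplicate",
    "NN": "Name Normalization and Entity Analysis",
}


def build_algorithm_id(code: str, set_id: int) -> str:
    """Join a sub-algorithm code and a set ID as <CODE>-<Set ID#>."""
    if code not in SUB_ALGORITHM_CODES:
        raise ValueError(f"unknown sub-algorithm code: {code}")
    return f"{code}-{set_id}"


def parse_algorithm_id(algorithm_id: str) -> tuple[str, int]:
    """Split <CODE>-<Set ID#> into its code and set ID (assumed to be numeric)."""
    code, _, set_part = algorithm_id.partition("-")
    if code not in SUB_ALGORITHM_CODES or not set_part.isdigit():
        raise ValueError(f"not a recognized ID: {algorithm_id}")
    return code, int(set_part)
```

For example, under these assumptions build_algorithm_id("ET", 1042) yields "ET-1042", and parse_algorithm_id("ND-7") yields ("ND", 7).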
What are the four LHA field types?
- Metadata fields: Metadata fields contain information about the document in its original form, such as extension, size, and other relevant details.
- Organizational fields: Organizational fields are assigned to a document for organizational purposes rather than being inherent to the original file; they include the fields generated by LHA during a run. For example, Custodian is not inherent to the original file but is assigned to the document for purposes such as tracking the number of custodians and listing them in a report. All fields, whether metadata or organizational, apply at the document level.
- Relational fields: Relational fields consist of Group IDs that facilitate the grouping of multiple documents based on a shared ID. An illustration of a relational field is "MA Normalized BCC."
- Reflexive fields: Reflexive fields are established when an "object" field is populated with references to other objects. For instance, if the results group field directs to a specific row containing all results of that group, these can be referenced and mirrored as "reflexive" fields. Such fields are identified by a double colon.
Where can I find information about exceptions?
- You can find information about exceptions on the Set Listing page. Click on Completed under the Status column, and a popup will display the Job History Details. Any exceptions will be listed in the Exception column.
Can I view and analyze documents ignored or skipped during an analytics set run?
- Once the results have been overlaid, the numbers in the report for errors, ignored, or skipped documents become clickable. One recommended workflow is to click the number in the report and tag the documents in the resulting saved search, which moves them into a workflow separate from the one for successfully analyzed documents.

Email Threading Functionality & Settings:

What is Email Threading?
- Email threading refers to the process of grouping related emails together. LHA's algorithm is designed to identify unique content, which aids in the organization and streamlining of email chains. This algorithm is capable of recognizing inclusive documents, eliminating duplicate emails, and pinpointing unique content to enhance the efficiency of email analysis. For further insights on this topic, we recommend reading the article Inclusive and Duplicate Spares Explained.
How does LHA Email Threading handle time zones?
- When comparing emails, the threading algorithm tolerates up to 24 hours of difference in the date and time values, so the same email recorded in different time zones can still be matched.
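The 24-hour allowance can be pictured with a small sketch. This is only an illustration of the idea, not LHA's implementation; the function name is hypothetical, and treating a difference of exactly 24 hours as a match is an assumption.

```python
from datetime import datetime, timedelta

# Tolerance taken from the 24-hour figure in the answer above.
MAX_SENT_TIME_DIFFERENCE = timedelta(hours=24)


def sent_times_match(a: datetime, b: datetime) -> bool:
    """Treat two sent times as matching when they differ by 24 hours or less."""
    return abs(a - b) <= MAX_SENT_TIME_DIFFERENCE


# The same email recorded three hours apart because of a time zone shift.
original = datetime(2024, 3, 1, 17, 0)
shifted = datetime(2024, 3, 1, 20, 0)
print(sent_times_match(original, shifted))
```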
How does LHA's Email Threading handle mass email instances assigned to different custodians?
- If more than one custodian has a copy of a mass email, each with only their own email address in the 'To' field, these emails are identified as duplicate emails. If they are the last in time or at the end of a branch, they are marked Inclusive for ‘Last in Time.’ Only one of these emails is marked ‘Is Duplicate=No’ (the first email analyzed) and the other emails are marked ‘Is Duplicate=Yes.’ Currently, the difference in ‘To’ values is not accounted for.
How are replies from cell phones with incomplete headers (e.g. only "From") handled?
- The subsequent email (that contains the email body content below but not all the To, CC, and BCC info) will be marked inclusive if it is the last in time or the end of a branch (or if it contains inline changes and/or unique attachments). However, the separate document that represents the original email below (before the cell phone reply) will not be marked inclusive (assuming it does not have any inline changes or unique attachments) even though it has To, CC, and BCC information that is not present in the last in time email. When analyzing emails as being incremental pieces of the same branch and for inclusiveness, we do not consider To, CC, or BCC. This is in line with the comparison logic found in many other threading applications as well. However, if you use the thread viewer on a set with threading run over all documents in the set (not just over the inclusive/unique documents), the To, CC, and BCC information down the branch will still be preserved for review, and you will still be reviewing less content overall than if you reviewed inclusive/unique documents alone.
How are attachments accounted for in determining inclusiveness?
- Attachments are compared by MD5 hash, and if they occur more than once in an email thread, only the final instance of the attachment will be considered inclusive.
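The following sketch illustrates the idea of comparing attachments by MD5 hash and keeping only the final instance within a branch. It is a simplified illustration under assumed data shapes (a branch as an ordered list of emails with raw attachment bytes); the function names are hypothetical and not part of LHA.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 hash of an attachment as a hex string."""
    return hashlib.md5(data).hexdigest()


def latest_email_per_attachment(branch):
    """Map each attachment hash to the latest email that carries it.

    branch: list of (email_id, [attachment_bytes, ...]) ordered oldest to newest.
    """
    latest = {}
    for email_id, attachments in branch:
        for data in attachments:
            # Later emails overwrite earlier ones, so the final instance wins.
            latest[md5_hex(data)] = email_id
    return latest


# Attachment A is sent on Email 1 and forwarded again on Email 2.
branch = [("Email 1", [b"attachment A"]), ("Email 2", [b"attachment A"])]
print(latest_email_per_attachment(branch))
```

In this sketch only Email 2 remains associated with Attachment A, which mirrors the rule that only the final instance is considered inclusive.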
What is an inferred match?
- The inferred match is designed to include an email as part of a thread when some of the email segments below have been removed or truncated, and therefore, the email segments and body content do not match up against other emails in the same thread. In some cases, we have enough information to make an inferred match, and this email is then added to the thread and marked inclusive with the reason of Inferred. Currently, less than 1% of our results end up being marked inclusive for the reason of ‘Inferred.’ Sometimes, emails are marked as ‘Inferred’ when the more appropriate inclusive reason would actually be ‘Inline Changes,’ and we are still fine-tuning this logic.
Can I see Field Tag coding conflicts between visible nodes and corresponding duplicate spares or attachments?
- The Field Tag overlay shows the field tag applied to the node document only. It does not show the field tags applied to duplicate spares or attachments. To confirm the tags or any potential conflicts, you can configure the Relativity relational view to display the desired tagging fields alongside the thread set documents, or navigate to the duplicate spare or attachment documents directly. When you apply tags using the Thread Viewer, the mass edit layout allows you to apply, or selectively not apply, field tags to duplicate spares and attachments, ensuring consistent coding.
What is "Analyze Recipient Differences" in LHA's Email Threading and when should it be used?
- Analyze Recipient Differences is a setting during an Email Threading run that compares recipient variances in the To, CC, and BCC fields across a duplicate spare email grouping. This feature marks emails as inclusive for "Recipient Difference" when variations are detected in recipient information.
- Why use it?
  - Safety in Suppression: When Analyze Recipient Differences is turned on, more documents are marked as inclusive, and fewer are suppressed. This provides a safety net in suppression that is not offered by competitors.
- When to use it:
  - For Enhanced Inclusiveness: Turn on Analyze Recipient Differences to increase the inclusiveness of documents. This ensures that variations in recipient information are accounted for, reducing the risk of omitting relevant emails from analysis.
- When not to use it:
  - For Increased Suppression: If the goal is to suppress more documents, turning off Analyze Recipient Differences can lead to fewer documents being marked as inclusive. However, this may result in losing the distinction between documents sent to different recipients.
- Note: Analyze Recipient Differences is a default setting in our Email Threading tool and is not offered by competitors. While it may initially seem confusing, it provides valuable functionality for more accurate analysis.
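To make the recipient comparison more concrete, here is a minimal sketch that checks whether the emails in one duplicate spare grouping differ in their To, CC, or BCC values. The data shape and function names are hypothetical, and the trimming and lowercasing of addresses is an assumption made for the example, not documented LHA behavior.

```python
RECIPIENT_FIELDS = ("to", "cc", "bcc")


def recipient_signature(email: dict) -> tuple:
    """Collect the To, CC, and BCC addresses as order-insensitive sets."""
    return tuple(
        frozenset(address.strip().lower() for address in email.get(field, []))
        for field in RECIPIENT_FIELDS
    )


def has_recipient_differences(duplicate_spare_group: list) -> bool:
    """Return True when any two emails in the group differ in To, CC, or BCC."""
    signatures = {recipient_signature(email) for email in duplicate_spare_group}
    return len(signatures) > 1


group = [
    {"to": ["carla@example.com"]},
    {"to": ["robert@example.com"]},
]
print(has_recipient_differences(group))
```

With the setting turned off, a check like this would simply be skipped, which is why fewer documents end up marked as inclusive.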
When is content considered unique?
- Unique Content indicates whether a document (email or attachment) can be excluded from a document set to reduce the volume without losing unique content. Documents that are inclusive (for one or more of the five reasons listed under "What are the five reasons LHA might tag an email as inclusive?") and that are not duplicate spares are treated as documents with “unique content.”
- Documents where the Unique Content value is ‘Yes’ should not be removed from a reduced set. Documents where the Unique Content value is ‘No’ are safe to remove from a reduced set. The following chart represents the relationships between inclusive documents, duplicate spares, unique content, and removing a document from a reduced set.
- The following properties are used to identify Inclusive Email Documents:
  - Email Body (when an email contains a unique segment, it is considered Inclusive)
  - LHA compares the email segments (email header + email body) in one email against another email to determine if the emails belong in the same thread group and branch (and inclusiveness is determined after thread and branch structure are determined). See below for the email header fields that are used in this comparison.
  - From (field needs to contain the same normalized name)
  - SentDateTime (LHA allows for a time variance of 24 hours)
  - Attachments (LHA matches attachments using the field mapped to MD5 Hash in the Configuration Setup)
  - To, CC, BCC (all recipients) are NOT taken into consideration when determining inclusiveness for Last in Time or Inline Change (this is in line with how Relativity handles the same situation).
  - Note that in the next release, To, CC, and BCC will be taken into consideration for inclusiveness (Header Differences), but only in the top-level header for emails that are potentially duplicate spares (Relativity only compares To in this situation).
  - Email Subject is NOT taken into consideration.
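The relationship between inclusive documents, duplicate spares, and unique content described in this answer can be expressed as a small truth table. The sketch below is only an illustration of that rule; the function names are hypothetical.

```python
def has_unique_content(is_inclusive: bool, is_duplicate_spare: bool) -> bool:
    """Unique Content is 'Yes' only for inclusive documents that are not duplicate spares."""
    return is_inclusive and not is_duplicate_spare


def safe_to_remove(is_inclusive: bool, is_duplicate_spare: bool) -> bool:
    """A document is safe to remove from a reduced set when Unique Content is 'No'."""
    return not has_unique_content(is_inclusive, is_duplicate_spare)


# Print all four combinations of the two inputs.
for inclusive in (True, False):
    for spare in (True, False):
        print(inclusive, spare, has_unique_content(inclusive, spare))
```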
What are the five reasons LHA might tag an email as inclusive?
- “Inclusive” relates to a document’s location in an email thread in relation to other documents located elsewhere in the same thread. An inclusive document (email or attachment) contains unique content that is not present in documents located elsewhere in the same thread.
- Non-inclusive documents do not contain unique content, and therefore, they do not need to be “included” in a reduced set. Removing non-inclusive documents from a document set can help reduce the volume without losing unique content.
- Inclusive Reasons - There are five reasons a document can be identified as inclusive by LHA's email threading. These reasons are not mutually exclusive, and a single document may be inclusive for more than one of the five reasons listed below:
  - Last In Time: Each email thread contains at least one “branch,” and the last email in each branch (the email at the “end” or “top” of a branch) is marked as inclusive. Last-in-time emails contain the content of the last email in the branch as well as the content of all the emails below them in the same branch. This allows for many (but not all) of the emails “down the branch” to be marked as non-inclusive and excluded from a reduced set. Every email thread group has at least one document marked as a last-in-time inclusive email.
  - Has Attachment: In most cases, emails with attachments are marked as inclusive, regardless of their location in the branch. Although the content of an email that is not last-in-time is usually present in the last-in-time email, the content of this email’s attachment is usually not present in the last-in-time email. Marking emails with attachments as inclusive helps ensure the unique content in the attachments is not lost in a reduced set. In some cases, different emails in the same branch contain the same attachments. In these cases, only the latest instance of the email with this attachment will be marked inclusive for ‘Has Attachment.’ For example, Attachment A is forwarded. Attachment A now exists twice in the same email branch: first attached to Email 1, and then again attached to Email 2 after it is forwarded. Email 1 will not be marked inclusive for ‘Has Attachment.’ However, Email 2 will be marked inclusive for ‘Has Attachment.’
  - Attachment: If an email is marked inclusive for ‘Has Attachment’ (see above), then this email’s attachments will always be marked inclusive for ‘Attachment,’ regardless of their location in the branch. Marking attachments as inclusive helps ensure their unique content is not lost in a reduced set.
  - Inline Changes: In some situations, the content may exist in an email earlier in the branch that no longer exists in the last email in the branch. Marking these earlier emails as inclusive helps ensure their unique content is not lost in a reduced set.
    - Example 1: James sends an email to Carla. In the body of the email James includes language for a press release, and above this language, he writes: “Can you review the draft press release below and approve it before we send it out?” Carla reads the email and replies: “Overall the press release language looks good, but I made some edits to the product names and release dates. Please see my edits below in red.” Carla then makes changes to the press release text in the original email sent by James, and she sends this reply back to James. Although Carla’s reply is the last-in-time email and is marked inclusive, this reply does not contain the original product names and the original release dates that were in the first-in-time email. Therefore, the first-in-time email is also marked as inclusive because it contains content not present in the last-in-time email due to “inline changes.”
    - Example 2: An email is sent with four attachments. In the first-in-time email, the four attachment names are present in the text of the body of the email. Robert replies to this email. Robert’s reply is the last-in-time email and is marked inclusive. However, due to formatting differences generated by the email system and/or the processing platform, the four attachment names at the bottom of the branch are no longer present in the last-in-time email reply. Therefore, the first-in-time email is marked as inclusive because it contains content not present in the last-in-time email due to inline changes.
  - Inferred: In some situations, it may be difficult for LHA's email threading to determine whether or not an email belongs to a thread. The format of the email may not match the expected format of the thread, but other aspects of the email may strongly suggest it belongs to the thread. In these situations, the email is included as part of the thread, even though the inclusion may be somewhat of a “best guess.” The email is also marked as inclusive so that it remains in a reduced set and a reviewer can assess whether or not the inferred match is accurate.
    - Example: Carla forwards an email conversation to Robert. This email conversation is a long branch with eight different email segments/replies. Carla wants Robert to see the last three email replies but not the first five emails below. So she deletes the first five emails in the branch before forwarding the email conversation to Robert. This email no longer conforms to expected threading formats, but due to similarities in the top three email replies, it is matched with the thread that contains eight different email segments/replies. This email is marked as inclusive and as “inferred.”
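The Last In Time reason can be pictured as finding the end of each branch in a thread. The sketch below assumes the thread structure is already known as a mapping from each email to the email it replies to (a hypothetical representation chosen for this example); it does not show how LHA builds the thread or evaluates the other four reasons.

```python
from typing import Optional


def last_in_time(parent_of: dict[str, Optional[str]]) -> set[str]:
    """Return the emails that no other email replies to (the end of each branch)."""
    replied_to = {parent for parent in parent_of.values() if parent is not None}
    return {email for email in parent_of if email not in replied_to}


# e1 is the first email; e2 and e4 both reply to e1, and e3 replies to e2.
thread = {"e1": None, "e2": "e1", "e3": "e2", "e4": "e1"}
print(sorted(last_in_time(thread)))
```

This thread has two branches, so two emails (e3 and e4) come back as last in time, while e1 and e2 could still be inclusive for another reason such as Has Attachment or Inline Changes.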

Email Thread Viewer Display:

In the Thread Viewer, do Field Tags or Communication Markers have a precedence order, and are they customizable?
- There is no precedence ordering or "trumping" for field tags. Emails can have all four values and markers if they are present on the document. Field tags are in a static order, so once the tags are set to a Relativity Choice, the positioning remains the same. The options for Field Tags are limited to only Relativity Choices today, and the colors are not customizable. Communication Markers are hardcoded to the five values available on people profiles, and the colors are not modifiable. There is a hardcoded precedence order for Communication Markers though: the red flag always trumps the yellow flag.
How large of a thread can the thread viewer display? Are there any limitations or performance considerations for viewing or editing large threads?
- The thread viewer can handle very large threads without any major performance issues. It has been tested with threads that contain over 1,000 documents: these threads load into the viewer, and bulk updates can then be performed over every document without any perceptible slowdown. In very large threads, many documents tend to be attachments and duplicates, which allows this information to be presented compactly in the thread viewer.
- There is a limitation in using Relativity’s relational pane when performing bulk updates in that you cannot update specific nodes not currently in the view of the bulk update panel. Because of this, we always recommend using the LHA Thread Viewer Mass Edit feature to edit nodes on the thread.

Near/Text Duplicate Document Identification:

Is word order considered for Text Duplicate or Near Duplicate algorithms? And what about noise words, punctuation, white space, or numbers?
- The near-duplicate grouping algorithm does consider word order in grouping documents together, but punctuation and white space are ignored. Noise words and numbers, however, are not ignored.
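One way to picture these rules is a tokenizer that drops punctuation and white space but keeps every word and number in its original order. This is a sketch of the idea only, using a simple regular expression as a stand-in for whatever text processing LHA actually performs.

```python
import re


def tokens(text: str) -> list[str]:
    """Split text into words and numbers, dropping punctuation and white space."""
    return re.findall(r"\w+", text)


# Punctuation and spacing differences disappear.
print(tokens("Hello,   world.") == tokens("Hello world"))
# Word order, noise words, and numbers are all preserved.
print(tokens("the dog bit the man") == tokens("the man bit the dog"))
print(tokens("Pay  $100, now!"))
```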
What is the difference between Near Duplicate and Text Duplicate algorithms, and when should I use each?
- Near Duplicate (ND) and Text Duplicate (TD) are two algorithms used to identify and group similar documents, but they have distinct functionalities.
  - Near Duplicate (ND)
    - Functionality: ND identifies and groups documents based on their text similarity, grouping together documents that are nearly identical. It employs a median-sized document as the pivot, ensuring comprehensive comparison and accurate grouping.
  - Text Duplicate (TD)
    - Functionality: TD identifies and groups documents with identical text content, even if they would not hash identically due to formatting differences or white space.
  - When to use each:
    - Near Duplicate: Use ND when you need to identify documents with similar content, allowing for variations in text.
    - Text Duplicate: Choose TD when you require precise identification of documents with identical text content, irrespective of formatting differences.
- Note: While Relativity Analytics uses Near Duplicate with a threshold set at 100% as an equivalent to Text Duplicate, this method may not offer the same level of accuracy and flexibility as a dedicated Text Duplicate algorithm like the one provided by Lighthouse Analytics.
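A minimal sketch of the Text Duplicate idea is to hash each document's text after normalizing white space and then group documents that share a hash. The normalization shown here (collapsing runs of white space) and the use of SHA-256 are assumptions for illustration; the article does not describe LHA's actual normalization or hashing.

```python
import hashlib
from collections import defaultdict


def text_key(text: str) -> str:
    """Hash the text after collapsing all runs of white space to single spaces."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_text_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Return groups of document IDs whose normalized text is identical."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[text_key(text)].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]


docs = {"a": "Hello  world\n", "b": "Hello world", "c": "Hello there"}
print(group_text_duplicates(docs))
```

Documents a and b differ at the byte level, so a file hash would not match them, yet they land in the same group here because their text is identical once white space is normalized.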
Why isn't a 100% similarity in Near Duplicate the same as Text Duplicate?
- While a 100% similarity in Near Duplicate might seem equivalent to Text Duplicate, there's a key distinction. Lighthouse Analytics utilizes a median similarity threshold for Near Duplicate, meaning that while a document might be 100% similar to the pivot, it may not exhibit the same level of similarity to all other documents in the group. On the other hand, Text Duplicate guarantees that groups are comprised of documents that are 100% text duplicates, regardless of their relationship to a pivot document. This ensures precise identification of exact text duplicates across document formats, a feature not provided by Near Duplicate algorithms alone.
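The effect of comparing against a pivot can be seen in a toy example. The sketch below uses Python's difflib as a stand-in similarity measure (LHA's actual measure is not described here) and works below the 100% threshold to make the effect visible: two documents that are equally similar to the pivot turn out to be less similar to each other.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; a stand-in for LHA's actual measure."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


pivot = "a b c d e f g h i j"
doc_1 = "b c d e f g h i j"  # the pivot without its first word
doc_2 = "a b c d e f g h i"  # the pivot without its last word

to_pivot_1 = similarity(pivot, doc_1)
to_pivot_2 = similarity(pivot, doc_2)
to_each_other = similarity(doc_1, doc_2)

print(to_pivot_1, to_pivot_2, to_each_other)
```

Because group membership is decided against the pivot, both documents would join the same group at a threshold they individually meet, even though the similarity between the two of them is lower.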

Incremental Builds and Changes:

What's the difference between an aggregate and an incremental report?
- After you run an Incremental run and click into the report, a toggle in the top-right corner switches between Aggregate and Incremental report views.
- When the toggle is set to Aggregate, the summary cards and full reports show the normal report information that already exists. When it is set to Incremental, the summary cards and full reports show numbers and calculations for the newly added documents only.
Do incremental builds impact pre-existing results?
- Yes, incremental builds can modify pre-existing results.
- For example, if an incremental build for Email Threading is run and a previously Inclusive document becomes Non-Inclusive as a result, that document is updated even though it is not part of the newly added incremental set.
- Incremental builds can be set up in one of two ways:
  - The first and most common method is to update the source saved search so that it includes additional documents. This can happen naturally if the saved search refers to coding fields.
  - The second method is to use a completely new search that includes additional documents. In either case, the system grabs all previously un-analyzed documents and builds incrementally upon the previous results.
- Fields that are potentially impacted on existing results are Inclusive Reason, Is Inclusive, Count, Threading ID, and Thread Display. Documents are not removed from groups and re-grouped, but groups can increase in size due to the additional documents added to them.
- Incremental builds for Near Duplicate and Text Duplicate may also add documents to the groups, thereby impacting the count fields, but no other fields are impacted for these algorithms. For PII Identification and Name Normalization, incremental builds do not impact prior results.
- The Incremental Build report for a threading set is useful in determining changes to pre-existing groups. The report accounts for which prior threads have new emails added to them, which new thread conversations were not in the prior sets, and which documents were inclusive but no longer are because of the added documents.
- If the incremental results have been overlaid, directly access the new or impacted documents by clicking on the document count links in the set’s incremental report.
How does the Near Duplicate algorithm work with incremental builds?
- During an incremental run, the newly added documents are compared to the pre-existing near-duplicate groups and to one another. A new document can be added to a pre-existing near-duplicate group if it meets the similarity threshold specified in the algorithm setup. If a new document does not meet the similarity threshold against documents in any pre-existing group, it becomes a member of a newly created group that can contain one or more newly added documents.
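The incremental behavior described above can be sketched as follows. This is a simplified illustration, not LHA's algorithm: it assumes each group is compared through a single pivot text, that the first unmatched new document becomes the pivot of a new group, and it uses difflib as a stand-in similarity measure. All names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; a stand-in for the real measure."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def add_incrementally(groups: list, new_docs: list, threshold: float) -> list:
    """Add new documents to existing groups, or start new groups for them.

    groups: list of {"pivot": text, "members": [text, ...]} dictionaries.
    """
    for doc in new_docs:
        for group in groups:
            if similarity(group["pivot"], doc) >= threshold:
                group["members"].append(doc)
                break
        else:
            # No group met the threshold, so this document starts a new group
            # that later new documents can also join.
            groups.append({"pivot": doc, "members": [doc]})
    return groups


existing = [{"pivot": "alpha beta gamma delta",
             "members": ["alpha beta gamma delta"]}]
new_docs = ["alpha beta gamma delta epsilon",
            "one two three four",
            "one two three four five"]
for group in add_incrementally(existing, new_docs, threshold=0.8):
    print(len(group["members"]), group["pivot"])
```

Existing members are never moved between groups in this sketch; groups only grow or are newly created, which matches the behavior described for incremental builds.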
During an incremental threading run, what scenarios cause documents that were previously not unique (not inclusive) to change status to unique (inclusive)?
- There are three main reasons a document can change from not unique to unique after an incremental run:
  - The most common reason a document can become unique after an incremental run is because of inline changes. During an incremental run, new emails are often added to a pre-existing thread and this can create longer branches, additional branches in that thread or both. This creates new last-in-time emails and the previous last-in-time emails become non-inclusive/non-unique. In some situations, the new last-in-time emails no longer contain 100% of the text included in all the subsumed emails down the branch; therefore, one or more emails down the branch can become inclusive for “inline changes.” These inline changes are often triggered by email formatting or footer differences that are not present in the newly added last-in-time email.
  - A less common reason a document can become unique after an incremental run is that a broken (incomplete) family relationship was completed by the incremental run. For example, the saved search used for a full, initial run contains a “broken family” where an email is present but its attachment is not part of the saved search. In the full/initial run, this email with a missing attachment is designated non-inclusive. Later on, in an incremental run, this missing attachment is included in the saved search and the email becomes inclusive/unique with the inclusive reason “has attachment.”
  - The last reason relates to upgrading LHA to a new version that may incorporate algorithm improvements as a part of the release. Documents can become unique after an incremental run when a recent LHA upgrade includes enhancements to the email threading algorithm. While quality is the primary focus of LHA's email threading offering, a new LHA release may include some minor enhancements to our email threading algorithm to address the recognition of foreign languages, unconventional or newly encountered email formats, improved handling of extracted text with poor formatting, etc. These enhancements can cause some documents to change from non-unique to unique during an incremental run.
