
Daniel Barak
April 25, 2023
With a background in computer science and almost 20 years in the eDiscovery world, I think a sober analysis of the impact of these advancements is much needed as a part of our continued conversations about a technology that some tech leaders have compared to the invention of the internet, or even electricity.
Here are some questions that keep coming up regarding ChatGPT and eDiscovery: If ChatGPT can pass the bar, is it going to perform document review on its own or even replace attorneys? How long before we can just ask the AI to find all the responsive documents? I think there is an assumption among many that in a few months we will be able to just ingest all eDiscovery documents into a tool like ChatGPT and get all the answers. The reality, however, is more complicated.
Let’s start with some demystification. At its core, a system like ChatGPT works by trying to guess what word should come next in any given context. It sounds simple, but the more data it has been trained on, the better it becomes at this guessing game.
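To make that guessing game concrete, here is a toy illustration in Python. It simply counts which word tends to follow each word in a tiny made-up corpus and predicts the most frequent follower; a real model like GPT learns far richer patterns from billions of examples, but the shape of the task, predicting the next word from context, is the same.

```python
from collections import Counter, defaultdict

# A toy "language model": count which word tends to follow each word in a
# tiny corpus, then predict the most frequent follower. Real LLMs learn far
# richer patterns from vastly more data, but the prediction task is the same.
corpus = (
    "the custodian sent the email . "
    "the custodian deleted the email . "
    "the attorney reviewed the email ."
).split()

next_word_counts = defaultdict(Counter)
for current, following in zip(corpus, corpus[1:]):
    next_word_counts[current][following] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the corpus."""
    candidates = next_word_counts.get(word)
    return candidates.most_common(1)[0][0] if candidates else "?"

print(predict_next("the"))        # "email" (the most common word after "the")
print(predict_next("custodian"))  # "sent" (first of the tied candidates)
```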
I will try to simplify some of the technical components and bring them down to earth or at least to give you a few AI terms that you can use to sound cool at social events.
Some Key Terminology:
Generative Pre-trained Transformer (GPT): That’s why they call it ChatGPT…because you are chatting with a GPT. Ok, but what is a GPT? It’s a form of Large Language Model that has two levels of training. First, there is unsupervised training (no humans involved) on massive amounts of data, followed by a fine-tuning process in which humans refine the model on a smaller dataset by performing tasks such as providing the correct answers to questions. The goal is to leverage its extensive training data to accurately predict and generate (the G in GPT) the next word in a sequence, creating coherent text in response to user inputs, also known as prompts.
Tokens: Small units of text (words or pieces of a word) that the model works with. This term is important because Large Language Models typically have a maximum number of tokens that they can process and respond to. This limitation impacts a model’s ability to handle long text like full documents or articles.
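As a small illustration of tokens, the sketch below uses OpenAI’s tiktoken library (assuming it is installed) to break a sentence into tokens and count them; the example sentence and the comment about context size are illustrative rather than taken from any particular product.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the tokenizer used by the GPT-3.5 / GPT-4 family of models.
encoding = tiktoken.get_encoding("cl100k_base")

text = "The quarterly report was forwarded to outside counsel for review."
tokens = encoding.encode(text)

print(len(tokens), "tokens")
print([encoding.decode([t]) for t in tokens])  # the individual word pieces

# A model's context window (roughly 8,000 tokens for the original GPT-4, for
# example) must hold the prompt, the document text, and the answer combined,
# so a long contract or email thread can exceed the budget on its own.
```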
Generative AI in eDiscovery: Strengths and Weaknesses

LLMs can be very effective at analyzing and summarizing content that is not part of their training data. In fact, there are many exciting and useful projects built around tools like ChatGPT that let you do things like chat with a PDF document and ask questions about its content. However, performing this process across a vast collection of documents, as is necessary in an eDiscovery case, introduces a set of distinct challenges.
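Before turning to the multi-document problem, here is a minimal sketch of that single-document “chat with a PDF” pattern, assuming the pypdf library and the OpenAI Python client as it existed at the time of writing (the ChatCompletion interface). The file name, prompts, and model choice are purely illustrative, and the whole approach only works while the document’s text fits within the model’s token limit.

```python
# pip install pypdf openai
# Minimal "chat with a PDF" sketch: extract the text, place it in the prompt,
# and ask a question about it. Only viable while the document fits inside the
# model's token limit. File name, prompts, and model choice are illustrative.
import openai
from pypdf import PdfReader

reader = PdfReader("meeting_minutes.pdf")  # hypothetical input file
document_text = "\n".join((page.extract_text() or "") for page in reader.pages)

question = "Who approved the budget increase, and when?"

response = openai.ChatCompletion.create(  # openai<1.0 style interface
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Answer using only the provided document."},
        {"role": "user",
         "content": f"Document:\n{document_text}\n\nQuestion: {question}"},
    ],
)
print(response["choices"][0]["message"]["content"])
```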
There are essentially two ways that an LLM can provide insight into custom datasets. One way is to train the dataset into the model, essentially creating your own GPT that knows about the documents; the other is to use a generically trained model, such as GPT-4, and point it at a custom dataset stored in a separate database. Both approaches have advantages and disadvantages.

Training the dataset into the model

If you were to build an LLM by training an entire dataset into the model, you could then ask questions about the dataset, and the LLM, if trained properly, could provide what appear to be answers. Why do I say, “appear to be”? Because, as I mentioned earlier, generative LLMs such as GPT are designed to predict the next word in a sequence based on patterns they have seen before. Sometimes they may generate text that sounds plausible but is not factually accurate. When an LLM generates inaccurate content, it is called a Hallucination.
Here is an example of a ChatGPT Hallucination: I asked the question, “What is the first word of the text of the second chapter in Alice in Wonderland?”
ChatGPT responded:
The first word of the text in the second chapter of “Alice’s Adventures in Wonderland” is “Alice.” The full sentence is: “Alice opened the door and found that it led into a small passage, not much larger than a rat-hole: she knelt down and looked along the passage into the loveliest garden you ever saw.”
This answer is incorrect. In fact, it gave a different answer each time I asked. The first sentence of chapter 2 actually begins as follows:
“Curiouser and curiouser!” cried Alice
Why do Hallucinations occur? Here’s an answer straight from ChatGPT: “Hallucinations in language models like ChatGPT happen because they learn from imperfect data and have limitations in understanding context. Researchers are working to improve these models to make them more accurate and reliable.”
I understand the irony of using a language model that can produce inaccurate text to explain why it can produce inaccurate text, but it’s not a bad explanation in this case.

Point an LLM at external data

With this approach, the model itself remains generically trained; it answers questions about the dataset by searching the external database and analyzing the documents it retrieves, without any case data being trained into the model. Despite these benefits, there are important caveats to consider. One limitation is that the answers provided by the language model may not be exhaustive or fully representative of the entire dataset, as they are based on the model’s search and analysis capabilities.
Another challenge arises from the maximum token limit imposed by large language models, which restricts the amount of text they can analyze and respond with at a given time. As a result, the AI might face difficulties in analyzing entire documents, particularly lengthy or complex ones, in a single pass.
These may sound like minor limitations, but in practical terms they mean that while you could ask the LLM a question about your data, it would not necessarily bring back an answer that encompasses the entire dataset. The answer would be limited to the documents returned by a search run by the LLM, potentially overlooking crucial documents that could have led to a more accurate response.
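The sketch below illustrates this search-then-answer pattern. The document list and the search_index and ask_llm helpers are hypothetical stand-ins for a real case database, search engine, and LLM API call; the key point from the paragraph above is visible in the code itself: the model only ever sees the handful of documents the search step returns.

```python
# Sketch of the search-then-answer pattern. CASE_DOCUMENTS, search_index, and
# ask_llm are hypothetical stand-ins for a real document database, search
# engine, and LLM API call.
CASE_DOCUMENTS = [
    "2021-03-02 email: 'Please move the Q1 numbers to the shared drive.'",
    "2021-03-05 email: 'The Q1 numbers look off; let's discuss before filing.'",
    "2021-04-11 memo: 'Retention policy update for the finance department.'",
]

def search_index(query: str, top_k: int = 2) -> list[str]:
    """Crude keyword search: return up to top_k documents sharing a word with the query."""
    words = set(query.lower().split())
    hits = [doc for doc in CASE_DOCUMENTS if words & set(doc.lower().split())]
    return hits[:top_k]

def ask_llm(prompt: str) -> str:
    """Placeholder for a ChatGPT-style completion call."""
    return f"[model answer based on a {len(prompt)}-character prompt]"

def answer_question(question: str) -> str:
    # 1. Search the external dataset; the model never reads every document.
    retrieved = search_index(question)
    # 2. Only the retrieved snippets are placed in the prompt (token limits apply).
    context = "\n---\n".join(retrieved)
    prompt = f"Answer using only these documents:\n{context}\n\nQuestion: {question}"
    # 3. The answer can only reflect those documents; anything the search
    #    missed, however responsive, is simply not part of the answer.
    return ask_llm(prompt)

print(answer_question("Who raised concerns about the Q1 numbers?"))
```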
There are other practical uses of pointing an LLM at eDiscovery data that do not involve attempting to interact with it the way we interact with a service like ChatGPT. It could be used to summarize long documents or text message conversations. It could explain complex documents that would otherwise require input from an outside expert. Redaction tools can integrate LLMs to better identify and redact sensitive information and Personally Identifiable Information (PII). There are countless practical uses that do not involve asking questions of the dataset or attempting to have the LLM perform document review.
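As one hedged example of these practical uses, the sketch below summarizes a document that is too long for a single prompt by splitting it into token-sized chunks, summarizing each chunk, and then summarizing the summaries. The summarize helper is a placeholder for whatever LLM call you would actually make, and the 3,000-token chunk size is an arbitrary illustrative value.

```python
# pip install tiktoken
# Summarizing a document that is too long for a single prompt: split it into
# token-sized chunks, summarize each chunk, then summarize the summaries.
# summarize() is a placeholder for a real LLM call; the chunk size is an
# arbitrary illustrative value.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def summarize(text: str) -> str:
    """Placeholder for an LLM call such as 'Summarize the following text: ...'."""
    return text[:120] + "..."  # stand-in so the sketch runs without an API key

def split_into_chunks(text: str, max_tokens: int = 3000) -> list[str]:
    """Split text into pieces that each fit within the model's token budget."""
    tokens = encoding.encode(text)
    return [
        encoding.decode(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]

def summarize_long_document(text: str) -> str:
    chunk_summaries = [summarize(chunk) for chunk in split_into_chunks(text)]
    # Combine the per-chunk summaries into one final summary.
    return summarize("\n".join(chunk_summaries))
```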

Hybrid Approach

Another space to watch is companies such as Microsoft incorporating GPT implementations into their compliance solutions. This integration could enable organizations and compliance officers to proactively identify and restrict the storage and dissemination of documents and messages that carry potential legal risks, mitigating potential lawsuits long before they materialize.

Privacy Risks and Ethical Considerations in AI and eDiscovery


Key Takeaways
It is also essential to address the ethical considerations and privacy risks associated with using LLMs in legal contexts. The possibility of inaccuracies in the generated content highlights the importance of maintaining human oversight in the review process. As LLM technology advances and becomes more integrated into legal and compliance solutions, it is crucial to develop a deeper understanding of its technical limitations to ensure accuracy and reliability when utilizing these tools in the context of eDiscovery.