GenAI for Entity Extraction in Unstructured Natural Language Data

Santhosh Shanmugam
Published in Make It New · Jan 25, 2024

This blog post is meant for readers who are already familiar with GenAI technologies and the entity extraction use case. There is a growing need for automated, effective solutions for extracting entities from large corpora of text. In my role as a consultant, I have frequently encountered businesses sitting on substantial amounts of unstructured natural language data. Many of these businesses also need to extract specific information from this data, and ordinarily this is solved through manual labor: a task force is created with the sole purpose of reading through the data and extracting what is relevant. This is, of course, both time-consuming and prone to human error. GenAI technologies, when used correctly, can automatically and effectively retrieve information from large bodies of natural language. This blog post delves into my experience with utilizing GenAI models for entity extraction.

Large amounts of natural language data.

The first step of any machine learning project is, of course, data acquisition. For GenAI-based entity extraction specifically, you will need a labeled dataset with which to later evaluate the performance of the GenAI solution. If you are unable to verify the accuracy of the model's outputs, it is advisable to refrain from employing GenAI solutions for inference.

Labeling data is a must

The more you know about your data, the better you will be able to tailor your prompts. Hence, data analysis can, and should, be coupled with labeling. The labeling procedure should follow labeling best practices, e.g., ensuring a common understanding among the labelers of what is expected for each label through strict labeling rules. This can be achieved by assigning multiple labelers per data point and agreeing on a consensus for handling bad data. In doing this, you will gather information crucial for prompt generation. Providing context for each label and its intended usage will enhance the GenAI models' performance. State-of-the-art models like GPT-4 and Gemini Pro are trained on extensive data and will hence recognize most entities that need extraction. Yet there are instances where lesser-known entities need identification; offering context on how these entities will be used and how they appear in the text proves helpful in such cases. You will also be able to identify exceptions, a crucial aspect when trying to ensure comprehensive coverage.

What, then, are the next steps after retrieving a labeled dataset?

One of the first insights I gained when testing the GenAI APIs was the difficulty of achieving consistent output. Machine learning models are inherently non-deterministic, and generative AI models introduce an additional layer of uncertainty on top of this. To use such a model for entity extraction, you will first have to set the temperature to 0. Temperature is a hyperparameter that controls the randomness of the model; a temperature of 0 makes it greedily pick the highest-probability token at each step. Even with this hyperparameter, the model may produce different outputs for the same input. Hence, in ensuring a consistent output, I discovered the value of prompt engineering.

1. Standardize the output format

Defining the output format of GenAI models through prompt engineering can be achieved through tools like Langchain or by presenting specific examples. Langchain's StructuredOutputParser lets you specify a JSON format for the expected output, and within it you can define a ResponseSchema for each individual entity. Such a format specification provides a means to validate the model's output against predefined criteria. This validation can then be used to further refine the prompt through a trial-and-error approach.

2. A consistent output necessitates iterative testing

By iteratively testing the GenAI APIs with different data points, it becomes possible to pinpoint instances where the model deviates from the desired format. By giving this contextual knowledge to the model through additional examples in the prompt, you can enhance its overall consistency. Incorporating these instances as examples in the prompt may not seem scalable at first glance. However, practical observations reveal a consistency in the shortcomings of both GPT-4 and Gemini Pro, i.e., they repeatedly produce incorrect output formats for the same data points. Hence, in practice, only a few examples are needed before you see a significant improvement.
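Folding those failing data points back in as in-context examples amounts to building a few-shot prompt; this sketch assumes hypothetical entity names and instruction wording:

```python
def few_shot_prompt(examples: list[tuple[str, str]], text: str) -> str:
    """Prepend labeled (input, expected JSON) pairs to the instruction,
    so the model sees the exact format it must reproduce."""
    parts = ["Extract the entities as JSON with keys 'company' and 'person'."]
    for source, expected in examples:
        parts.append(f"Text: {source}\nOutput: {expected}")
    parts.append(f"Text: {text}\nOutput:")  # the new data point to label
    return "\n\n".join(parts)

# Examples drawn from data points where the model previously failed.
examples = [
    ("Acme Corp hired Jane Doe.",
     '{"company": "Acme Corp", "person": "Jane Doe"}'),
]
prompt = few_shot_prompt(examples, "Globex promoted John Smith.")
```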

Rather than incorporating a fixed set of instances directly in your prompt, you could also select them dynamically through RAG (Retrieval-Augmented Generation). This involves leveraging a vector database (e.g., chromadb) containing labeled examples and conducting a similarity search on incoming text to identify relevant examples that can be used as context in prompts. However, it's important to note that this approach introduces additional complexity and maintenance effort, as the vector database needs to be kept up to date. Unfortunately, even after moving from zero-shot to few-shot prompting, the model may still output an incorrectly formatted answer.
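The retrieval step can be illustrated with a toy bag-of-words similarity search; a real deployment would use chromadb with proper embeddings, so everything here (the similarity measure, the corpus) is a simplified assumption:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_examples(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k labeled examples most similar to the incoming text,
    the role a vector database like chromadb plays at scale."""
    q = Counter(query.lower().split())
    scored = sorted(corpus,
                    key=lambda doc: cosine(q, Counter(doc.lower().split())),
                    reverse=True)
    return scored[:k]

corpus = ["Acme Corp hired Jane Doe.", "The weather in Oslo was mild."]
print(retrieve_examples("Globex hired John Smith.", corpus))
```

The retrieved examples are then slotted into the prompt as few-shot context, exactly as with the static approach.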

3. Expect to fail

Defensive UX is necessary when utilizing GenAI for deterministic tasks. With this perspective, errors can be anticipated and gracefully managed. In entity extraction, inconsistencies in the output format can make the system fail even when the GenAI models correctly identify the entities. Hence, the previously mentioned trial-and-error approach should be used not only to find in-context examples but also to identify how best to handle errors. A JSON repair library can be used when the model fails to output valid JSON. In some cases, it might be beneficial to prompt the model for a more readily extractable entity, which can then be used in a dictionary lookup to retrieve the target entity. If no such fix can recover the correct entities, you will have to return a correctly formatted None value, so that the system does not fail when the GenAI API returns an incorrectly formatted value.
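The repair-then-fall-back pattern might look like this minimal sketch; the two repairs shown (stripping markdown fences, dropping trailing commas) are illustrative stand-ins for what a full JSON repair library would do, and the fallback key is a hypothetical example:

```python
import json
import re

def safe_parse(raw: str) -> dict:
    """Try to parse the model's reply as JSON; apply cheap repairs and
    fall back to a correctly formatted None value so that downstream
    code never crashes on a malformed reply."""
    # Strip markdown code fences the model sometimes wraps JSON in.
    cleaned = re.sub(r"^```(?:json)?|```$", "", raw.strip(), flags=re.MULTILINE)
    # Drop trailing commas before a closing brace or bracket.
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return {"entity": None}  # well-formed fallback, not a crash

print(safe_parse('```json\n{"entity": "Acme",}\n```'))  # → {'entity': 'Acme'}
print(safe_parse("Sorry, I cannot help."))              # → {'entity': None}
```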

Flowchart illustrating the sequential process of going from large amounts of natural language data to specific entities through the joint effort of GenAI and manual labeling.

Using these methods, you should be able to make use of GenAI APIs for the deterministic task of entity extraction. It's worth noting that advanced GenAI models such as GPT-4 and Gemini Pro offer capabilities far beyond entity extraction, and in certain scenarios employing such LLMs may be excessive or even unnecessary. An alternative to these APIs is to manually annotate a substantial amount of data and train a customized language model designed specifically for entity extraction on your data. While this method is viable for datasets with consistent patterns, it may prove less efficient when dealing with inconsistent data, and it introduces challenges similar to those encountered with manual task forces. Therefore, I advocate a blend of smaller LLMs trained on more targeted extractions, combined with a GenAI solution for handling the less predictable use cases, as the optimal approach.

