Enhancing LLM Accuracy Using MongoDB Vector Search and Unstructured.io Metadata

Despite the remarkable strides in artificial intelligence, particularly in generative AI (GenAI), precision remains an elusive goal for large language model (LLM) outputs. According to the latest annual McKinsey Global Survey, “The state of AI in 2023,” GenAI has had a breakout year. Nearly one-quarter of C-suite executives personally use general AI tools for work, and over 25% of companies with AI implementations have general AI on their boards' agendas. Additionally, 40% of respondents plan to increase their organization's investment in AI due to advances in general AI. The survey reflects the immense potential and rapid adoption of AI technologies. However, the survey also points to a significant concern: inaccuracy.

Inaccuracy in LLMs often results in "hallucinations" or incorrect information due to limitations like shallow semantic understanding and varying data quality. Incorporating semantic vector search using MongoDB can help by enabling real-time querying of training data, ensuring that generated responses align closely with what the model has learned. Furthermore, adding metadata filtering extracted by Unstructured tools can refine accuracy by allowing the model to weigh the reliability of its data sources. Together, these methods can significantly minimize the risk of hallucinations and make LLMs more reliable.

This article addresses this challenge by providing a comprehensive guide on enhancing the precision of your LLM outputs using MongoDB's Vector Search and Unstructured Metadata extraction techniques. The main purpose of this tutorial is to equip you with the knowledge and tools needed to incorporate external source documents in your LLM, thereby enriching the model's responses with well-sourced and contextually accurate information. At the end of this tutorial, you can generate precise output from the OpenAI GPT-4 model to cite the source document, including the filename and page number. The entire notebook for this tutorial is available on Google Colab, but we will be going over sections of the tutorial together.

Why use MongoDB Vector Search?

MongoDB is a NoSQL database, which stands for "Not Only SQL," highlighting its flexibility in handling data that doesn't fit well in tabular structures like those in SQL databases. NoSQL databases are particularly well-suited for storing unstructured and semi-structured data, offering a more flexible schema, easier horizontal scaling, and the ability to handle large volumes of data. This makes them ideal for applications requiring quick development and the capacity to manage vast metadata arrays.

MongoDB's robust vector search capabilities and ability to seamlessly handle vector data and metadata make it an ideal platform for improving the precision of LLM outputs. It allows for multifaceted searches based on semantic similarity and various metadata attributes. This unique feature set distinguishes MongoDB from traditional developer data platforms and significantly enhances the accuracy and reliability of the results in language modeling tasks.

Why use Unstructured metadata?

The Unstructured open-source library provides components for ingesting and preprocessing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. The Unstructured modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficiently transforming unstructured data into structured outputs.

Metadata is often referred to as "data about data." It provides contextual or descriptive information about the primary data, such as its source, format, and relevant characteristics. The metadata from the Unstructured tools tracks various details about elements extracted from documents, enabling users to filter and analyze these elements based on particular metadata of interest. The metadata fields include information about the source document and data connectors.

The concept of metadata is familiar, but its application in the context of unstructured data brings many opportunities. The Unstructured package tracks a variety of metadata at the element level. This metadata can be accessed with element.metadata and converted to a Python dictionary representation using element.metadata.to_dict().

In this article, we particularly focus on filename and page_number metadata to enhance the traceability and reliability of the LLM outputs. By doing so, we can cite the exact location of the PDF file that provides the answer to a user query. This becomes especially crucial when the LLM answers queries related to sensitive topics such as financial, legal, or medical questions.

Code walkthrough

Requirements

Sign up for a MongoDB Atlas account and install the PyMongo library in the IDE of your choice or Colab.
Install the Unstructured library in the IDE of your choice or Colab.
Install the Sentence Transformer library for embedding in the IDE of your choice or Colab.
Get the OpenAI API key. To do this, please ensure you have an OpenAI account.

Step-by-step process

Extract the texts and metadata from source documents using Unstructured's partition_pdf.
Prepare the data for storage and retrieval in MongoDB.
- Vectorize the texts using the SentenceTransformer library.
- Connect and upload records into MongoDB Atlas.
- Query the index based on embedding similarity.
Generate the LLM output using the OpenAI Model.

Step 1: Text and metadata extraction

Please make sure you have installed the required libraries to run the necessary code.

Code Snippet

We'll delve into extracting data from a PDF document, specifically the seminal "Attention is All You Need" paper, using the partition_pdf function from the Unstructured library in Python. First, you'll need to import the function with from unstructured.partition.pdf import partition_pdf. Then, you can call partition_pdf and pass in the necessary parameters:

filename specifies the PDF file to process, which is "example-docs/Attention is All You Need.pdf."
strategy sets the extraction type, and for a more comprehensive scan, we use "hi_res."
Finally, infer_table_structured=True tells the function to also extract table metadata.

Properly set up, as you can see in our Colab file, the code looks like this:

Code Snippet

By running this code, you'll populate the elements variable with all the extracted information from the PDF, ready for further analysis or manipulation. In the Colab’s code snippets, you can inspect the extracted texts and element metadata. To observe the sample outputs — i.e., the element type and text — please run the line below. Use a print statement, and please make sure the output you receive matches the one below.

Code Snippet

Output:

Code Snippet

(unstructured.documents.elements.NarrativeText,
 'The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English- to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.')
(unstructured.documents.elements.NarrativeText,
 '∗Equal contribution. Listing order is random....

You can also use Counter from Python Collection to count the number of element types identified in the document.

Code Snippet

Finally, you can convert the element objects into Python dictionaries using convert_to_dict built-in function to selectively extract and modify the element metadata.

Code Snippet

Step 2: Data preparation, storage, and retrieval

Step 2a: Vectorize the texts using the SentenceTransformer library.

We must include the extracted element metadata when storing and retrieving the texts from MongoDB Atlas to enable data retrieval with metadata and vector search.

First, we vectorize the texts to perform a similarity-based vector search. In this example, we use microsoft/mpnet-base from the Sentence Transformer library. This model has a 768 embedding size.

Code Snippet

It is important to use a model with the same embedding size defined in MongoDB Atlas Index. Be sure to use the embedding size compatible with MongoDB Atlas indexes. You can define the index using the JSON syntax below:

Code Snippet

Copy and paste the JSON index into your MongoDB collection so it can index the embedding field in the records. Please view this documentation on how to index vector embeddings for Vector Search.

Next, create the text embedding for each record before uploading them to MongoDB Atlas:

Code Snippet

for record in records:
    txt = record['text']
    
    # use the embedding model to vectorize the text into the record
    record['embedding'] = model.encode(txt).tolist()

# print the first record with embedding
records[0]

# output
{'type': 'NarrativeText',
 'element_id': '6b82d499d67190c0ceffe3a99958e296',
 'metadata': {'coordinates': {'points': ((327.6542053222656,
     199.8135528564453),
    (327.6542053222656, 315.7165832519531),
    (1376.0062255859375, 315.7165832519531),
    (1376.0062255859375, 199.8135528564453)),
   'system': 'PixelSpace',
   'layout_width': 1700,
   'layout_height': 2200},
  'filename': 'Attention is All You Need.pdf',
  'last_modified': '2023-10-09T20:15:36',
  'filetype': 'application/pdf',
  'page_number': 1,
  'detection_class_prob': 0.5751863718032837},
 'text': 'Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works.',
 'embedding': [-0.018366225063800812,
  -0.10861606895923615,
  0.00344603369012475,
  0.04939081519842148,
  -0.012352174147963524,
  -0.04383034259080887,...],
'_id': ObjectId('6524626a6d1d8783bb807943')}
}

Step 2b: Connect and upload records into MongoDB Atlas

Before we can store our records on MongoDB, we will use the PyMongo library to establish a connection to the target MongoDB database and collection. Use this code snippet to connect and test the connection (see the MongoDB documentation on connecting to your cluster).

Code Snippet

Once run, the output: “Pinged your deployment. You successfully connected to MongoDB!” will appear.

Next, we can upload the records using PyMongo's insert_many function.

To do this, we must first grab our MongoDB database connection string. Please make sure the database and collection names match with the ones in MongoDB Atlas.

Code Snippet

Let’s preview the records in MongoDB Atlas:

Step 2c: Query the index based on embedding similarity

Now, we can retrieve the relevant records by computing the similarity score defined in the index vector search. When a user sends a query, we need to vectorize it using the same embedding model we used to store the data. Using the aggregate function, we can pass a pipeline that contains the information to perform a vector search.

Now that we have the records stored in MongoDB Atlas, we can search the relevant texts using the vector search. To do so, we need to vectorize the query using the same embedding model and use the aggregate function to retrieve the records from the index.

In the pipeline, we will specify the following:

index: The name of the vector search index in the collection
vector: The vectorized query from the user
k: Number of the most similar records we want to extract from the collection
score: The similarity score generated by MongoDB Atlas

Code Snippet

The above pipeline will return the top five records closest to the user’s query embedding. We can define k to retrieve the top-k records in MongoDB Atlas. Please note that the results contain the metadata, text, and score. We can use this information to generate the LLM output in the following step.

Here’s one example of the top five nearest neighbors from the query above:

Code Snippet

{'element_id': '7128012294b85295c89efee3bc5e72d2',
  'metadata': {'coordinates': {'layout_height': 2200,
                               'layout_width': 1700,
                               'points': [[290.50477600097656,
                                           1642.1170677777777],
                                          [290.50477600097656,
                                           1854.9523748867755],
                                          [1403.820083618164,
                                           1854.9523748867755],
                                          [1403.820083618164,
                                           1642.1170677777777]],
                               'system': 'PixelSpace'},
               'detection_class_prob': 0.9979791045188904,
               'file_directory': 'example-docs',
               'filename': 'Attention is All You Need.pdf',
               'filetype': 'application/pdf',
               'last_modified': '2023-09-20T17:08:35',
               'page_number': 3,
               'parent_id': 'd1375b5e585821dff2d1907168985bfe'},
  'score': 0.2526094913482666,
  'text': 'Decoder: The decoder is also composed of a stack of N = 6 identical '
          'layers. In addition to the two sub-layers in each encoder layer, '
          'the decoder inserts a third sub-layer, which performs multi-head '
          'attention over the output of the encoder stack. Similar to the '
          'encoder, we employ residual connections around each of the '
          'sub-layers, followed by layer normalization. We also modify the '
          'self-attention sub-layer in the decoder stack to prevent positions '
          'from attending to subsequent positions. This masking, combined with '
          'fact that the output embeddings are offset by one position, ensures '
          'that the predictions for position i can depend only on the known '
          'outputs at positions less than i.',
  'type': 'NarrativeText'}

Step 3: Generate the LLM output with source document citation

We can generate the output using the OpenAI GPT-4 model. We will use the ChatCompletion function from OpenAI API for this final step. ChatCompletion API processes a list of messages to generate a model-driven response. Designed for multi-turn conversations, they're equally adept at single-turn tasks. The primary input is the 'messages' parameter, comprising an array of message objects with designated roles ("system", "user", or "assistant") and content. Usually initiated with a system message to guide the assistant's behavior, conversations can vary in length with alternating user and assistant messages. While the system message is optional, its absence may default the model to a generic helpful assistant behavior.

You’ll need an OpenAI API key to run the inferences. Before attempting this step, please ensure you have an OpenAI account. Assuming you store your OpenAI API key in your environment variable, you can import it using the os.getenv function:

Code Snippet

Next, having a compelling prompt is crucial for generating a satisfactory result. Here’s the prompt to generate the output with specific reference where the information comes from — i.e., filename and page number.

Code Snippet

In this Python script, a request is made to the OpenAI GPT-4 model through the ChatCompletion.create method to process a conversation. The conversation is structured with predefined roles and messages. It is instructed to generate a response based on the provided context and user query, summarizing the answer while citing the page number and file name. The temperature parameter set to 0.2 influences the randomness of the output, favoring more deterministic responses.

Evaluating the LLM output quality with source document

One of the key features of leveraging unstructured metadata in conjunction with MongoDB's Vector Search is the ability to provide highly accurate and traceable outputs.

Code Snippet

You can insert this query into the ChatCompletion API as the “user” role and the context from MongoDB retrieval results as the “assistant” role. To enforce the model responds with the filename and page number, you can provide the instruction in the “system” role.

Code Snippet

Source document:

LLM Output:

The highly specific output cites information from the source document, "Attention is All You Need.pdf," stored in the 'example-docs' directory. The answers are referenced with exact page numbers, making it easy for anyone to verify the information. This level of detail is crucial when answering queries related to research, legal, or medical questions, and it significantly enhances the trustworthiness and reliability of the LLM outputs.

Conclusion

This article presents a method to enhance LLM precision using MongoDB's Vector Search and Unstructured Metadata extraction techniques. These approaches, facilitating real-time querying and metadata filtering, substantially mitigate the risk of incorrect information generation. MongoDB's capabilities, especially in handling vector data and facilitating multifaceted searches, alongside the Unstructured library's data processing efficiency, emerge as robust solutions. These techniques not only improve accuracy but also enhance the traceability and reliability of LLM outputs, especially when dealing with sensitive topics, equipping users with the necessary tools to generate more precise and contextually accurate outputs from LLMs.

Ready to get started? Request your Unstructured API key today and unlock the power of Unstructured API and Connectors. Join the Unstructured community group to connect with other users, ask questions, share your experiences, and get the latest updates. We can’t wait to see what you’ll build.

Atlas

Enhancing LLM Accuracy Using MongoDB Vector Search and Unstructured.io Metadata

Why use MongoDB Vector Search?

Why use Unstructured metadata?

Code walkthrough

Requirements

Step-by-step process

Evaluating the LLM output quality with source document

Conclusion

Related

Create a Multi-Cloud Cluster with MongoDB Atlas

Query Analytics Part 1: Know Your Queries

Auto Pausing Inactive Clusters

A Free REST API for Johns Hopkins University COVID-19 dataset

Table of Contents