BlogAnnounced at MongoDB.local NYC 2024: A recap of all announcements and updatesLearn more >>
MongoDB Developer
Atlas
plus
Sign in to follow topics
MongoDB Developer Centerchevron-right
Developer Topicschevron-right
Productschevron-right
Atlaschevron-right

Enhancing LLM Accuracy Using MongoDB Vector Search and Unstructured.io Metadata

Ronny Hoesada12 min read • Published Dec 04, 2023 • Updated Dec 04, 2023
AIPythonAtlas
Facebook Icontwitter iconlinkedin icon
Rate this tutorial
star-empty
star-empty
star-empty
star-empty
star-empty
Despite the remarkable strides in artificial intelligence, particularly in generative AI (GenAI), precision remains an elusive goal for large language model (LLM) outputs. According to the latest annual McKinsey Global Survey, “The state of AI in 2023,” GenAI has had a breakout year. Nearly one-quarter of C-suite executives personally use general AI tools for work, and over 25% of companies with AI implementations have general AI on their boards' agendas. Additionally, 40% of respondents plan to increase their organization's investment in AI due to advances in general AI. The survey reflects the immense potential and rapid adoption of AI technologies. However, the survey also points to a significant concern: inaccuracy.
Inaccuracy in LLMs often results in "hallucinations" or incorrect information due to limitations like shallow semantic understanding and varying data quality. Incorporating semantic vector search using MongoDB can help by enabling real-time querying of training data, ensuring that generated responses align closely with what the model has learned. Furthermore, adding metadata filtering extracted by Unstructured tools can refine accuracy by allowing the model to weigh the reliability of its data sources. Together, these methods can significantly minimize the risk of hallucinations and make LLMs more reliable.
This article addresses this challenge by providing a comprehensive guide on enhancing the precision of your LLM outputs using MongoDB's Vector Search and Unstructured Metadata extraction techniques. The main purpose of this tutorial is to equip you with the knowledge and tools needed to incorporate external source documents in your LLM, thereby enriching the model's responses with well-sourced and contextually accurate information. At the end of this tutorial, you can generate precise output from the OpenAI GPT-4 model to cite the source document, including the filename and page number. The entire notebook for this tutorial is available on Google Colab, but we will be going over sections of the tutorial together.

Why use MongoDB Vector Search?

MongoDB is a NoSQL database, which stands for "Not Only SQL," highlighting its flexibility in handling data that doesn't fit well in tabular structures like those in SQL databases. NoSQL databases are particularly well-suited for storing unstructured and semi-structured data, offering a more flexible schema, easier horizontal scaling, and the ability to handle large volumes of data. This makes them ideal for applications requiring quick development and the capacity to manage vast metadata arrays.
MongoDB's robust vector search capabilities and ability to seamlessly handle vector data and metadata make it an ideal platform for improving the precision of LLM outputs. It allows for multifaceted searches based on semantic similarity and various metadata attributes. This unique feature set distinguishes MongoDB from traditional developer data platforms and significantly enhances the accuracy and reliability of the results in language modeling tasks.

Why use Unstructured metadata?

The Unstructured open-source library provides components for ingesting and preprocessing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of unstructured revolve around streamlining and optimizing the data processing workflow for LLMs. The Unstructured modular bricks and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficiently transforming unstructured data into structured outputs.
Metadata is often referred to as "data about data." It provides contextual or descriptive information about the primary data, such as its source, format, and relevant characteristics. The metadata from the Unstructured tools tracks various details about elements extracted from documents, enabling users to filter and analyze these elements based on particular metadata of interest. The metadata fields include information about the source document and data connectors.
The concept of metadata is familiar, but its application in the context of unstructured data brings many opportunities. The Unstructured package tracks a variety of metadata at the element level. This metadata can be accessed with element.metadata and converted to a Python dictionary representation using element.metadata.to_dict().
In this article, we particularly focus on filename and page_number metadata to enhance the traceability and reliability of the LLM outputs. By doing so, we can cite the exact location of the PDF file that provides the answer to a user query. This becomes especially crucial when the LLM answers queries related to sensitive topics such as financial, legal, or medical questions.

Code walkthrough

Requirements

  1. Sign up for a MongoDB Atlas account and install the PyMongo library in the IDE of your choice or Colab.
  2. Install the Unstructured library in the IDE of your choice or Colab.
  3. Install the Sentence Transformer library for embedding in the IDE of your choice or Colab.
  4. Get the OpenAI API key. To do this, please ensure you have an OpenAI account.

Step-by-step process

  1. Extract the texts and metadata from source documents using Unstructured's partition_pdf.
  2. Prepare the data for storage and retrieval in MongoDB.
    • Vectorize the texts using the SentenceTransformer library.
    • Connect and upload records into MongoDB Atlas.
    • Query the index based on embedding similarity.
  3. Generate the LLM output using the OpenAI Model.
Step 1: Text and metadata extraction
Please make sure you have installed the required libraries to run the necessary code.
We'll delve into extracting data from a PDF document, specifically the seminal "Attention is All You Need" paper, using the partition_pdf function from the Unstructured library in Python. First, you'll need to import the function with from unstructured.partition.pdf import partition_pdf. Then, you can call partition_pdf and pass in the necessary parameters:
  • filename specifies the PDF file to process, which is "example-docs/Attention is All You Need.pdf."
  • strategy sets the extraction type, and for a more comprehensive scan, we use "hi_res."
  • Finally, infer_table_structured=True tells the function to also extract table metadata.
Properly set up, as you can see in our Colab file, the code looks like this:
By running this code, you'll populate the elements variable with all the extracted information from the PDF, ready for further analysis or manipulation. In the Colab’s code snippets, you can inspect the extracted texts and element metadata. To observe the sample outputs — i.e., the element type and text — please run the line below. Use a print statement, and please make sure the output you receive matches the one below.
Output:
You can also use Counter from Python Collection to count the number of element types identified in the document.
Finally, you can convert the element objects into Python dictionaries using convert_to_dict built-in function to selectively extract and modify the element metadata.
Step 2: Data preparation, storage, and retrieval
Step 2a: Vectorize the texts using the SentenceTransformer library.
We must include the extracted element metadata when storing and retrieving the texts from MongoDB Atlas to enable data retrieval with metadata and vector search.
First, we vectorize the texts to perform a similarity-based vector search. In this example, we use microsoft/mpnet-base from the Sentence Transformer library. This model has a 768 embedding size.
It is important to use a model with the same embedding size defined in MongoDB Atlas Index. Be sure to use the embedding size compatible with MongoDB Atlas indexes. You can define the index using the JSON syntax below:
Copy and paste the JSON index into your MongoDB collection so it can index the embedding field in the records. Please view this documentation on how to index vector embeddings for Vector Search.
Fig 1. build an index using knnVector type for the embedding field
Next, create the text embedding for each record before uploading them to MongoDB Atlas:
Step 2b: Connect and upload records into MongoDB Atlas
Before we can store our records on MongoDB, we will use the PyMongo library to establish a connection to the target MongoDB database and collection. Use this code snippet to connect and test the connection (see the MongoDB documentation on connecting to your cluster).
Once run, the output: “Pinged your deployment. You successfully connected to MongoDB!” will appear.
Next, we can upload the records using PyMongo's insert_many function.
To do this, we must first grab our MongoDB database connection string. Please make sure the database and collection names match with the ones in MongoDB Atlas.
Let’s preview the records in MongoDB Atlas:
Fig 2. preview the records in the MongoDB Atlas collection
Step 2c: Query the index based on embedding similarity
Now, we can retrieve the relevant records by computing the similarity score defined in the index vector search. When a user sends a query, we need to vectorize it using the same embedding model we used to store the data. Using the aggregate function, we can pass a pipeline that contains the information to perform a vector search.
Now that we have the records stored in MongoDB Atlas, we can search the relevant texts using the vector search. To do so, we need to vectorize the query using the same embedding model and use the aggregate function to retrieve the records from the index.
In the pipeline, we will specify the following:
  • index: The name of the vector search index in the collection
  • vector: The vectorized query from the user
  • k: Number of the most similar records we want to extract from the collection
  • score: The similarity score generated by MongoDB Atlas
The above pipeline will return the top five records closest to the user’s query embedding. We can define k to retrieve the top-k records in MongoDB Atlas. Please note that the results contain the metadata, text, and score. We can use this information to generate the LLM output in the following step.
Here’s one example of the top five nearest neighbors from the query above:
Step 3: Generate the LLM output with source document citation
We can generate the output using the OpenAI GPT-4 model. We will use the ChatCompletion function from OpenAI API for this final step. ChatCompletion API processes a list of messages to generate a model-driven response. Designed for multi-turn conversations, they're equally adept at single-turn tasks. The primary input is the 'messages' parameter, comprising an array of message objects with designated roles ("system", "user", or "assistant") and content. Usually initiated with a system message to guide the assistant's behavior, conversations can vary in length with alternating user and assistant messages. While the system message is optional, its absence may default the model to a generic helpful assistant behavior.
You’ll need an OpenAI API key to run the inferences. Before attempting this step, please ensure you have an OpenAI account. Assuming you store your OpenAI API key in your environment variable, you can import it using the os.getenv function:
Next, having a compelling prompt is crucial for generating a satisfactory result. Here’s the prompt to generate the output with specific reference where the information comes from — i.e., filename and page number.
In this Python script, a request is made to the OpenAI GPT-4 model through the ChatCompletion.create method to process a conversation. The conversation is structured with predefined roles and messages. It is instructed to generate a response based on the provided context and user query, summarizing the answer while citing the page number and file name. The temperature parameter set to 0.2 influences the randomness of the output, favoring more deterministic responses.

Evaluating the LLM output quality with source document

One of the key features of leveraging unstructured metadata in conjunction with MongoDB's Vector Search is the ability to provide highly accurate and traceable outputs.
You can insert this query into the ChatCompletion API as the “user” role and the context from MongoDB retrieval results as the “assistant” role. To enforce the model responds with the filename and page number, you can provide the instruction in the “system” role.
Source document: Fig 3. The relevant texts in the source document to answer user query
LLM Output: Fig 4. The answer from GPT-4 model refers to the filename and page number
The highly specific output cites information from the source document, "Attention is All You Need.pdf," stored in the 'example-docs' directory. The answers are referenced with exact page numbers, making it easy for anyone to verify the information. This level of detail is crucial when answering queries related to research, legal, or medical questions, and it significantly enhances the trustworthiness and reliability of the LLM outputs.

Conclusion

This article presents a method to enhance LLM precision using MongoDB's Vector Search and Unstructured Metadata extraction techniques. These approaches, facilitating real-time querying and metadata filtering, substantially mitigate the risk of incorrect information generation. MongoDB's capabilities, especially in handling vector data and facilitating multifaceted searches, alongside the Unstructured library's data processing efficiency, emerge as robust solutions. These techniques not only improve accuracy but also enhance the traceability and reliability of LLM outputs, especially when dealing with sensitive topics, equipping users with the necessary tools to generate more precise and contextually accurate outputs from LLMs.
Ready to get started? Request your Unstructured API key today and unlock the power of Unstructured API and Connectors. Join the Unstructured community group to connect with other users, ask questions, share your experiences, and get the latest updates. We can’t wait to see what you’ll build.

Facebook Icontwitter iconlinkedin icon
Rate this tutorial
star-empty
star-empty
star-empty
star-empty
star-empty
Related
Tutorial

Create a Multi-Cloud Cluster with MongoDB Atlas


Sep 23, 2022 | 11 min read
Article

Query Analytics Part 1: Know Your Queries


Jan 05, 2024 | 6 min read
Article

Auto Pausing Inactive Clusters


Nov 03, 2022 | 10 min read
Article

A Free REST API for Johns Hopkins University COVID-19 dataset


Nov 16, 2023 | 5 min read
Table of Contents