Build an Image Search Engine With Python & MongoDB

Mark Smith • 8 min read • Published Jan 31, 2024 • Updated Jan 31, 2024

I can still remember when search started to work in Google Photos — the platform where I store all of the photos I take on my cellphone. It seemed magical to me that some kind of machine learning technique could allow me to describe an image in my vast collection of photos and have the platform return that image to me, along with any similar images.
One of the techniques used for this is image classification, where a neural network is used to identify objects and even people in a scene, and the image is tagged with this data. Another technique — which is, if anything, more powerful — is the ability to generate a vector embedding for the image using an embedding model that works with both text and images.
Using a multi-modal embedding model like this allows you to generate a vector that can be stored and efficiently indexed in MongoDB Atlas. Then, when you wish to retrieve an image, the same embedding model can encode your description into a vector and search for images similar to it. It's almost like magic.

Multi-modal embedding models

A multi-modal embedding model is a machine learning model that encodes information from various data types, like text and images, into a common vector space. It helps link different types of data for tasks such as text-to-image matching or translating between modalities.
The benefit of this is that text and images can be indexed in the same way, allowing images to be searched for by providing either text or another image. You could even use an image to search for a piece of text, but I can't think of a reason you'd want to do that. The downside of multi-modal models is that they are very complex to produce and thus aren't quite as "clever" as some of the single-modality models currently being produced.
In this tutorial, I'll show you how to use the clip-ViT-L-14 model, which encodes both text and images into the same vector space. Because we're using Python, I'll install the model directly into my Python environment to run locally. In production, you probably wouldn't want to have your embedding model running directly inside your web application because it too tightly couples your model, which requires a powerful GPU, to the rest of your application, which will usually be mostly IO-bound. In that case, you can host an appropriate model on Hugging Face or a similar platform.

Describing the search engine

This example search engine is going to be very much a proof of concept. All the code is available in a Jupyter Notebook, and I'm going to store all my images locally on disk. In production, you'd want to use an object storage service like Amazon's S3.
In the same way, in production, you'd either want to host the model using a specialized service or some dedicated setup on the appropriate hardware, whereas I'm going to download and run the model locally.
If you've got an older machine, it may take a while to generate the vectors, but I found that a four-year-old Intel MacBook Pro could generate about 1,000 embeddings in 30 minutes, while my MacBook Air M2 can do the same in about five! Either way, maybe go away and make yourself a cup of coffee when the notebook gets to that step.
The search engine will use the same vector model to encode queries (which are text) into the same vector space that was used to encode image data, which means that a phrase describing an image should appear in a similar location to the image’s location in the vector space. This is the magic of multi-modal vector models!

Getting ready to run the notebook

All of the code described in this tutorial is hosted on GitHub.
The first thing you'll want to do is create a virtual environment using your favorite technique. I tend to use venv, which comes with Python.
Once you've done that, install dependencies with:
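Something along these lines should do it, assuming the repository includes a requirements.txt listing the packages the notebook needs.

```bash
# Install the notebook's dependencies (assumes a requirements.txt in the repository root)
pip install -r requirements.txt
```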
Next, you'll need to set an environment variable, MONGODB_URI, containing the connection string for your MongoDB cluster.
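On macOS or Linux, that might look like the following; the value shown is just a placeholder for your own cluster's connection string.

```bash
# Placeholder — substitute the connection string from your own Atlas cluster
export MONGODB_URI="mongodb+srv://<username>:<password>@<cluster>.mongodb.net/"
```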
One more thing you'll need is an "images" directory, containing some images to index! I downloaded Kaggle's ImageNet 1000 (mini) dataset, which contains lots of images and comes in at around 4GB, but you can use a different dataset if you prefer. The notebook searches the "images" directory recursively, so you don't need to have everything at the top level.
Then, you can fire up the notebook with:
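If Jupyter was installed along with the other dependencies, the standard command does the job.

```bash
# Launches the notebook interface in your browser; open the tutorial's .ipynb file from there
jupyter notebook
```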

Understanding the code

If you've set up the notebook as described above, you should be able to execute it and follow the explanations in the notebook. In this tutorial, I'm going to highlight the most important code, but I'm not going to reproduce it all here, as I worked hard to make the notebook understandable on its own.

Setting up the collection

First, let's configure a collection with an appropriate vector search index. In Atlas, if you connect to a cluster, you can configure vector search indexes in the Atlas Search tab, but I prefer to configure indexes in my code to keep everything self-contained.
The following code can be run many times but will only create the collection and associated search index on the first run. This is helpful if you want to run the notebook several times!
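Here's a sketch of that setup. The database and collection names ("image_search" and "images") and the index name ("default") are my own placeholders rather than necessarily the ones used in the notebook, and it assumes PyMongo 4.5 or later, which provides create_search_index.

```python
import os

from pymongo import MongoClient
from pymongo.operations import SearchIndexModel

client = MongoClient(os.environ["MONGODB_URI"])
db = client["image_search"]  # placeholder database name

if "images" not in db.list_collection_names():
    # First run: create the collection and its vector search index
    collection = db.create_collection("images")  # placeholder collection name
    collection.create_search_index(
        SearchIndexModel(
            name="default",  # placeholder index name
            definition={
                "mappings": {
                    "dynamic": True,
                    "fields": {
                        "embedding": {
                            "type": "knnVector",
                            "dimensions": 768,
                            "similarity": "cosine",
                        }
                    },
                }
            },
        )
    )
else:
    # Subsequent runs: the collection and index already exist
    collection = db["images"]
```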
The most important part of the code above is the configuration being passed to create_search_index:
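Pulled out on its own, that definition looks like this.

```python
{
    "mappings": {
        "dynamic": True,
        "fields": {
            "embedding": {
                "type": "knnVector",
                "dimensions": 768,
                "similarity": "cosine",
            }
        },
    }
}
```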
This specifies that the index will index all fields in the document (because "dynamic" is set to "true") and that the "embedding" field should be indexed as a vector embedding, using cosine similarity. Currently, "knnVector" is the only kind supported by Atlas. The dimension of the vector is set to 768 because that is the number of vector dimensions used by the CLIP model.

Loading the CLIP model

The following line of code may not look like much, but the first time you execute it, it will download the clip-ViT-L-14 model, which is around 2GB:
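It looks something like this; sentence-transformers caches the download, so later runs load the model from disk.

```python
from sentence_transformers import SentenceTransformer

# Downloads clip-ViT-L-14 (around 2GB) on first use and caches it for subsequent runs
model = SentenceTransformer("clip-ViT-L-14")
```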

Generating and storing a vector embedding

Given a path to an image file, an embedding for that image can be generated with the following code:
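Roughly like this, where path is assumed to hold the location of an image file on disk.

```python
from PIL import Image

# Load the image with Pillow and encode it into a 768-dimensional vector
embedding = model.encode(Image.open(path))
```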
In this line of code, model is the SentenceTransformer I created above, and Image comes from the Pillow library and is used to load the image data.
With the embedding vector, a new document can be created with the code below:
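Something like the following; exactly which field holds the path is my assumption, but the next paragraph describes what's stored.

```python
collection.insert_one(
    {
        "_id": str(path),  # the image path doubles as a unique identifier (assumed field choice)
        "embedding": embedding.tolist(),  # convert the NumPy array to a plain list MongoDB can store
    }
)
```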
I'm only storing the path to the image (as a unique identifier) and the embedding vector. In a real-world application, I'd store any image metadata my application required and probably a URL to an S3 object containing the image data itself.
Note: Remember that vector queries can be combined with any other query technique you'd normally use in MongoDB! That's the huge advantage you get using Atlas Vector Search — it's part of MongoDB Atlas, so you can query and transform your data any way you want and even combine it with the power of Atlas Search for free text queries.
The Jupyter Notebook loads images in a loop — by default, it loads 10 images — but that's not nearly enough to see the benefits of an image search engine, so you'll probably want to change NUMBER_OF_IMAGES_TO_LOAD to 1000 and run the image load code block again.

Searching for images

Once you've indexed a good number of images, it's time to test how well it works. I've defined two functions that can be used for this. The first function, display_images, takes a list of documents and displays the associated images in a grid. I'm not including the code here because it's a utility function.
The second function, image_search, takes a text phrase, encodes it as a vector embedding, and then uses MongoDB's $vectorSearch aggregation stage to look up images that are closest to that vector location, limiting the result to the nine closest documents:
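A sketch of that function: the "embedding" path, the nine-document limit, and the score projection follow the description here, while the index name ("default") and the numCandidates value are my own assumptions.

```python
def image_search(search_phrase):
    """Encode a text phrase and return the nine closest image documents."""
    query_embedding = model.encode(search_phrase)
    return list(
        collection.aggregate(
            [
                {
                    "$vectorSearch": {
                        "index": "default",  # assumed index name
                        "path": "embedding",
                        "queryVector": query_embedding.tolist(),
                        "numCandidates": 100,  # assumed size of the candidate pool
                        "limit": 9,
                    }
                },
                {
                    "$project": {
                        "_id": 1,
                        "score": {"$meta": "vectorSearchScore"},
                    }
                },
            ]
        )
    )
```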
The $project stage adds a "score" field that shows how similar each document was to the original query vector. 1.0 means "exactly the same," whereas 0.0 would mean that the returned image was totally dissimilar.
With the display_images function and the image_search function, I can search for images of "sharks in the water":
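That's just the two functions chained together.

```python
# Encode the phrase, fetch the nine nearest images, and display them in a grid
display_images(image_search("sharks in the water"))
```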
On my laptop, I get the following grid of nine images, which is pretty good!
A screenshot, showing a grid containing 9 photos of sharks
When I first tried the above search out, I didn't have enough images loaded, so the query above included a photo of a corgi standing on gray tiles. That wasn't a particularly close match! After I loaded some more images to fix the results of the shark query, I could still find the corgi image by searching for "corgi on snow" — it's the second image below. Notice that none of the images exactly match the query, but a couple are definitely corgis, and several are standing in the snow.
A grid of photos. Most photos contain either a dog or snow, or both. One of the dogs is definitely a corgi.
One of the things I really love about vector search is that it's "semantic," so I can search for something quite nebulous, like "childhood."
A grid of photographs of children or toys or things like colorful erasers.
My favorite result was when I searched for "ennui" (a feeling of listlessness and dissatisfaction arising from a lack of occupation or excitement), which returned photos of bored animals (and a teenager)!
Photographs of animals looking bored and slightly sad, except for one photo which contains a young man looking bored and slightly sad.

Next steps

I hope you found this tutorial as fun to read as I did to write!
If you wanted to run this model in production, you would probably want to use a hosting service like Hugging Face, but I really like the ability to install and try out a model on my laptop with a single line of code. Once the embedding generation, which is processor-intensive and thus a blocking task, is delegated to an API call, it would be easier to build a FastAPI wrapper around the functionality in this code. Then, you could build a powerful web interface around it and deploy your own customized image search engine.
This example also doesn't demonstrate much of MongoDB's query capabilities. The power of vector search with MongoDB Atlas is the ability to combine it with all the power of MongoDB's aggregation framework to query and aggregate your data. If I have some time, I may extend this example to filter by criteria like the date of each photo and maybe allow photos to be tagged manually, or to be automatically grouped into albums.
