
Orchestrating MongoDB & BigQuery for ML Excellence with PyMongoArrow and BigQuery Pandas Libraries

Venkatesh Shanbhag, Maruti C • 4 min read • Published Feb 07, 2024 • Updated Feb 08, 2024
Pandas • Google Cloud • AI • Python • MongoDB
In today's data-driven world, the ability to analyze and efficiently move data across different platforms is crucial. MongoDB Atlas and Google BigQuery are two powerful platforms frequently used for managing and analyzing data. While they excel in their respective domains, connecting and transferring data between them seamlessly can pose challenges. However, with the right tools and techniques, this integration becomes not only possible but also streamlined.
One effective way to establish a smooth pipeline between MongoDB Atlas and BigQuery is by leveraging PyMongoArrow and pandas-gbq, two powerful Python libraries that facilitate data transfer and manipulation. PyMongoArrow acts as a bridge between MongoDB and Apache Arrow, a columnar in-memory analytics layer, enabling efficient data conversion. pandas-gbq, in turn, is a Python client library for Google BigQuery that allows easy interaction with BigQuery datasets.
Image 1: Architecture diagram for MongoDB interaction with BigQuery and VertexAI using python libraries

Advantages of a Python-based solution

  1. Easily move a wide range of data types between MongoDB and BigQuery.
  2. Easily join multiple data sources, such as cloud storage, Google-managed databases, and MongoDB Atlas, and transform the data using pandas DataFrames.
  3. Use your favorite notebook to build the solution, including the new preview notebook available in BigQuery Studio.
  4. Perform exploratory data analysis on data from both Google BigQuery and MongoDB Atlas without physically moving it between the platforms. This simplifies the work required of data engineers to move the data and gives data scientists a faster path to building machine learning (ML) models.
Let's discuss each of the implementation advantages with examples.

ETL data from MongoDB to BigQuery

Let’s consider a sample shipwreck dataset available on MongoDB Atlas for this use case.
Use the commands below to install the required libraries in the notebook environment of your choice. For an easy and scalable setup, use BigQuery Jupyter notebooks or managed Vertex AI Workbench notebooks.
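A minimal install cell might look like the following; run it in a notebook, or drop the leading "!" in a terminal. The exact package set is an assumption based on the libraries used in this tutorial (pymongoarrow pulls in pymongo as a dependency):

```python
# Notebook cell: install the client libraries used in this tutorial.
!pip install pymongoarrow pandas-gbq
```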
Image 2: Screenshot of a Jupyter notebook for the BigQuery DataFrames implementation.
First, establish a connection to your MongoDB Atlas cluster using PyMongoArrow. This involves configuring authentication and selecting the database and collection from which you want to transfer data. Follow the MongoDB Atlas documentation to set up your cluster, network access, and authentication, and load a sample dataset into your Atlas cluster. Get the Atlas connection string and replace the URI string below with your own. The script below is also available in the GitHub repository, along with setup steps.
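A sketch of that script, assuming the shipwrecks collection from the sample_geospatial Atlas sample dataset and a placeholder connection string:

```python
from pymongo import MongoClient
import pymongoarrow.monkey

# Add PyMongoArrow methods (find_pandas_all, etc.) to pymongo collections.
pymongoarrow.monkey.patch_all()

# Replace the URI with your own Atlas connection string.
client = MongoClient("mongodb+srv://<user>:<password>@<cluster>.mongodb.net/")

# The shipwrecks collection ships with the Atlas sample_geospatial dataset.
collection = client["sample_geospatial"]["shipwrecks"]

# Read the whole collection into a pandas DataFrame via Arrow.
df = collection.find_pandas_all({})
print(df.shape)
```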
Next, transform the data into the required format. For example, remove unsupported data types such as the MongoDB ObjectId, or convert MongoDB objects to JSON, before writing to BigQuery. Refer to the documentation to learn more about the data types supported by pandas-gbq and PyMongoArrow.
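As a minimal cleanup sketch, you might drop the ObjectId column before the load; the commented-out JSON-encoding lines are illustrative only and assume an embedded or array column named coordinates:

```python
# BigQuery has no ObjectId type; the simplest fix is to drop the column
# (alternatively, map it to a hex string before loading).
df = df.drop(columns=["_id"])

# Illustrative only: JSON-encode an embedded/array column if one exists.
# import json
# df["coordinates"] = df["coordinates"].apply(json.dumps)
```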
Once you have retrieved data from MongoDB Atlas and converted it into a suitable format using PyMongoArrow, you can transfer it to BigQuery using either the pandas-gbq or the google-cloud-bigquery library. This article uses pandas-gbq; refer to the documentation for details on the differences between the two libraries. Ensure you have a BigQuery dataset to load the MongoDB data into. You can create a new dataset or use an existing one.
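A sketch of the load step with pandas-gbq; the project ID, dataset, and table names below are placeholders to replace with your own:

```python
import pandas_gbq

# Load the DataFrame into BigQuery, creating or replacing the table.
pandas_gbq.to_gbq(
    df,
    destination_table="mongodb_demo.shipwrecks",  # dataset.table (placeholder)
    project_id="your-gcp-project-id",             # placeholder
    if_exists="replace",                          # or "append"
)
```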
As you build your pipeline, optimizing the data transfer between MongoDB Atlas and BigQuery is essential for performance. A few points to consider:
  1. Batch DataFrames into chunks, especially when dealing with large datasets, to prevent memory issues (see the sketch after this list).
  2. Handle schema mapping and data type conversions properly to ensure compatibility between the source and destination databases.
  3. With the right tools, such as Google Colab or Vertex AI Workbench, this pipeline can become a cornerstone of your data ecosystem, facilitating smooth and reliable data movement between MongoDB Atlas and Google BigQuery.
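As an illustration of the first point, pandas-gbq can split a large upload into smaller batches via its chunksize parameter; the names and batch size below are placeholders:

```python
# Upload in batches of 10,000 rows to keep memory use and API payloads small.
pandas_gbq.to_gbq(
    df,
    destination_table="mongodb_demo.shipwrecks",
    project_id="your-gcp-project-id",
    if_exists="replace",
    chunksize=10_000,
)
```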

Introduction to Google BigQuery DataFrames (bigframes)

BigQuery DataFrames (bigframes) is a Python API that provides a pandas-compatible DataFrame and machine learning capabilities powered by the BigQuery engine, giving you a familiar pandas interface for data manipulation and analysis. Once the data from MongoDB is written into BigQuery, BigQuery DataFrames unlock a user-friendly way to analyze petabytes of data with ease. A pandas DataFrame can be read directly into a BigQuery DataFrame using the bigframes.pandas module. Install the bigframes library to use BigQuery DataFrames.
Before reading pandas DataFrames into BigQuery DataFrames, rename the columns to conform to Google's schema guidelines. (Note that at the time of publication, this feature may not be generally available.)
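A minimal sketch, assuming bigframes is installed (pip install bigframes) and that the column names need only a simple character cleanup to meet the schema guidelines:

```python
import bigframes.pandas as bpd

# BigQuery column names may only contain letters, digits, and underscores.
df.columns = [c.replace(" ", "_").replace(".", "_") for c in df.columns]

# Materialize the pandas DataFrame as a BigQuery DataFrame; subsequent
# operations run in the BigQuery engine instead of local memory.
bdf = bpd.read_pandas(df)
print(bdf.shape)
```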
For more information on using Google Cloud BigQuery DataFrames, visit the Google Cloud documentation.

Conclusion

Creating a robust pipeline between MongoDB Atlas and BigQuery using PyMongoArrow and pandas-gbq opens up a world of possibilities for efficient data movement and analysis. This integration allows for the seamless transfer of data, enabling organizations to leverage the strengths of both platforms for comprehensive data analytics and decision-making.
