
Stream Data Into MongoDB Atlas Using AWS Glue

Babu Srinivasan • 6 min read • Published Apr 16, 2024 • Updated Apr 16, 2024
In this tutorial, you'll see how AWS Glue, Amazon Kinesis, and MongoDB Atlas integrate seamlessly to create a streamlined data streaming solution with extract, transform, and load (ETL) capabilities. The accompanying repository also uses the AWS CDK to automate deployment across diverse environments, making the entire process more efficient.
To follow along with this tutorial, you should have intermediate proficiency with AWS and MongoDB services.

Architecture diagram

architecture diagram
In the architecture shown above, various streams of data, such as orders and customers, are ingested through Amazon Kinesis data streams. AWS Glue Studio is then used to enrich the data. The enriched data is backed up to an S3 bucket, while the consolidated stream is stored in MongoDB Atlas and exposed as data APIs to downstream systems.

Implementation steps

Prerequisites

  • AWS CLI installed and configured
  • NVM/NPM installed and configured
  • AWS CDK installed and configured
  • Python3 - yum install -y python3
  • Python Pip - yum install -y python-pip
  • Virtualenv - pip3 install virtualenv
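Before you continue, you can quickly confirm that the tooling is in place. These version checks are only a sanity check; the yum and pip commands above assume an Amazon Linux-style host, so adjust them to your own package manager if needed.
aws --version
node --version
npm --version
cdk --version
python3 --version
pip3 --version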
This repo was developed with us-east-1 as the default region. Please update the scripts for your specific region (if required). The repo creates a MongoDB Atlas project and a free-tier database cluster automatically, so there is no need to create a database cluster manually. The repo was created for demo purposes, and IP access is not restricted (0.0.0.0/0). Ensure you strengthen security by updating the relevant IP address (if required).

Setting up the environment

Get the application code

git clone https://github.com/mongodb-partners/Stream_Data_into_MongoDB_AWS_Glue
cd kinesis-glue-aws-cdk

Prepare the dev environment to run AWS CDK

a. Set the AWS environment variables: AWS Access Key ID, AWS Secret Access Key, and, optionally, the AWS Session Token.
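For example, the credentials can be exported as shell environment variables before running any CDK commands. The values below are placeholders for your own keys, and the region matches the repo's default of us-east-1:
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_SESSION_TOKEN="<your-session-token>"   # optional, for temporary credentials
export AWS_DEFAULT_REGION="us-east-1"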
b. We will use the CDK to make our deployments easier.
You should have npm pre-installed. If you don't have the CDK installed, run:
npm install -g aws-cdk
Make sure you're in the root directory, then create and activate a virtual environment and install the dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
For a development setup, use requirements-dev.txt instead.
c. Bootstrap the application with the AWS account.
cdk bootstrap
d. Set the ORG_ID as an environment variable in the .env file. All other parameters are set to defaults in global_args.py in the kinesis-glue-aws-cdk folder. The MONGODB_USER and MONGODB_PASSWORD parameters are set directly in mongodb_atlas_stack.py and glue_job_stack.py.
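A minimal .env file for this step might look like the following; the value is a placeholder for your own MongoDB Atlas organization ID:
ORG_ID=<your-atlas-organization-id>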
The below screenshot shows the location to get the Organization ID from MongoDB Atlas.
Organization ID in MongoDB Atlas
Please note that using "0.0.0.0/0" as the IP_ADDRESS allows access to the database from anywhere. This might be suitable for development or testing purposes, but it is highly discouraged for production environments because it exposes the database to potential attacks from unauthorized sources.
e. List the CDK stacks:
cdk ls
You should see the available stacks in the output:
aws-etl-mongo-atlas-stack
aws-etl-kinesis-stream-stack
aws-etl-bucket-stack
aws-etl-glue-job-stack

Deploying the application

Let’s walk through each of the stacks:
Stack for MongoDB Atlas: aws-etl-mongo-atlas-stack
This stack will create a MongoDB Atlas project and a free-tier database cluster with user and network permissions (open).

Prerequisites

a. Create an AWS role with its trust relationship as a CloudFormation service.
Use the template to create a new CloudFormation stack to create the execution role.
Creating a new CloudFormation stack
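If you prefer the AWS CLI over the template, the execution role can be sketched roughly as follows. The role name is a placeholder, the service principal shown is the one CloudFormation extensions typically assume, and the permissions policy you attach should match the template, which remains the authoritative source:
aws iam create-role \
  --role-name mongodb-atlas-cfn-execution-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "resources.cloudformation.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'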
b. The following public extensions in the CloudFormation Registry should be activated with the role created in the earlier step. After logging into the AWS console, use this link to register extensions on CloudFormation.
Pass the ARN of the role from the earlier step as input to activate the MongoDB resource in Public Extension.
MongoDB Resource Activation in Public Extension:
Activation of Registry Public Extensions
The above screenshot shows the activation of the Registry Public Extensions for the MongoDB::Atlas::Cluster.
Alternatively, you can activate the above public extensions through the AWS CLI: one command lists the MongoDB public extensions (note down the ARNs of the four public extensions), and another activates each of the four public extensions mentioned in the previous steps, as sketched below.
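As a rough sketch, the AWS CLI equivalents look like this; the exact extension ARNs come from the output of the list command, and the role ARN is the one created in step a:
# List the public MongoDB Atlas extensions and note down their ARNs
aws cloudformation list-types \
  --visibility PUBLIC \
  --filters Category=THIRD_PARTY,TypeNamePrefix=MongoDB::Atlas

# Activate one extension; repeat for each of the four extensions
aws cloudformation activate-type \
  --type RESOURCE \
  --public-type-arn <public-extension-arn> \
  --execution-role-arn <execution-role-arn>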
c. Log in to the MongoDB Atlas console and note down the organization ID. Ensure the organization ID is updated in global_args.py.
Cluster organization settings
The above screenshot shows the MongoDB Cluster Organization settings.
d. Create an API Key in an organization with Organization Owner access. Note down the API credentials.
Edit API key
The above screenshot shows the access managers for the API Key created in the MongoDB Atlas cluster.
Restrict access to the organization API with the API access list. We provided open access (0.0.0.0/1) for demo purposes only. We strongly discourage using this in any production environment or equivalent.
e. A profile should be created in the AWS Secrets Manager containing the MongoDB Atlas Programmatic API Key.
Use the template to create a new CloudFormation stack for the default profile that all resources will attempt to use unless a different override is specified.

Profile secret stack

AWS CloudFormation stack
The above screenshot shows the parameters for the AWS CloudFormation stack.
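If you would rather create the profile secret from the CLI, a sketch looks roughly like this. The secret name convention and JSON keys are assumptions based on the MongoDB Atlas CloudFormation resources; the template above remains the authoritative source:
aws secretsmanager create-secret \
  --name "cfn/atlas/profile/default" \
  --secret-string '{"PublicKey": "<atlas-public-key>", "PrivateKey": "<atlas-private-key>"}'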
Initiate the deployment with the following command:
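Assuming the stack name shown by cdk ls, that command is:
cdk deploy aws-etl-mongo-atlas-stack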
After successfully deploying the stack, validate the Outputs section of the stack and MongoDB Atlas cluster. You will find the stdUrl and stdSrvUrl for the connection string.

Stack:

Output of CloudFormation stack
The above screenshot shows the output of the CloudFormation stack.

MongoDB Atlas cluster:

Creation of a MongoDB Atlas cluster
The above screenshot shows the successful creation of the MongoDB Atlas cluster.

Stack for creating the Kinesis stream: aws-etl-kinesis-stream-stack

This stack will create two Kinesis data streams. Each producer ingests a continuous stream of events for different customers and their orders.
Initiate the deployment with the following command:
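Assuming the stack name above, that command is:
cdk deploy aws-etl-kinesis-stream-stack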
After successfully deploying the stack, check the Outputs section of the stack. You will find the CustomerOrderKinesisDataStream Kinesis data stream.

Stack:

Output of the CloudFormation stack
The above screenshot shows the output of the CloudFormation stack for Kinesis streams.

Amazon Kinesis data stream:

Kinesis stream
The above screenshot shows the Kinesis stream created.
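You can also verify the streams from the CLI; the stream name placeholder below should be replaced with the name shown in the stack outputs:
aws kinesis list-streams
aws kinesis describe-stream-summary --stream-name <kinesis-stream-name>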

Stack for creating the S3 bucket: aws-etl-bucket-stack

This stack will create an S3 bucket that will be used by AWS Glue jobs to persist the incoming customer and order details.
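Deploying it follows the same pattern as the previous stacks:
cdk deploy aws-etl-bucket-stack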
After successfully deploying the stack, check the Outputs section of the stack. You will find the S3SourceBucket resource.

Stack:

Output of the CloudFormation stack
The above screenshot shows the Output of the CloudFormation stack.
AWS S3 Bucket:
S3 buckets created
The above screenshot shows the S3 buckets created.

Stack for creating the AWS Glue job and parameters: aws-etl-glue-job-stack

This stack will create two AWS Glue jobs: one for the customer stream and another for the order stream. The code is located at glue_job_stack/glue_job_scripts/customer_kinesis_streams_s3.py and glue_job_stack/glue_job_scripts/order_kinesis_streams_s3.py.
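Deploying this stack follows the same pattern. The job names below are placeholders; if you want to start the jobs from the CLI instead of AWS Glue Studio, use the names shown in the stack outputs or the Glue console:
cdk deploy aws-etl-glue-job-stack

aws glue start-job-run --job-name <customer-glue-job-name>
aws glue start-job-run --job-name <order-glue-job-name>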

Stack:

Output of CloudFormation stack
The above screenshot shows the output of the CloudFormation stack.

AWS Glue job:

AWS Glue Studio
The above screenshot shows the job setup in AWS Glue Studio.

Note

The MongoDB URL of the newly created cluster and other parameters will be passed to the AWS Glue job programmatically. Update these parameters to your values (if required).
"Spark UI logs path" and "Temporary path" details will be maintained in the same bucket location with folder name /sparkHistoryLogs and /temporary.

Spark UI logs path:

s3://<S3_BUCKET_NAME>/sparkHistoryLogs

Temporary path:

s3://<S3_BUCKET_NAME>/temporary/

Screenshot of the AWS Glue parameters

AWS Glue parameters
The above screenshot shows the AWS Glue parameters.
Once all the stacks are ready, start the producers for the customer and order streams to ingest data into the Kinesis data streams, and start the Glue job for both. The producer code is located at producer/customer.py and producer/order.py.
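For example, each producer can be run in its own terminal, assuming the virtual environment from the setup step is still active and your AWS credentials are exported:
python3 producer/customer.py
python3 producer/order.py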

Sample record in MongoDB Atlas:

Clean up

Use cdk destroy to clean up all the AWS CDK resources.
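Since the app defines multiple stacks, you can remove them all in one pass:
cdk destroy --all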

Troubleshooting

Refer to GitHub to resolve some common issues encountered when using AWS CloudFormation/CDK with MongoDB Atlas Resources.

Useful commands

  • cdk ls - lists all stacks in the app
  • cdk synth - emits the synthesized CloudFormation template
  • cdk deploy - deploys this stack to your default AWS account/region
  • cdk diff - compares the deployed stack with the current state
  • cdk docs - opens CDK documentation
