
Stream Data Into MongoDB Atlas Using AWS Glue

Babu Srinivasan • 6 min read • Published Apr 16, 2024 • Updated Apr 16, 2024
In this tutorial, you'll see how AWS Glue, Amazon Kinesis, and MongoDB Atlas integrate seamlessly to create a streamlined data streaming solution with extract, transform, and load (ETL) capabilities. The accompanying repository also uses the AWS CDK to automate deployment across diverse environments, making the entire process more efficient.
To follow along with this tutorial, you should have intermediate proficiency with AWS and MongoDB services.

Architecture diagram

architecture diagram
In the architecture shown above, various streams of data, such as orders and customers, are ingested through Amazon Kinesis data streams. AWS Glue Studio is then used to enrich the data. The enriched data is backed up to an S3 bucket, while the consolidated stream is stored in MongoDB Atlas and exposed as data APIs to downstream systems.

Implementation steps

Prerequisites

  • AWS CLI installed and configured
  • NVM/NPM installed and configured
  • AWS CDK installed and configured
  • Python3 - yum install -y python3
  • Python Pip - yum install -y python-pip
  • Virtualenv - pip3 install virtualenv
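Before you continue, you can quickly confirm that the tooling is in place. These version checks are only a sanity check; the yum and pip commands above assume an Amazon Linux-style host, so adjust them to your own package manager if needed.
aws --version
node --version
npm --version
cdk --version
python3 --version
pip3 --version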
This repo was developed with us-east-1 as the default region. Please update the scripts for your specific region (if required). The repo creates a MongoDB Atlas project and a free-tier database cluster automatically, so there is no need to create a database cluster manually. The repo was created for demo purposes, and IP access is not restricted (0.0.0.0/0). Ensure you strengthen security by updating the relevant IP address (if required).

Setting up the environment

Get the application code

git clone https://github.com/mongodb-partners/Stream_Data_into_MongoDB_AWS_Glue
cd kinesis-glue-aws-cdk

Prepare the dev environment to run AWS CDK

a. Set the AWS environment variables: AWS Access Key ID, AWS Secret Access Key, and, optionally, the AWS Session Token.
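For example, the credentials can be exported as shell environment variables before running any CDK commands. The values below are placeholders for your own keys, and the region matches the repo's default of us-east-1:
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_SESSION_TOKEN="<your-session-token>"   # optional, for temporary credentials
export AWS_DEFAULT_REGION="us-east-1"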
b. We will use the CDK to make our deployments easier.
You should have npm pre-installed. If you don't have the CDK installed, run:
npm install -g aws-cdk
Make sure you're in the root directory, then create and activate a virtual environment and install the dependencies:
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt
For a development setup, use requirements-dev.txt instead.
c. Bootstrap the application with the AWS account.
cdk bootstrap
d. Set the ORG_ID as an environment variable in the .env file. All other parameters are set to defaults in global_args.py in the kinesis-glue-aws-cdk folder. The MONGODB_USER and MONGODB_PASSWORD parameters are set directly in mongodb_atlas_stack.py and glue_job_stack.py.
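A minimal .env file for this step might look like the following; the value is a placeholder for your own MongoDB Atlas organization ID:
ORG_ID=<your-atlas-organization-id>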
The below screenshot shows the location to get the Organization ID from MongoDB Atlas.
Organization ID in MongoDB Atlas
Please note that using "0.0.0.0/0" as the IP_ADDRESS allows access to the database from anywhere. This might be suitable for development or testing purposes, but it is highly discouraged for production environments because it exposes the database to potential attacks from unauthorized sources.
e. List the CDK stacks:
cdk ls
You should see the available stacks in the output:
aws-etl-mongo-atlas-stack
aws-etl-kinesis-stream-stack
aws-etl-bucket-stack
aws-etl-glue-job-stack

Deploying the application

Let’s walk through each of the stacks:
Stack for MongoDB Atlas: aws-etl-mongo-atlas-stack
This stack will create a MongoDB Atlas project and a free-tier database cluster with user and network permissions (open).

Prerequisites

a. Create an AWS role with its trust relationship as a CloudFormation service.
Use the template to create a new CloudFormation stack to create the execution role.
Creating a new CloudFormation stack
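If you prefer the AWS CLI over the template, the execution role can be sketched roughly as follows. The role name is a placeholder, the service principal shown is the one CloudFormation extensions typically assume, and the permissions policy you attach should match the template, which remains the authoritative source:
aws iam create-role \
  --role-name mongodb-atlas-cfn-execution-role \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "resources.cloudformation.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'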
b. The following public extensions in the CloudFormation Registry should be activated with the role created in the earlier step. After logging into the AWS console, use this link to register extensions on CloudFormation.
Pass the ARN of the role from the earlier step as input to activate the MongoDB resource in Public Extension.
MongoDB Resource Activation in Public Extension:
Activation of Registry Public Extensions
The above screenshot shows the activation of the Registry Public Extensions for the MongoDB::Atlas::Cluster.
Alternatively, you can activate the above public extensions through the AWS CLI: one command lists the MongoDB public extensions (note down the ARNs of the four public extensions), and another activates each of the four public extensions mentioned in the previous steps, as sketched below.
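As a rough sketch, the AWS CLI equivalents look like this; the exact extension ARNs come from the output of the list command, and the role ARN is the one created in step a:
# List the public MongoDB Atlas extensions and note down their ARNs
aws cloudformation list-types \
  --visibility PUBLIC \
  --filters Category=THIRD_PARTY,TypeNamePrefix=MongoDB::Atlas

# Activate one extension; repeat for each of the four extensions
aws cloudformation activate-type \
  --type RESOURCE \
  --public-type-arn <public-extension-arn> \
  --execution-role-arn <execution-role-arn>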
c. Log in to the MongoDB Atlas console and note down the organization ID. Ensure the organization ID is updated in global_args.py.
Cluster organization settings
The above screenshot shows the MongoDB Cluster Organization settings.
d. Create an API Key in an organization with Organization Owner access. Note down the API credentials.
Edit API key
The above screenshot shows the access managers for the API Key created in the MongoDB Atlas cluster.
Restrict access to the organization API with the API access list. We provided open access (0.0.0.0/1) for demo purposes only. We strongly discourage using this in any production environment or equivalent.
e. A profile should be created in the AWS Secrets Manager containing the MongoDB Atlas Programmatic API Key.
Use the template to create a new CloudFormation stack for the default profile that all resources will attempt to use unless a different override is specified.

Profile secret stack

AWS CloudFormation stack
The above screenshot shows the parameters for the AWS CloudFormation stack.
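If you would rather create the profile secret from the CLI, a sketch looks roughly like this. The secret name convention and JSON keys are assumptions based on the MongoDB Atlas CloudFormation resources; the template above remains the authoritative source:
aws secretsmanager create-secret \
  --name "cfn/atlas/profile/default" \
  --secret-string '{"PublicKey": "<atlas-public-key>", "PrivateKey": "<atlas-private-key>"}'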
Initiate the deployment with the following command:
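Assuming the stack name shown by cdk ls, that command is:
cdk deploy aws-etl-mongo-atlas-stack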
After successfully deploying the stack, validate the Outputs section of the stack and MongoDB Atlas cluster. You will find the stdUrl and stdSrvUrl for the connection string.

Stack:

Output of CloudFormation stack
The above screenshot shows the output of the CloudFormation stack.

MongoDB Atlas cluster:

Creation of a MongoDB Atlas cluster
The above screenshot shows the successful creation of the MongoDB Atlas cluster.

Stack for creating the Kinesis stream: aws-etl-kinesis-stream-stack

This stack will create two Kinesis data streams. Each producer ingests a continuous stream of events for different customers and their orders.
Initiate the deployment with the following command:
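Assuming the stack name above, that command is:
cdk deploy aws-etl-kinesis-stream-stack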
After successfully deploying the stack, check the Outputs section of the stack. You will find the CustomerOrderKinesisDataStream Kinesis data stream.

Stack:

Output of the CloudFormation stack
The above screenshot shows the output of the CloudFormation stack for Kinesis streams.

Amazon Kinesis data stream:

Kinesis stream
The above screenshot shows the Kinesis stream created.
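You can also verify the streams from the CLI; the stream name placeholder below should be replaced with the name shown in the stack outputs:
aws kinesis list-streams
aws kinesis describe-stream-summary --stream-name <kinesis-stream-name>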

Stack for creating the S3 bucket: aws-etl-bucket-stack

This stack will create an S3 bucket that will be used by AWS Glue jobs to persist the incoming customer and order details.
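Deploying it follows the same pattern as the previous stacks:
cdk deploy aws-etl-bucket-stack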
After successfully deploying the stack, check the Outputs section of the stack. You will find the S3SourceBucket resource.

Stack:

Output of the CloudFormation stack
The above screenshot shows the Output of the CloudFormation stack.
AWS S3 Bucket:
S3 buckets created
The above screenshot shows the S3 buckets created.

Stack for creating the AWS Glue job and parameters: aws-etl-glue-job-stack

This stack will create two AWS Glue jobs: one for the customer stream and another for the order stream. The code is located at glue_job_stack/glue_job_scripts/customer_kinesis_streams_s3.py and glue_job_stack/glue_job_scripts/order_kinesis_streams_s3.py.
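Deploying this stack follows the same pattern. The job names below are placeholders; if you want to start the jobs from the CLI instead of AWS Glue Studio, use the names shown in the stack outputs or the Glue console:
cdk deploy aws-etl-glue-job-stack

aws glue start-job-run --job-name <customer-glue-job-name>
aws glue start-job-run --job-name <order-glue-job-name>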

Stack:

Output of CloudFormation stack
The above screenshot shows the output of the CloudFormation stack.

AWS Glue job:

AWS Glue Studio
The above screenshot shows the job setup in AWS Glue Studio.

Note

The MongoDB URL of the newly created cluster and other parameters will be passed to the AWS Glue job programmatically. Update these parameters to your values (if required).
"Spark UI logs path" and "Temporary path" details will be maintained in the same bucket location with folder name /sparkHistoryLogs and /temporary.

Spark UI logs path:

s3://<S3_BUCKET_NAME>/sparkHistoryLogs

Temporary path:

s3://<S3_BUCKET_NAME>/temporary/

Screenshot of the AWS Glue parameters

AWS Glue parameters
The above screenshot shows the AWS Glue parameters.
Once all the stacks are ready, start the producers for the customer and order streams to ingest data into the Kinesis data streams, and start the Glue job for both. The producer code is located at producer/customer.py and producer/order.py.
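For example, each producer can be run in its own terminal, assuming the virtual environment from the setup step is still active and your AWS credentials are exported:
python3 producer/customer.py
python3 producer/order.py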

Sample record in MongoDB Atlas:

Clean up

Use cdk destroy to clean up all the AWS CDK resources.
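Since the app defines multiple stacks, you can remove them all in one pass:
cdk destroy --all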

Troubleshooting

Refer to GitHub to resolve some common issues encountered when using AWS CloudFormation/CDK with MongoDB Atlas Resources.

Useful commands

  • cdk ls - lists all stacks in the app
  • cdk synth - emits the synthesized CloudFormation template
  • cdk deploy - deploys this stack to your default AWS account/region
  • cdk diff - compares the deployed stack with the current state
  • cdk docs - opens CDK documentation
