Machine Learning in the Cloud: Fully Automated Data Pipeline - Part 3

Symphony
May 25th, 2022

Welcome to the third part of our blog series. In the previous part, we introduced the AWS Glue DataBrew service, which we used for cleaning and transforming our training and test data. In this blog post, we are going to set up CI/CD pipelines to automate the release process of our recipes using AWS CodePipeline, AWS CodeDeploy, AWS CodeCommit, and AWS Lambda. Let’s start off by introducing these services!

AWS Lambda

AWS Lambda is a serverless compute service that sits at the intersection of several concepts in the industry. As microservices have grown in popularity over the last decade or so, they have given rise to event-driven compute, in which events trigger and communicate between the pieces of an application. Each event invokes a specific piece of functionality, and together these functions make up the application as a whole. This execution model introduced Functions-as-a-Service (FaaS). AWS Lambda is a serverless FaaS offering that lets you execute code for any type of backend service without provisioning or managing servers. In our example project, we are going to use AWS Lambda to update and publish new revisions of recipes to our AWS Glue DataBrew environment.

AWS Lambda
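
To make the FaaS model concrete, here is a minimal, hedged sketch of a Lambda handler in Python (the runtime we use later in this post). The event payload shown is illustrative, not tied to any particular trigger:

# Minimal event-driven handler: Lambda invokes this function once per event,
# with no servers for us to provision or manage.
def lambda_handler(event, context):
    # "event" carries the trigger payload (e.g., an S3 upload or a pipeline job)
    print(f"Received event: {event}")
    return {"statusCode": 200}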

AWS CodeCommit

AWS CodeCommit is a secure, highly scalable source and version control service used to host private Git repositories. CodeCommit was designed to make it easy for teams to collaborate on code securely, and it eliminates the need to manage your own source control system or worry about scaling its infrastructure. In our example project, we are going to create an AWS CodeCommit repository to track, manage, and maintain changes to our DataBrew recipes.

AWS CodeDeploy

AWS CodeDeploy is a fully managed deployment service that automates software deployments to various AWS compute services. AWS CodeDeploy makes it easier to roll out new application features quickly. We will use it solely as a “dummy” stage in our CodePipeline (more about that later).

AWS CodePipeline

AWS CodePipeline is a fully-managed CI/CD service that helps us automate our release pipelines for fast and reliable application and infrastructure changes. In our example project, we are going to leverage AWS CodePipeline to build our custom CI/CD pipeline using the previously mentioned AWS services as its primary components (Lambda, CodeCommit, and CodeDeploy).

CI/CD Pipeline Overview

In this section, we present a solution that uses AWS CodePipeline to automatically deploy DataBrew recipes maintained in an AWS CodeCommit repository. The pipeline is triggered once a user changes a recipe and pushes it to the CodeCommit repository. AWS Lambda then updates and publishes a new version of the recipe to the DataBrew services in both the staging and production environments.

Since a multi-account AWS environment is a best practice that provides a higher level of resource isolation, our pipeline spans three environments, as presented in the following diagram:

  • Source environment
  • Staging environment
  • Production environment

Keep in mind that, depending on your specific business requirements, it is recommended to add verification and test steps to your pipeline.

AWS pipeline

The steps in the pipeline are outlined as follows:

  1. The user changes a DataBrew recipe and pushes its JSON definition to the CodeCommit repository
  2. The recipe change triggers a CodePipeline transition to the staging environment
  3. CodeDeploy pushes the updated recipe artifacts to Amazon S3
  4. CodeDeploy triggers the AWS Lambda deployer
  5. AWS Lambda deploys the recipe to DataBrew in the staging environment
  6. CodePipeline transitions to the production environment and repeats steps 3-5

Prerequisites

  • Git client installed locally
  • git-remote-codecommit Python package installed locally
  • Four AWS accounts:
  • Infrastructure - account containing infrastructure resources, including the CodeCommit repo, CodePipeline pipeline, CodeDeploy app, Lambda function, and corresponding IAM permissions.
  • User - account representing users from your organization
  • Staging - account containing staging DataBrew service. Make sure to replicate the steps described in Part 2 of this blog series in this account.
  • Production - account containing production DataBrew service. Make sure to replicate the steps described in Part 2 of this blog series in this account.

Create a CodeCommit repository

In order to set up a CodeCommit repository, follow these next few steps:

  • Once you have opened the CodeCommit service, click on the “Create Repository” button as shown in the image below
AWS CodeCommit repository
  • Choose an arbitrary repository name. In our example, we are going to use “AirlinePassengerSatisfaction-Recipes-Repo”
AWS CodeCommit repository settings
  • Click the “Create” button
  • Create a README file with all the necessary prerequisite steps and push it to the main branch
  • You’re all set!
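
If you prefer the AWS CLI, the same repository can be created with a single command (using the repository name from this example):

> aws codecommit create-repository --repository-name AirlinePassengerSatisfaction-Recipes-Repo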

Create the multi-account access roles

We need IAM roles to delegate multi-account access to different resources. In order to achieve that, we create the following roles and policies:

  • Create a policy and role for the multi-account recipe access in the staging and production account IAM consoles, naming them “StagingMultiAccountRecipeAccessRole” and “ProdMultiAccountRecipeAccessRole” respectively, and adding the permissions found on this link to the JSON tab
  • Create a policy and role for the multi-account repository access in the infrastructure account IAM console, naming it “MultiAccountRepositoryContributorRole” and adding the permissions found on this link to the JSON tab. Make sure to add your infrastructure account ID to the JSON file (a rough sketch of the trust policy these roles need follows this list)
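
The exact permissions live behind the links above. As a rough, hedged sketch, the trust relationship on each multi-account role typically allows the calling account to assume it, along these lines (the account ID is a placeholder; the repository contributor role trusts the user account rather than the infrastructure account):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::infrastructure-account-ID:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}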

Create a policy, role, and user for the multi-account repository access in the user account IAM console:

  • Select the Programmatic and AWS Management Console access types
  • Add the permissions found on this link to the JSON tab. Make sure to add your infrastructure account ID to the JSON file
  • After creating the user, make sure to download the .csv file containing the user credentials
  • Open the created user and click on “Security credentials”. Under “HTTPS Git credentials for AWS CodeCommit” click “Generate credentials”

Configure the AWS CLI on your local machine:

  • Open a command line or terminal window, and start the AWS CLI configuration:

> aws configure

  • When prompted, provide the following information. The access and secret keys are found in the .csv user credentials file:

> AWS Access Key ID [None]: user-access-key
> AWS Secret Access Key [None]: user-secret-access-key
> Default region name [None]: us-east-1
> Default output format [None]: json

  • Open the AWS CLI configuration file in an arbitrary text editor. If your local machine is running a Linux, macOS, or Unix operating system, this file is located at “~/.aws/config”. If, on the other hand, you are using Windows, this file is located at “drive:\Users\YOUR_USERNAME\.aws\config”. Add the following entries to the file:

[default]
account = user-account-id
region = us-east-1
output = json

[profile MultiAccountAccessProfile]
account = infrastructure-account-ID
region = us-east-1
output = json
role_arn = arn:aws:iam::infrastructure-account-ID:role/MultiAccountRepositoryContributorRole
source_profile = default

  • Save the changes and close the editor
  • Clone your shared repository using the “git clone” command, as shown below
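
With git-remote-codecommit installed, the clone command uses the codecommit:// scheme together with the profile configured above (repository name assumed from this example):

> git clone codecommit://MultiAccountAccessProfile@AirlinePassengerSatisfaction-Recipes-Repo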

Push recipes to CodeCommit repository

In Part 2 of our blog series, we presented recipes and described how they are used. Each recipe can be downloaded and applied to multiple datasets. In this section, we show how to download them and push them to our newly created CodeCommit repository.

Create an initial commit

  • Sign in to the AWS Management Console using the user account
  • Open the DataBrew service, and switch to “Recipes” section in the navigation bar
  • Select the recipe to download and choose “Download as JSON”
AWS select recipe
  • Run the following commands in your command line prompt or terminal:

> mv Downloads/your-recipe-name.json AirlinePassengerSatisfaction-Recipes-Repo/
> cd AirlinePassengerSatisfaction-Recipes-Repo/
> git checkout -b feature-branch-name
> git add .
> git commit -m "your-commit-message"
> git push --set-upstream origin feature-branch-name

Create a pull request

  • Sign in to the AWS Management Console using the multi-account repository contributor role. This can be easily achieved using the following link:

https://signin.aws.amazon.com/switchrole?account=infrastructure-account-ID&roleName=MultiAccountRepositoryContributorRole

  • Open the AirlinePassengerSatisfaction-Recipes-Repo and navigate to “Pull requests”
  • Click “Create pull request”
  • For “Source” select your feature-branch-name
  • For “Destination” select your main branch
  • Add description if necessary
  • Click “Create pull request”
AWS recipe pull request

Any other user account can view, review and merge this pull request.
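
As a hedged alternative, the pull request can also be opened from the CLI (branch and repository names as used above; the title is a placeholder):

> aws codecommit create-pull-request --title "Add recipe" --targets repositoryName=AirlinePassengerSatisfaction-Recipes-Repo,sourceReference=feature-branch-name,destinationReference=main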

Create the Lambda deployer function

AWS CodePipeline supports custom actions to handle specific cases that cannot be covered by the provided default actions. We will take advantage of this capability and create a custom Lambda deployer that updates and publishes recipes to our DataBrew instances.

In order to set up our Lambda function, we need to complete the following steps:

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the Lambda service console and create a function with the name “Prod-DeployRecipe”
AWS create lambda function
  • For “Runtime” select a programming language of your preference. If you want to use the code from this example project, choose Python 3.8.
AWS lambda runtime
  • Click “Create Function”
  • In the Lambda code editor, open the script from this link and add the sample code. In short, this code reads the recipe steps from the CodeDeploy artifacts S3 bucket, then assumes the multi-account role to update and publish the recipe with the same name (a hedged sketch of such a function follows this list)
  • Click “Deploy”
AWS deploy lambda function
  • In the “Configuration” tab, add a new environment variable, with the role as the key and the ARN of the production multi-account role as the value
  • In the “Configuration” tab, open the permissions tab
  • Navigate to the IAM console through the edit “Execution role” section
  • Create a policy and add the contents from this link to the JSON tab. Make sure to add the Lambda function name, infrastructure account ID, production account ID, and role name to the JSON file
  • Click “Review policy”, name it and finally click “Create policy”
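
The exact sample code lives behind the link above. For orientation, here is a minimal, hedged Python sketch of such a deployer; the environment variable name (“role”), the artifact layout (a zip of the repository), and the recipe JSON format are assumptions based on the description above, not the verbatim sample code:

import io
import json
import os
import zipfile

import boto3

def lambda_handler(event, context):
    pipeline = boto3.client("codepipeline")
    job = event["CodePipeline.job"]
    try:
        # The input artifact is a zip of the CodeCommit repo, staged in S3
        location = job["data"]["inputArtifacts"][0]["location"]["s3Location"]
        s3 = boto3.client("s3")
        artifact = s3.get_object(Bucket=location["bucketName"],
                                 Key=location["objectKey"])
        archive = zipfile.ZipFile(io.BytesIO(artifact["Body"].read()))

        # Assume the multi-account role so we can reach DataBrew in the
        # target (staging or production) account; "role" is the assumed
        # name of the environment variable configured above
        creds = boto3.client("sts").assume_role(
            RoleArn=os.environ["role"], RoleSessionName="recipe-deployer"
        )["Credentials"]
        databrew = boto3.client(
            "databrew",
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )

        # Update and publish every recipe JSON found in the repository,
        # keeping the recipe name equal to the file name
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            steps = json.loads(archive.read(name))
            recipe_name = name.rsplit("/", 1)[-1][: -len(".json")]
            databrew.update_recipe(Name=recipe_name, Steps=steps)
            databrew.publish_recipe(Name=recipe_name)

        pipeline.put_job_success_result(jobId=job["id"])
    except Exception as error:
        pipeline.put_job_failure_result(
            jobId=job["id"],
            failureDetails={"type": "JobFailed", "message": str(error)},
        )

Note that a Lambda invoked by CodePipeline must report back with put_job_success_result or put_job_failure_result; otherwise the stage hangs until it times out.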

Repeat the same steps to create the staging custom Lambda deployer function, making sure to use a different function name and the staging multi-account role in the JSON file.

Create a three-stage CI/CD pipeline

In this section, we will go through the steps of setting up the pipeline presented in the introductory part of this blog. As we already saw, our pipeline consists of three stages: Source, Staging, and Production.

Create an application in CodeDeploy

A CodeDeploy application is a container for the software you intend to deploy. We will use this application within our pipeline to automatically deploy recipes to DataBrew.

It’s important to note that in this step, we only create a “dummy” CodeDeploy application and deployment group, because we need to specify them in the initial CodePipeline setup. Once the pipeline is set up, we will delete this step from the pipeline and add our custom Lambda function deployers.

In order to set up our CodeDeploy application, we need to complete the following steps:

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the CodeDeploy service console and click on “Applications”
  • Click “Create Application”
  • Enter “AirlinePassengerSatisfaction-Recipe-Application” as the application name
  • Enter “AWS Lambda” as the compute platform
  • Click “Create Application”
AWS CodeDeploy create application
  • Open the created application and click on “Create Deployment Group”
  • Enter “AirlinePassengerSatisfaction-Recipe-Deployment-Group” as the name.
  • Choose any role with minimal access for the service role in the deployment group (since the application and group will soon be deleted). You can use this short tutorial to make this happen
  • Click “Create Deployment Group”
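
For reference, the equivalent “dummy” application can also be created from the CLI (creating the deployment group additionally requires the service role ARN chosen above):

> aws deploy create-application --application-name AirlinePassengerSatisfaction-Recipe-Application --compute-platform Lambda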

Create a pipeline

In order to set up our pipeline, we need to complete the following steps:

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the CodePipeline service console and click on “Pipelines”
  • Click “Create Pipeline”
  • Enter the pipeline name and select “New service role” as seen in the image below. The role name is autogenerated.
AWS CodePipeline settings
  • Click “Next”
  • Select “AWS CodeCommit” as your source provider and specify the name of the repository you have created in one of the previous steps. Also, make sure to select “main” as your target branch
AWS CodeCommit stage
  • Click “Skip build stage”
  • Select “AWS CodeDeploy” as your deploy provider and specify the name and deployment group of the CodeDeploy application you have created previously
AWS CodeDeploy add deploy stage
  • Click “Next”
  • Review the inserted information and click “Create pipeline”
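
Once the pipeline exists, you can confirm its current state from the CLI (the pipeline name is the one you entered above):

> aws codepipeline get-pipeline-state --name your-pipeline-name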

Add Lambda deployers to the pipeline

Now that our initial pipeline is created, we can proceed to add stages that invoke our custom Lambda deployers.

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the CodePipeline service console and click on “Pipelines”
  • Click on the pipeline created in the previous step
  • Click “Edit” and delete the existing Deploy stage

AWS CodePipeline edit pipeline
  • Click “+ Add Stage” to create the Staging deployment phase
  • Enter “Staging-Deploy” as the name
  • Click “Add Stage”
  • Click “+ Add action group”
  • Enter the information as seen in the image below.
AWS CodePipeline actions
  • Click “Done”

Repeat the same steps to create the Production deployment phase, making sure to use different names and the production Lambda deployer.

The resulting pipeline should look like this:

AWS pipeline finished

Don’t forget to delete that dummy CodeDeploy app and deployment group!

Let’s test it out!

Our CI/CD pipeline is ready! Let’s merge our previously created pull request and release the recipe changes through the pipeline. The steps are similar to those we followed when creating the pull request:

  • Sign in to the AWS Management Console using the multi-account repository contributor role. This can be easily achieved using the following link:

https://signin.aws.amazon.com/switchrole?account=infrastructure-account-ID&roleName=MultiAccountRepositoryContributorRole

  • Open the AirlinePassengerSatisfaction-Recipes-Repo and navigate to “Pull requests”
  • Click on the pull request created previously
  • Review the changes and merge the pull request
  • Sign in to the infrastructure account to watch the pipeline deploy the changes. Afterwards, you can sign in to the Staging or Production account to check the updated recipes in DataBrew.
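
To verify the result from the CLI, you can list the published versions of a recipe in the staging or production account (the recipe name is a placeholder):

> aws databrew list-recipe-versions --name your-recipe-name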

Thank you, once again, for your attention. That’s all for Part 3 of our blog series. In the next one, we will learn how to build a machine learning model in AWS SageMaker using the processed data from DataBrew.

About the author

Damir Varesanovic is a Software Engineer working at our engineering hub in Bosnia and Herzegovina.

Damir’s areas of expertise are Data Science and Software Engineering. He is proficient in Python, and he has deep knowledge of algorithms, SQL, and applied statistics. He is also experienced with backend frameworks such as Spring, Android, .NET, Django, Flask, and Node.js, which he has used in various projects during his professional career.
