Machine Learning in the Cloud: Fully Automated Data Pipeline - Part 3

Symphony
May 25th, 2022

Welcome to the third part of our blog series. In the previous part, we introduced the AWS Glue DataBrew service, which we used for cleaning and transforming our training and test data. In this blog post, we are going to set up CI/CD pipelines to automate the release process of our recipes using AWS CodePipeline, AWS CodeDeploy, AWS CodeCommit, and AWS Lambda. Let’s start off by introducing these services!

AWS Lambda

AWS Lambda is a serverless compute service that sits at the intersection of several concepts in the industry. As microservices have grown in popularity over the last decade or so, they have given rise to event-driven compute, in which events trigger and communicate between the pieces of an application. Each event invokes a specific piece of functionality, and together these functions make up the application as a whole. This execution model introduced Functions-as-a-Service (FaaS). AWS Lambda is a serverless FaaS offering that lets you execute code for any type of backend service without provisioning or managing servers. In our example project, we are going to use AWS Lambda to update and publish new revisions of recipes to our AWS Glue DataBrew environment.

AWS Lambda
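
To make the FaaS model concrete, here is a minimal, hedged sketch of a Lambda handler in Python (the runtime we use later in this post). The event payload shown is illustrative, not tied to any particular trigger:

# Minimal event-driven handler: Lambda invokes this function once per event,
# with no servers for us to provision or manage.
def lambda_handler(event, context):
    # "event" carries the trigger payload (e.g., an S3 upload or a pipeline job)
    print(f"Received event: {event}")
    return {"statusCode": 200}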

AWS CodeCommit

AWS CodeCommit is a secure, highly scalable source and version control service used to host private Git repositories. CodeCommit was designed to make it easy for teams to collaborate on code securely, and it eliminates the need to manage your own source control system or worry about scaling its infrastructure. In our example project, we are going to create an AWS CodeCommit repository to track, manage, and maintain changes to our DataBrew recipes.

AWS CodeDeploy

AWS CodeDeploy is a fully managed deployment service that automates software deployments to various AWS compute services. AWS CodeDeploy makes it easier to roll out new application features quickly. We will use it solely as a “dummy” stage in our CodePipeline (more about that later).

AWS CodePipeline

AWS CodePipeline is a fully-managed CI/CD service that helps us automate our release pipelines for fast and reliable application and infrastructure changes. In our example project, we are going to leverage AWS CodePipeline to build our custom CI/CD pipeline using the previously mentioned AWS services as its primary components (Lambda, CodeCommit, and CodeDeploy).

CI/CD Pipeline Overview

In this section, we present a solution that uses AWS CodePipeline to automatically deploy DataBrew recipes maintained in an AWS CodeCommit repository. The pipeline is triggered once a user changes a recipe and pushes it to the CodeCommit repository. AWS Lambda then updates and publishes a new version of the recipe to the DataBrew services in both the staging and production environments.

Since a multi-account AWS environment is a best practice that provides a higher level of resource isolation, our pipeline spans three environments, as presented in the following diagram:

  • Source environment
  • Staging environment
  • Production environment

Keep in mind that, depending on your specific business requirements, it is recommended to add verification and test steps to your pipeline.

AWS pipeline

The steps in the pipeline are outlined as follows:

  1. The user changes a DataBrew recipe and pushes its JSON definition to the CodeCommit repository
  2. The recipe change triggers a CodePipeline transition to the staging environment
  3. CodeDeploy pushes the updated recipe artifacts to Amazon S3
  4. CodeDeploy triggers the AWS Lambda deployer
  5. AWS Lambda deploys the recipe to DataBrew in the staging environment
  6. CodePipeline transitions to the production environment and repeats steps 3-5

Prerequisites

  • Git client installed locally
  • git-remote-codecommit Python package installed locally
  • Four AWS accounts:
  • Infrastructure - account containing infrastructure resources, including the CodeCommit repo, CodePipeline pipeline, CodeDeploy app, Lambda function, and corresponding IAM permissions.
  • User - account representing users from your organization
  • Staging - account containing staging DataBrew service. Make sure to replicate the steps described in Part 2 of this blog series in this account.
  • Production - account containing production DataBrew service. Make sure to replicate the steps described in Part 2 of this blog series in this account.

Create a CodeCommit repository

In order to set up a CodeCommit repository, follow these next few steps:

  • Once you have opened the CodeCommit service, click on the “Create Repository” button as shown in the image below
AWS CodeCommit repository
  • Choose an arbitrary repository name. In our example, we are going to use “AirlinePassengerSatisfaction-Recipes-Repo”
AWS CodeCommit repository settings
  • Click the “Create” button
  • Create a README file with all the necessary prerequisite steps and push it to the main branch
  • You’re all set!
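
If you prefer the AWS CLI, the same repository can be created with a single command (using the repository name from this example):

> aws codecommit create-repository --repository-name AirlinePassengerSatisfaction-Recipes-Repo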

Create the multi-account access roles

We need IAM roles to delegate multi-account access to different resources. In order to achieve that, we create the following roles and policies:

  • Create a policy and role for the multi-account recipe access in the staging and production account IAM consoles, naming them “StagingMultiAccountRecipeAccessRole” and “ProdMultiAccountRecipeAccessRole” respectively, and adding the permissions found on this link to the JSON tab
  • Create a policy and role for the multi-account repository access in the infrastructure account IAM console, naming it “MultiAccountRepositoryContributorRole” and adding the permissions found on this link to the JSON tab. Make sure to add your infrastructure account ID to the JSON file (a rough sketch of the trust policy these roles need follows this list)
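
The exact permissions live behind the links above. As a rough, hedged sketch, the trust relationship on each multi-account role typically allows the calling account to assume it, along these lines (the account ID is a placeholder; the repository contributor role trusts the user account rather than the infrastructure account):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::infrastructure-account-ID:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}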

Create a policy, role, and user for the multi-account repository access in the user account IAM console:

  • Select the Programmatic and AWS Management Console access types
  • Add the permissions found on this link to the JSON tab. Make sure to add your infrastructure account ID to the JSON file
  • After creating the user, make sure to download the .csv file containing the user credentials
  • Open the created user and click on “Security credentials”. Under “HTTPS Git credentials for AWS CodeCommit” click “Generate credentials”

Configure the AWS CLI on your local machine:

  • Open a command line or terminal window, and start the AWS CLI configuration:

> aws configure

  • When prompted, provide the following information. The access and secret keys are found in the .csv user credentials file:

> AWS Access Key ID [None]: user-access-key
> AWS Secret Access Key [None]: user-secret-access-key
> Default region name [None]: us-east-1
> Default output format [None]: json

  • Open the AWS CLI configuration file in an arbitrary text editor. If your local machine is running a Linux, macOS, or Unix operating system, this file is located at “~/.aws/config”. If, on the other hand, you are using Windows, this file is located at “drive:\Users\YOUR_USERNAME\.aws\config”. Add the following entries to the file:

[default]
account = user-account-id
region = us-east-1
output = json

[profile MultiAccountAccessProfile]
account = infrastructure-account-ID
region = us-east-1
output = json
role_arn = arn:aws:iam::infrastructure-account-ID:role/MultiAccountRepositoryContributorRole
source_profile = default

  • Save the changes and close the editor
  • Clone your shared repository using the “git clone” command, as shown below
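
With git-remote-codecommit installed, the clone command uses the codecommit:// scheme together with the profile configured above (repository name assumed from this example):

> git clone codecommit://MultiAccountAccessProfile@AirlinePassengerSatisfaction-Recipes-Repo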

Push recipes to CodeCommit repository

In Part 2 of our blog series, we presented recipes and described how they are used. Each recipe can be downloaded and applied to multiple datasets. In this section, we show how to download them and push them to our newly created CodeCommit repository.

Create an initial commit

  • Sign in to the AWS Management Console using the user account
  • Open the DataBrew service, and switch to “Recipes” section in the navigation bar
  • Select the recipe to download and choose “Download as JSON”
AWS select recipe
  • Run the following commands in your command line prompt or terminal:

> mv Downloads/your-recipe-name.json AirlinePassengerSatisfaction-Recipes-Repo/
> cd AirlinePassengerSatisfaction-Recipes-Repo/
> git checkout -b feature-branch-name
> git add .
> git commit -m "your-commit-message"
> git push --set-upstream origin feature-branch-name

Create a pull request

  • Sign in to the AWS Management Console using the multi-account repository contributor role. This can be easily achieved using the following link:

https://signin.aws.amazon.com/switchrole?account=infrastructure-account-ID&roleName=MultiAccountRepositoryContributorRole

  • Open the AirlinePassengerSatisfaction-Recipes-Repo and navigate to “Pull requests”
  • Click “Create pull request”
  • For “Source” select your feature-branch-name
  • For “Destination” select your main branch
  • Add description if necessary
  • Click “Create pull request”
AWS recipe pull request

Any other user account can view, review and merge this pull request.
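
As a hedged alternative, the pull request can also be opened from the CLI (branch and repository names as used above; the title is a placeholder):

> aws codecommit create-pull-request --title "Add recipe" --targets repositoryName=AirlinePassengerSatisfaction-Recipes-Repo,sourceReference=feature-branch-name,destinationReference=main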

Create the Lambda deployer function

AWS CodePipeline supports custom actions to handle specific cases that cannot be covered by the provided default actions. We will take advantage of this capability and create a custom Lambda deployer that updates and publishes recipes to our DataBrew instances.

In order to set up our Lambda function, we need to complete the following steps:

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the Lambda service console and create a function with the name “Prod-DeployRecipe”
AWS create lambda function
  • For “Runtime” select a programming language of your preference. If you want to use the code from this example project, choose Python 3.8.
AWS lambda runtime
  • Click “Create Function”
  • In the Lambda code editor, open the script from this link and add the sample code. In short, this code reads the recipe steps from the CodeDeploy artifacts S3 bucket, then assumes the multi-account role to update and publish the recipe with the same name (a hedged sketch of such a function follows this list)
  • Click “Deploy”
AWS deploy lambda function
  • In the “Configuration” tab, add a new environment variable, with the role as the key and the ARN of the production multi-account role as the value
  • In the “Configuration” tab, open the permissions tab
  • Navigate to the IAM console through the edit “Execution role” section
  • Create a policy and add the contents from this link to the JSON tab. Make sure to add the Lambda function name, infrastructure account ID, production account ID, and role name to the JSON file
  • Click “Review policy”, name it and finally click “Create policy”
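
The exact sample code lives behind the link above. For orientation, here is a minimal, hedged Python sketch of such a deployer; the environment variable name (“role”), the artifact layout (a zip of the repository), and the recipe JSON format are assumptions based on the description above, not the verbatim sample code:

import io
import json
import os
import zipfile

import boto3

def lambda_handler(event, context):
    pipeline = boto3.client("codepipeline")
    job = event["CodePipeline.job"]
    try:
        # The input artifact is a zip of the CodeCommit repo, staged in S3
        location = job["data"]["inputArtifacts"][0]["location"]["s3Location"]
        s3 = boto3.client("s3")
        artifact = s3.get_object(Bucket=location["bucketName"],
                                 Key=location["objectKey"])
        archive = zipfile.ZipFile(io.BytesIO(artifact["Body"].read()))

        # Assume the multi-account role so we can reach DataBrew in the
        # target (staging or production) account; "role" is the assumed
        # name of the environment variable configured above
        creds = boto3.client("sts").assume_role(
            RoleArn=os.environ["role"], RoleSessionName="recipe-deployer"
        )["Credentials"]
        databrew = boto3.client(
            "databrew",
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )

        # Update and publish every recipe JSON found in the repository,
        # keeping the recipe name equal to the file name
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            steps = json.loads(archive.read(name))
            recipe_name = name.rsplit("/", 1)[-1][: -len(".json")]
            databrew.update_recipe(Name=recipe_name, Steps=steps)
            databrew.publish_recipe(Name=recipe_name)

        pipeline.put_job_success_result(jobId=job["id"])
    except Exception as error:
        pipeline.put_job_failure_result(
            jobId=job["id"],
            failureDetails={"type": "JobFailed", "message": str(error)},
        )

Note that a Lambda invoked by CodePipeline must report back with put_job_success_result or put_job_failure_result; otherwise the stage hangs until it times out.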

Repeat the same steps to create the staging custom Lambda deployer function, making sure to use a different function name and the staging multi-account role in the JSON file.

Create a three-stage CI/CD pipeline

In this section, we will go through the steps of setting up the pipeline presented in the introductory part of this blog. As we already saw, our pipeline consists of three stages: Source, Staging, and Production.

Create an application in CodeDeploy

A CodeDeploy application is a container for the software you intend to deploy. We will use this application within our pipeline to automatically deploy recipes to DataBrew.

It’s important to note that in this step, we only create a “dummy” CodeDeploy application and deployment group, because we need to specify them in the initial CodePipeline setup. Once the pipeline is set up, we will delete this step from the pipeline and add our custom Lambda function deployers.

In order to set up our CodeDeploy application, we need to complete the following steps:

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the CodeDeploy service console and click on “Applications”
  • Click “Create Application”
  • Enter “AirlinePassengerSatisfaction-Recipe-Application” as the application name
  • Enter “AWS Lambda” as the compute platform
  • Click “Create Application”
AWS CodeDeploy create application
  • Open the created application and click on “Create Deployment Group”
  • Enter “AirlinePassengerSatisfaction-Recipe-Deployment-Group” as the name.
  • Choose any role with minimal access for the service role in the deployment group (since the application and group will soon be deleted). You can use this short tutorial to make this happen
  • Click “Create Deployment Group”
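
For reference, the equivalent “dummy” application can also be created from the CLI (creating the deployment group additionally requires the service role ARN chosen above):

> aws deploy create-application --application-name AirlinePassengerSatisfaction-Recipe-Application --compute-platform Lambda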

Create a pipeline

In order to set up our pipeline, we need to complete the following steps:

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the CodePipeline service console and click on “Pipelines”
  • Click “Create Pipeline”
  • Enter the pipeline name and select “New service role” as seen in the image below. The role name is autogenerated.
AWS CodePipeline settings
  • Click “Next”
  • Select “AWS CodeCommit” as your source provider and specify the name of the repository you have created in one of the previous steps. Also, make sure to select “main” as your target branch
AWS CodeCommit stage
  • Click “Skip build stage”
  • Select “AWS CodeDeploy” as your deploy provider and specify the name and deployment group of the CodeDeploy application you have created previously
AWS CodeDeploy add deploy stage
  • Click “Next”
  • Review the inserted information and click “Create pipeline”
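
Once the pipeline exists, you can confirm its current state from the CLI (the pipeline name is the one you entered above):

> aws codepipeline get-pipeline-state --name your-pipeline-name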

Add Lambda deployers to the pipeline

Now that our initial pipeline is created, we can proceed to add stages that invoke our custom Lambda deployers.

  • Sign in to the AWS Management Console using the infrastructure account
  • Open the CodePipeline service console and click on “Pipelines”
  • Click on the pipeline created in the previous step
  • Click “Edit” and delete the existing Deploy stage

AWS CodePipeline edit pipeline
  • Click “+ Add Stage” to create the Staging deployment phase
  • Enter “Staging-Deploy” as the name
  • Click “Add Stage”
  • Click “+ Add action group”
  • Enter the information as seen in the image below.
AWS CodePipeline actions
  • Click “Done”

Repeat the same steps to create the Production deployment phase, making sure to use different names and the production Lambda deployer.

The resulting pipeline should look like this:

AWS pipeline finished

Don’t forget to delete that dummy CodeDeploy app and deployment group!

Let’s test it out!

Our CI/CD pipeline is ready! Let’s merge our previously created pull request and release the recipe changes through the pipeline. The steps are similar to those we followed when creating the pull request:

  • Sign in to the AWS Management Console using the multi-account repository contributor role. This can be easily achieved using the following link:

https://signin.aws.amazon.com/switchrole?account=infrastructure-account-ID&roleName=MultiAccountRepositoryContributorRole

  • Open the AirlinePassengerSatisfaction-Recipes-Repo and navigate to “Pull requests”
  • Click on the pull request created previously
  • Review the changes and merge the pull request
  • Sign in to the infrastructure account to watch the pipeline deploy the changes. Afterwards, you can sign in to the Staging or Production account to check the updated recipes in DataBrew.
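
To verify the result from the CLI, you can list the published versions of a recipe in the staging or production account (the recipe name is a placeholder):

> aws databrew list-recipe-versions --name your-recipe-name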

Thank you, once again, for your attention. That’s all for Part 3 of our blog series. In the next one, we will learn how to build a machine learning model in AWS SageMaker using the processed data from DataBrew.

About the author

Damir Varesanovic is a Software Engineer working at our engineering hub in Bosnia and Herzegovina.

Damir’s areas of expertise are Data Science and Software Engineering. He is proficient in Python, and he has deep knowledge of algorithms, SQL, and applied statistics. He is also experienced with backend frameworks such as Spring, Android, .NET, Django, Flask, and Node.js, which he has used in various projects during his professional career.
