
Machine Learning in the Cloud: Fully Automated Data Pipeline

Symphony
April 19th, 2022

Introduction

As cloud services become more and more popular, the potential for building machine learning applications in the cloud also increases. In this blog post, we'll focus on Amazon Web Services (AWS) and explore how to set up a fully automated data pipeline for a machine learning application using some of these services. AWS provides every developer and data scientist with a broad set of machine learning services, enabling them to create impactful solutions faster. We'll cover everything from dataset collection and preparation to model training and deployment. By the end of this article, you'll have all the tools you need to get started with machine learning in the cloud.

The majority of machine learning experiments begin with understanding your data on your desktop or laptop and don't require much computing power. However, you will quickly find yourself in need of more resources than your local CPU can provide. By far the most scalable environment for machine learning is the cloud. You'll have access to the latest GPUs (and even TPUs) that you couldn't purchase or maintain on your own.

Some of the main reasons to do machine learning in the cloud:

  • The cloud’s pay-per-use model is excellent for spiky AI or machine learning workloads.
  • Cloud providers make it easy to experiment with machine learning services and scale up as projects and demand grow.
  • Cloud providers make their state-of-the-art services accessible without requiring advanced skills in AI or data science, and in some cases without any programming at all.
  • At the click of a mouse, you'll have access to the newest hardware.

Keep in mind, though, that on-premises servers are a valid alternative when your needs meet the following criteria:

  • You need compute capacity 24/7.
  • You have sensitive data that cannot leave your data centers for compliance or other reasons.

Problem Description

In this example project, we are attempting to solve a problem that is common among businesses: customer satisfaction. To be more exact, we are going to create a model that predicts whether a customer was satisfied or unsatisfied with the experience and/or service provided by an airline. The dataset used for this project is accessible on Kaggle and can be downloaded from this link. This dataset has already been split into train and test CSV files: 80% of the total dataset is in train.csv and 20% is in test.csv.
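To give a feel for the data, here is a minimal sketch of loading a few records and mapping the target to a binary label. The column names and label values below are assumptions based on the public Kaggle airline satisfaction dataset; adjust them to match the CSV files you download.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of train.csv / test.csv
# (in practice you would call pd.read_csv("train.csv") instead).
sample_csv = io.StringIO(
    "Gender,Customer Type,Flight Distance,satisfaction\n"
    "Male,Loyal Customer,460,neutral or dissatisfied\n"
    "Female,Loyal Customer,235,satisfied\n"
)
df = pd.read_csv(sample_csv)

# Turn the text target into a 0/1 label: 1 = satisfied, 0 = unsatisfied.
df["label"] = (df["satisfaction"] == "satisfied").astype(int)
print(df[["satisfaction", "label"]])
```

The same label-mapping step applies to both train.csv and test.csv, so the model sees a consistent binary target.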

An important thing to note is that our goal is to present the cloud services for machine learning and show how easy it is to get started with some of them. Therefore, we are going to keep the implementation of the model as simple as possible. Feel free to play around with this dataset and drop a comment if you find a way to optimize the solution.

So, we will break down our problem and blog series into three steps as follows:

  • Clean and transform the dataset, in order to prepare it for further analysis and model implementation. For this step, we are going to use AWS Glue DataBrew.
  • Set up CI/CD pipelines in order to automate the release process of our recipes using AWS CodePipeline, AWS CodeCommit, and AWS Lambda.
  • Build an ML model using AWS SageMaker.

That’s all for Part 1 of our blog series. In the next one, we'll tackle the first step: introducing AWS Glue DataBrew and learning how to transform and prepare our data for further use.

About the author

Damir Varesanovic is a Software Engineer working at our engineering hub in Bosnia and Herzegovina.

Damir’s areas of expertise are Data Science and Software Engineering. He is proficient in Python, and he has deep knowledge of algorithms, SQL, and applied statistics. He also has experience with backend frameworks such as Spring, Android, .NET, Django, Flask, and Node.js, which he has used in various projects throughout his professional career.