Building an ML Platform from Scratch
Alon Gubkin · Jul 22, 2021

As your data science team grows and you start deploying models to production, the need for proper ML infrastructure becomes crucial – a standard way to design, train and deploy models.
In this guide, we’ll build a basic ML Platform using open-source tools like Cookiecutter, DVC, MLFlow, FastAPI, Pulumi and more. We’ll also see how to monitor for model drift using Aporia. The final code is available on GitHub.
Keep in mind that a project like this can be huge, often taking a lot of time and resources, so our toy ML Platform will cover only the basics – but it should teach you the core principles of building your own ML platform.
This how-to guide is based on a workshop I did for the MLOps Community.

Architecture

Our toy ML Platform will use DVC for data versioning, MLFlow for experiments management, FastAPI for model serving, and Aporia for model monitoring.
We’re going to build all of this on top of AWS, but in theory you could also use Azure, Google Cloud or any other cloud provider.
It’s important to note that when building your own machine learning platform, you should NOT take these tools for granted. You should evaluate the alternatives – as they may be more appropriate for your specific use case.

Model Template

The first component in our machine learning platform is going to be the model template, which we’re going to build using Cookiecutter for templating, and Poetry for package management.
The idea is that when a data scientist starts working on a new project, they will clone our model template (which contains a standard folder structure, Python linting, etc), develop their model, and easily deploy it when it’s ready for production.
The model template will contain basic training and serving code.

Data & Experiment Tracking

The training code in the model template will use the MLFlow client to track experiments.
Those experiments will be sent to the MLFlow Server that we’ll run on top of Kubernetes (EKS).
The model artifact itself is going to be saved in an S3 bucket (the Artifact Storage), and metadata about experiments will be saved in a PostgreSQL database.
We’ll also track versions of the dataset using DVC, in another S3 bucket.

Model Serving

For model serving, we will build a FastAPI server that’ll be responsible for preprocessing, making predictions, etc.
These model servers are going to run on Kubernetes, and we’ll expose them to the internet using Traefik.

Infrastructure as Code

All our infrastructure is going to be deployed using Pulumi, which is an Infrastructure as Code tool similar to Terraform.
If you aren’t familiar with the concept, you can read more about it here before continuing. Here are some major advantages of using this method:

  • Versioned: Your infrastructure is versioned, so if you have a bug you can easily revert it to a previous version.
  • Code Reviewed: Each change to the infrastructure can be code reviewed and you’re less prone to mistakes.
  • Sharable: You can easily share infrastructure components, just send the source code for the component.

With Pulumi, you can choose to write your infrastructure in a real programming language, such as TypeScript, Python, C# and more.
Even though the natural choice for an ML platform would be Python, I chose TypeScript because at the time of writing this post (July 2021), Pulumi’s implementation of TypeScript is more feature complete.

Repositories & CI/CD

We’re going to have 2 GitHub repositories:

  • mlplatform-infra – the Pulumi code for the shared, non-model-specific infrastructure of the ML Platform: Kubernetes, MLFlow, S3 buckets, etc.
  • model-template – the model template code that data scientists can clone, including basic training code, a FastAPI server, etc.

For CI/CD we’re going to use GitHub Actions.

Let’s start building!

Let’s start by setting up Kubernetes with an MLFlow server.
Create a new GitHub repo called mlplatform-infra. This repo will contain the underlying infrastructure that’s shared between models.
Clone it and run:

pulumi new

Open index.ts and write the code for creating a new EKS cluster:

// Create a Kubernetes cluster.

const cluster = new eks.Cluster('mlplatform-eks', {
  createOidcProvider: true,
});
export const kubeconfig = cluster.kubeconfig;

The createOidcProvider option is required because MLFlow is going to access the artifact storage (see the architecture above), which is an S3 bucket, so we’ll need to create a Kubernetes ServiceAccount that can access S3 buckets.

There are more interesting arguments in this API – you should definitely check them out before deploying to production.

MLFlow Installation

To install packages like MLFlow on Kubernetes, we’re going to use Helm, which is a very popular package manager for Kubernetes.
Specifically, we’re going to use the larribas/mlflow Helm chart:

// Install MLFlow
const mlflow = new k8s.helm.v3.Chart("mlflow", {
  chart: "mlflow",
  fetchOpts: { repo: "https://larribas.me/helm-charts" },
}, { provider: cluster.provider });

This should install an MLFlow server, but unfortunately it’s not going to work because we still need to connect it to an artifact storage and a model metadata database.

Let’s create an S3 bucket for the artifact storage, and configure MLFlow to use it using the Helm chart’s values:

// Create S3 bucket for MLFlow artifact storage
const artifactStorage = new aws.s3.Bucket("artifact-storage", {
  acl: "public-read-write",
});

// Install MLFlow
const mlflow = new k8s.helm.v3.Chart("mlflow", {
  chart: "mlflow",
  values: {
    defaultArtifactRoot: artifactStorage.bucket.apply((bucketName: string) => `s3://${bucketName}`),
  },
  fetchOpts: { repo: "https://larribas.me/helm-charts" },
}, { provider: cluster.provider });

This should almost work, but there’s one problem: the MLFlow server, which is running on Kubernetes, isn’t going to have access to S3 buckets (see comment above on createOidcProvider).

To fix the permissions issue, let’s create a ServiceAccount that can access S3 buckets and use it:

// Create S3 bucket for MLFlow artifact storage
const artifactStorage = new aws.s3.Bucket("artifact-storage", {
  acl: "public-read-write",
});

// Create a k8s ServiceAccount that can access S3 buckets for MLFlow
const mlflowServiceAccount = new S3ServiceAccount('mlflow-service-account', {
  namespace: "default",
  oidcProvider: cluster.core.oidcProvider!,
  readOnly: false,
}, { provider: cluster.provider });

// Install MLFlow
const mlflow = new k8s.helm.v3.Chart("mlflow", {
  chart: "mlflow",
  values: {
    defaultArtifactRoot: artifactStorage.bucket.apply((bucketName: string) => `s3://${bucketName}`),
    serviceAccount: {
      create: false,
      name: mlflowServiceAccount.name,
    }
  },
  fetchOpts: { repo: "https://larribas.me/helm-charts" },
}, { provider: cluster.provider });

The S3ServiceAccount is available here.
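If you prefer not to copy it, here’s a rough sketch of what such a component might look like – a simplified illustration rather than the exact implementation from the repo. It creates an IAM role assumable through the cluster’s OIDC provider (IRSA) plus a Kubernetes ServiceAccount annotated with that role; a production version should also restrict the role’s trust policy to the specific ServiceAccount:

import * as aws from '@pulumi/aws';
import * as k8s from '@pulumi/kubernetes';
import * as pulumi from '@pulumi/pulumi';

interface S3ServiceAccountArgs {
  namespace: string;
  oidcProvider: aws.iam.OpenIdConnectProvider;
  readOnly: boolean;
}

export default class S3ServiceAccount extends pulumi.ComponentResource {
  public readonly name: pulumi.Output<string>;

  constructor(name: string, args: S3ServiceAccountArgs, opts?: pulumi.ComponentResourceOptions) {
    super('mlplatform:S3ServiceAccount', name, {}, opts);

    // IAM role that can be assumed with a web identity token from the cluster's OIDC provider.
    // NOTE: for production, add a Condition restricting the `sub` claim to this specific ServiceAccount.
    const role = new aws.iam.Role(`${name}-role`, {
      assumeRolePolicy: args.oidcProvider.arn.apply(arn => JSON.stringify({
        Version: '2012-10-17',
        Statement: [{
          Effect: 'Allow',
          Principal: { Federated: arn },
          Action: 'sts:AssumeRoleWithWebIdentity',
        }],
      })),
    }, { parent: this });

    // Grant S3 access to the role (read-only or read-write)
    new aws.iam.RolePolicyAttachment(`${name}-s3-access`, {
      role,
      policyArn: args.readOnly
        ? 'arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess'
        : 'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    }, { parent: this });

    // Kubernetes ServiceAccount annotated with the role ARN, so pods using it get S3 access
    const serviceAccount = new k8s.core.v1.ServiceAccount(name, {
      metadata: {
        namespace: args.namespace,
        annotations: { 'eks.amazonaws.com/role-arn': role.arn },
      },
    }, { parent: this });

    this.name = serviceAccount.metadata.name;
    this.registerOutputs({ name: this.name });
  }
}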

Finally, we need to set up the Postgres database for model metadata. We’ll use RDS for that:

// Create Postgres database for MLFlow
const dbPassword = new random.RandomPassword('mlplatform-db-password', { length: 16, special: false });
const db = new aws.rds.Instance('mlflow-db', {
  allocatedStorage: 10,
  engine: "postgres",
  engineVersion: "11.10",
  instanceClass: "db.t3.medium",
  name: "mlflow",
  password: dbPassword.result,
  skipFinalSnapshot: true,
  username: "postgres",
  
  // Make sure EKS has access to this db
  vpcSecurityGroupIds: [cluster.clusterSecurityGroup.id, cluster.nodeSecurityGroup.id],
});

// Install MLFlow
const mlflow = new k8s.helm.v3.Chart("mlflow", {
  chart: "mlflow",
  values: {
    // ...
    backendStore: {
      postgres: {
        username: db.username,
        password: db.password,
        host: db.address,
        port: db.port,
        database: "mlflow"
      }
    },
  },
  fetchOpts: { repo: "https://larribas.me/helm-charts" },
}, { provider: cluster.provider });

That’s it, run pulumi up and you should see MLFlow running on your new Kubernetes cluster!
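If you want to verify that everything actually came up, you can export the kubeconfig from Pulumi and point kubectl at it – a quick sanity check, assuming you have kubectl installed (depending on your Pulumi version you may need --show-secrets):

pulumi stack output kubeconfig > kubeconfig.json
KUBECONFIG=./kubeconfig.json kubectl get pods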

Exposing K8s apps to the Internet using Traefik

Even though MLFlow is running successfully, you have no way of accessing it. Let’s open MLFlow to the internet! 

Security Warning

In real life you shouldn’t just expose MLFlow to the Internet, of course 🙂 Instead, think of an authentication & authorization mechanism that works for your organization.

To expose Kubernetes applications to the Internet, we are going to use Traefik – an open-source API gateway.
To install Traefik:

const traefik = new k8s.helm.v3.Chart('traefik', {
  chart: 'traefik',
  fetchOpts: { repo: 'https://containous.github.io/traefik-helm-chart' },
}, { provider: cluster.provider })

To expose MLFlow through the Traefik instance we just installed:

// Expose MLFlow in Traefik as /mlflow 
new TraefikRoute('mlflow', {
  prefix: '/mlflow',
  service: mlflow.getResource('v1/Service', 'mlflow', 'mlflow'),
  namespace: 'default',
}, { provider: cluster.provider, dependsOn: [mlflow] });

The TraefikRoute component is available here.
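Like S3ServiceAccount, you can copy TraefikRoute from the repo, but here’s a simplified sketch of what it does – an illustration rather than the exact implementation. It creates a strip-prefix Middleware plus an IngressRoute, so the app itself doesn’t need to know it’s served under a prefix:

import * as pulumi from '@pulumi/pulumi';
import * as k8s from '@pulumi/kubernetes';

interface TraefikRouteArgs {
  prefix: string;
  service: pulumi.Input<k8s.core.v1.Service>;
  namespace: string;
}

export default class TraefikRoute extends pulumi.ComponentResource {
  constructor(name: string, args: TraefikRouteArgs, opts?: pulumi.ComponentResourceOptions) {
    super('mlplatform:TraefikRoute', name, {}, opts);

    const service = pulumi.output(args.service);

    // Strip the path prefix before forwarding, so the app sees clean paths
    const stripPrefix = new k8s.apiextensions.CustomResource(`${name}-strip-prefix`, {
      apiVersion: 'traefik.containo.us/v1alpha1',
      kind: 'Middleware',
      metadata: { namespace: args.namespace },
      spec: { stripPrefix: { prefixes: [args.prefix] } },
    }, { parent: this });

    // Route any request matching the prefix to the target service
    new k8s.apiextensions.CustomResource(`${name}-ingress-route`, {
      apiVersion: 'traefik.containo.us/v1alpha1',
      kind: 'IngressRoute',
      metadata: { namespace: args.namespace },
      spec: {
        entryPoints: ['web'],
        routes: [{
          match: `PathPrefix(\`${args.prefix}\`)`,
          kind: 'Rule',
          middlewares: [{ name: stripPrefix.metadata.name }],
          services: [{
            name: service.metadata.name,
            port: service.spec.ports[0].port,
          }],
        }],
      },
    }, { parent: this });

    this.registerOutputs({});
  }
}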

To get Traefik’s public hostname, add the following line:

export const hostname = traefik.getResource('v1/Service', 'traefik').status.loadBalancer.ingress[0].hostname;

After running pulumi up again, you should be able to get that hostname by running:

pulumi stack output hostname

You can now create a CNAME record in your DNS provider pointing your domain to this hostname.

You should now be able to access MLFlow at http://yourdomain.com/mlflow 🙂

Security Warning

In real life you would want to set up HTTPS and disable plain HTTP in Traefik’s Helm chart values. In this guide we use HTTP only.

Model Template: Training

Let’s start building our model template. Clone the model-template repo and start a new Poetry package:

poetry new --src my_model

We’ll call it my_model for now, and change it to Cookiecutter variables later on.
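For reference, poetry new --src my_model generates a layout roughly like this (exact files vary slightly between Poetry versions); we’ll add the training and serving packages under src/my_model/:

my_model/
├── pyproject.toml
├── README.rst
├── src/
│   └── my_model/
│       └── __init__.py
└── tests/
    ├── __init__.py
    └── test_my_model.py
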
Create a my_model.training package and add your training entry point. Here we’ll use a simple LightGBM example:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
import pandas as pd
import lightgbm as lgb


# Prepare training data
df = pd.read_csv('data/iris.csv')
flower_names = {'Setosa': 0, 'Versicolor': 1, 'Virginica': 2}


X = df[['sepal.length', 'sepal.width', 'petal.length', 'petal.width']]
y = df['variety'].map(flower_names)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_data = lgb.Dataset(X_train, label=y_train)

def main():
  # Train model
  params = {
    "objective": "multiclass",
    "num_class": 3, 
    "learning_rate": 0.2,
    "metric": "multi_logloss",
    "feature_fraction": 0.8,
    "bagging_fraction": 0.9,
    "seed": 42,
  }

  model = lgb.train(params, train_data, valid_sets=[train_data])

  # Evaluate model
  y_proba = model.predict(X_test)
  y_pred = y_proba.argmax(axis=1)

  loss = log_loss(y_test, y_proba)
  acc = accuracy_score(y_test, y_pred)
 

if __name__ == "__main__":
    main()

To make it work, you will need the iris.csv dataset from here.

You can also add a Poetry script to make training easy to run. Add this to your pyproject.toml file:

[tool.poetry.scripts]
train = "src.my_model.training.train:main"

And now you can run:

poetry run train	

OK, let’s add the MLFlow client to this training code. The changes are the MLFlow imports and tracking setup at the top, wrapping training in mlflow.start_run(), logging custom metrics, and printing the Run ID at the end:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
import pandas as pd
import lightgbm as lgb
import mlflow
import mlflow.lightgbm


# Enable auto logging
mlflow.set_tracking_uri('http://yourdomain.com/mlflow')
mlflow.lightgbm.autolog()


# Prepare training data
df = pd.read_csv('data/iris.csv')
flower_names = {'Setosa': 0, 'Versicolor': 1, 'Virginica': 2}


X = df[['sepal.length', 'sepal.width', 'petal.length', 'petal.width']]
y = df['variety'].map(flower_names)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_data = lgb.Dataset(X_train, label=y_train)

def main():
  with mlflow.start_run() as run:
    # Train model
    params = {
      "objective": "multiclass",
      "num_class": 3, 
      "learning_rate": 0.2,
      "metric": "multi_logloss",
      "feature_fraction": 0.8,
      "bagging_fraction": 0.9,
      "seed": 42,
    }
    
    model = lgb.train(params, train_data, valid_sets=[train_data])

    # Evaluate model
    y_proba = model.predict(X_test)
    y_pred = y_proba.argmax(axis=1)
    
    loss = log_loss(y_test, y_proba)
    acc = accuracy_score(y_test, y_pred)

    # Log custom metrics if you want
    mlflow.log_metrics({
      "log_loss": loss, 
      "accuracy": acc
    })

  print("Run ID:", run.info.run_id)

if __name__ == "__main__":
    main()

Try to train the model again and go to your MLFlow server. You should now see the new experiment, and you can even download the model file from the browser 🙂
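If you want to double-check a logged model outside of the training script, you can load it back by its Run ID – this assumes your shell has AWS credentials that can read the artifact bucket, and <RUN_ID> is a placeholder for a real run ID:

import mlflow
import mlflow.lightgbm

mlflow.set_tracking_uri("http://yourdomain.com/mlflow")

# Load the LightGBM model that autologging saved under this run
model = mlflow.lightgbm.load_model("runs:/<RUN_ID>/model")
print(model)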

Model Template: Serving

Let’s implement the model server based on FastAPI.
Create a my_model.serving package with a basic FastAPI server that can make predictions with your model. In our case:

import os
import uvicorn
import mlflow
import numpy as np
import pandas as pd
from fastapi import FastAPI, Request
from pydantic import BaseModel


class FlowerPartSize(BaseModel):
    length: float
    width: float

class PredictRequest(BaseModel):
    sepal: FlowerPartSize
    petal: FlowerPartSize


app = FastAPI()

# Load model
model = mlflow.lightgbm.load_model(f'runs:/{os.environ["MLFLOW_RUN_ID"]}/model')

flower_name_by_index = {0: 'Setosa', 1: 'Versicolor', 2: 'Virginica'}


@app.post("/predict")
def predict(request: PredictRequest):
    df = pd.DataFrame(columns=['sepal.length', 'sepal.width', 'petal.length', 'petal.width'],
                      data=[[request.sepal.length, request.sepal.width, request.petal.length, request.petal.width]])

    y_pred = np.argmax(model.predict(df))
    return {"flower": flower_name_by_index[y_pred]}


def main():
    uvicorn.run(app, host="0.0.0.0", port=8000)

if __name__ == "__main__":
    main()

Note that the model is loaded from the Artifact Storage (the S3 bucket) through MLFlow.
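You can sanity-check the server locally before moving on – assuming your shell has AWS credentials that can read the artifact bucket, and with <RUN_ID> replaced by a real run ID from MLFlow:

MLFLOW_TRACKING_URI=http://yourdomain.com/mlflow MLFLOW_RUN_ID=<RUN_ID> poetry run python -m my_model.serving.main

curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"sepal": {"length": 5.1, "width": 3.5}, "petal": {"length": 1.4, "width": 0.2}}'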

OK, we’ll now need to do some DevOps to make this server run on our Kubernetes cluster. Let’s start by containerizing it.

Add a Dockerfile:

FROM python:3.8-slim
WORKDIR /my_model
STOPSIGNAL SIGINT

ENV LISTEN_PORT 80

# System dependencies
RUN apt update && apt install -y libgomp1
RUN pip3 install poetry

# Project dependencies
COPY poetry.lock pyproject.toml ./

RUN poetry config virtualenvs.create false
RUN poetry install --no-interaction --no-ansi --no-dev

COPY . .

WORKDIR /my_model/src
ENTRYPOINT uvicorn my_model.serving.main:app --host 0.0.0.0 --port $LISTEN_PORT --workers 2

Next, let’s create a Pulumi package for the model that can build & push this Docker image to ECR, and deploy it to Kubernetes.

Create an infra directory inside the model template and run pulumi new, as before.

The Pulumi code should look something like:

import * as pulumi from '@pulumi/pulumi';
import * as awsx from '@pulumi/awsx';
import * as k8s from '@pulumi/kubernetes';
import * as kx from '@pulumi/kubernetesx';
import TraefikRoute from './TraefikRoute';

const config = new pulumi.Config();
const baseStack = new pulumi.StackReference(config.require('baseStackName'))

// Connect to the Kubernetes we created in mlplatform-infra
const provider = new k8s.Provider('provider', {
  kubeconfig: baseStack.requireOutput('kubeconfig'),
})

// Build & push Docker image to ECR
const image = awsx.ecr.buildAndPushImage('my-model-image', {
  context: '../',
});

const podBuilder = new kx.PodBuilder({
  containers: [{
    image: image.imageValue,
    ports: { http: 80 },
    env: {
      'LISTEN_PORT': '80',
      'MLFLOW_TRACKING_URI': 'http://yourcompany.com/mlflow',
      'MLFLOW_RUN_ID': config.require('runID'),
    }
  }],
  serviceAccountName: baseStack.requireOutput('modelsServiceAccountName'),
});

const deployment = new kx.Deployment('my-model-serving', {
  spec: podBuilder.asDeploymentSpec({ replicas: 3 }) 
}, { provider });

const service = deployment.createService();


// Expose model in Traefik 
new TraefikRoute('my-model', {
  prefix: '/models/my-model',
  service,
  namespace: 'default',
}, { provider, dependsOn: [service] });

Note that in the PodBuilder we use a special Kubernetes ServiceAccount (modelsServiceAccountName) that can read from S3 buckets – that’s how the pod is allowed to load the model artifact.
This is similar to the ServiceAccount we created for MLFlow, but read-only – you can copy-paste that piece of code into mlplatform-infra, change readOnly to true, and export its name, as shown below.
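In mlplatform-infra, that could look something like this (the modelsServiceAccountName export is what the model’s Pulumi stack reads with requireOutput):

// Read-only ServiceAccount that model servers use to pull artifacts from S3
const modelsServiceAccount = new S3ServiceAccount('models-service-account', {
  namespace: 'default',
  oidcProvider: cluster.core.oidcProvider!,
  readOnly: true,
}, { provider: cluster.provider });

export const modelsServiceAccountName = modelsServiceAccount.name;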

Model Template: Monitoring

We’ll now set up a data drift monitor using Aporia. For this guide we’ll use Aporia’s cloud. However, Aporia can also be easily installed in your private VPC.

Start by creating a free account, then click the “Add Model” button and fill in some basic details for your new model.

Then, integrate Aporia into your training code:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
import pandas as pd
import lightgbm as lgb
import mlflow
import mlflow.lightgbm
import aporia
            
            
aporia.init(token="<YOUR APORIA TOKEN>", 
            environment="local-dev",
            verbose=True)


# ...

def main():
  with mlflow.start_run() as run:
    # Train model
    # ...

  # Log the new model version to Aporia
  apr_model = aporia.create_model_version(
    model_id="my-model",
    model_version=run.info.run_id,
    model_type="binary",
    features=aporia.pandas.infer_schema_from_dataframe(X),
    predictions=aporia.pandas.infer_schema_from_dataframe(y.to_frame()),
  )
  
  # Log training set to Aporia
  #  (only aggregations, because it can be huge)
  apr_model.log_training_set(
    features=X_train,
    labels=y_train.to_frame(),
  )

  # Log test set to Aporia
  #  (only aggregations, because it can be huge)
  apr_model.log_test_set(
    features=X_test,
    labels=y_test.to_frame(),
    predictions=pd.DataFrame(data=y_pred, columns=['variety']),
  )
  
  print("Run ID:", run.info.run_id)

if __name__ == "__main__":
    main()

And integrate Aporia into your serving code:

import os
import uvicorn
import mlflow
import numpy as np
import pandas as pd
from fastapi import FastAPI, Request
from pydantic import BaseModel
import uuid
import aporia

     
aporia.init(token="<YOUR APORIA TOKEN>", 
            environment="production",
            verbose=True)

apr_model = aporia.Model("my-model", os.environ["MLFLOW_RUN_ID"])

# ...


@app.post("/predict")
def predict(request: PredictRequest):
    df = pd.DataFrame(columns=['sepal.length', 'sepal.width', 'petal.length', 'petal.width'],
                      data=[[request.sepal.length, request.sepal.width, request.petal.length, request.petal.width]])

    y_pred = np.argmax(model.predict(df))
    
    apr_model.log_prediction(
        id=str(uuid.uuid4()),
        features=aporia.pandas.pandas_to_dict(df),
        predictions={"variety": y_pred},
    )
    
    return {"flower": flower_name_by_index[y_pred]}


def main():
    uvicorn.run(app, host="0.0.0.0", port=8000)

if __name__ == "__main__":
    main()	

Finally, create a model drift monitor.

Model Template: CI/CD

We’ll now implement a basic CI/CD pipeline for our model template.
This pipeline will run when the data scientist pushes a commit to the main branch. It will deploy the model to our Kubernetes cluster.
We’re going to use GitHub Actions for CI/CD, which is GitHub’s native CI/CD system.
Our GitHub Action is going to be very simple – all it’s going to do is run make deploy, and the real deployment logic will be implemented in a Makefile. This is useful in case you want to deploy the model from your local computer – for example, if GitHub Actions is down and you need to deploy urgently 🙂
Let’s start by adding the Makefile:

SHELL := /bin/bash

STACK_NAME=dev
BASE_STACK_NAME=mlplatform-infra/dev
PULUMI_CMD=pulumi --non-interactive --stack $(STACK_NAME) --cwd infra/

install-deps:
	curl -fsSL https://get.pulumi.com | sh
	npm install -C infra/

	sudo pip3 install poetry --upgrade
	poetry install

train:
	poetry run train

deploy:
	$(PULUMI_CMD) stack init $(STACK_NAME) || true
	$(PULUMI_CMD) config set aws:region $(AWS_REGION)
	$(PULUMI_CMD) config set baseStackName $(BASE_STACK_NAME)
	$(PULUMI_CMD) config set runID $(shell poetry run train | awk '/Run ID/{print $$NF}')
	$(PULUMI_CMD) up --yes --skip-preview

This Makefile contains 3 commands:

  • make install-deps
    • Install Pulumi and its dependencies
    • Create a virtualenv and install dependencies using Poetry
  • make train
    • Train the model
  • make deploy
    • Train the model and take the MLFlow Run ID
    • Run Pulumi with this Run ID to deploy it to the K8s cluster

Next, we’ll create the GitHub Action that deploys the model on push to the main branch.
Create .github/workflows/deploy.yml:

name: Deploy
on:
  push:
    branches:
      - main
jobs:
  deploy:
    runs-on: ubuntu-20.04
    steps:
      - name: Checkout
        uses: actions/checkout@master

      - name: Install dependencies
        run: make install-deps

      - name: Deploy
        run: make deploy
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_REGION: ${{ secrets.AWS_REGION }}
          AWS_ECR_ACCOUNT_URL: ${{ secrets.AWS_ECR_ACCOUNT_URL }}
          PULUMI_ACCESS_TOKEN: ${{ secrets.PULUMI_ACCESS_TOKEN }}

It takes AWS credentials and Pulumi access token from GitHub Secrets.
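You can add these secrets in the repository settings, or with the GitHub CLI – each command below prompts for the value:

gh secret set AWS_ACCESS_KEY_ID
gh secret set AWS_SECRET_ACCESS_KEY
gh secret set AWS_REGION
gh secret set AWS_ECR_ACCOUNT_URL
gh secret set PULUMI_ACCESS_TOKEN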

Model Template: DVC Integration

Versioning your data is extremely important – when there’s an issue with your model in production and you need to debug it, you really want to know what your training set looked like.

If your datasets are super small you could use Git for that. If not, you’ll need some other method to track versions of your datasets. In this guide, we’ll use DVC, which stores your data in an S3 bucket.

First, let’s create an S3 bucket for DVC. In your mlplatform-infra repo, add the following code:

// Create S3 bucket for DVC
const dvcBucket = new aws.s3.Bucket("dvc-bucket", {
  acl: "public-read-write",
});

export const dvcBucketURI = dvcBucket.bucket.apply(
    (bucketName: string) => `s3://${bucketName}`);

Next, we need to configure our model template to use this DVC repo. Inside the model template run:

cd data/
poetry run dvc init --subdir
poetry run dvc remote add --project -d s3 <DVC_BUCKET_URI>

You can fetch the DVC Bucket URI from Pulumi’s outputs:

pulumi stack output dvcBucketURI

Awesome, we should have DVC configured by now. Let’s add our iris.csv to DVC:

poetry run dvc add ./iris.csv
poetry run dvc push
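After dvc add, DVC writes a small pointer file (iris.csv.dvc) next to the data, and that pointer file is what you commit to Git instead of the data itself. It looks roughly like this – the hash and size below are placeholders:

outs:
- md5: 717820ef0af287ff346c5cabfb4c612c
  size: 3858
  path: iris.csv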

To make CI/CD work, we need to pull from DVC before each training run. Add this to the Makefile:

dvc-pull:
	poetry run dvc --cd data/ pull
	
train: dvc-pull
	poetry run train

deploy: dvc-pull
	$(PULUMI_CMD) stack init $(STACK_NAME) || true
	$(PULUMI_CMD) config set aws:region $(AWS_REGION)
	$(PULUMI_CMD) config set baseStackName $(BASE_STACK_NAME)
	$(PULUMI_CMD) config set runID $(shell poetry run train | awk '/Run ID/{print $$NF}')
	$(PULUMI_CMD) up --yes --skip-preview

Model Template: Cookiecutter

Cookiecutter lets us turn the model template into a real template – when users clone it, they’ll be able to easily change the name of the model, its author, and so on.

Start by adding a cookiecutter.json file:

{
  "full_name": "Alon Gubkin",
  "email": "alon@aporia.com",
  "project_name": "My Model",
  "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '-') }}",
  "module_name": "{{ cookiecutter.project_name.lower().replace(' ', '_').replace('-', '_') }}",
  "project_short_description": "My awesome model!",
  "version": "0.1.0"
}

These are the variables that users can easily change when they clone the template.

Next, search for all occurrences of my_model and change them to:

   {{ cookiecutter.module_name }} 

This works in both file names and file content.
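For example, after the substitution, the relevant parts of pyproject.toml would look something like this (exact fields depend on your Poetry setup):

[tool.poetry]
name = "{{ cookiecutter.project_slug }}"
version = "{{ cookiecutter.version }}"
description = "{{ cookiecutter.project_short_description }}"
authors = ["{{ cookiecutter.full_name }} <{{ cookiecutter.email }}>"]

[tool.poetry.scripts]
train = "src.{{ cookiecutter.module_name }}.training.train:main"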

The user can then clone the model template by running:

   cookiecutter <your-github-organization>/model-template

What’s missing from this?

Well… a lot actually!

Here’s a partial list:

  • HTTPS & Authentication
  • Environments (staging, production)
  • Common library for preprocessing, postprocessing, etc
  • Model input & validation
  • Training orchestration
  • and probably much more!

Thank you for reading!

If you’re interested in this project, make sure to star the GitHub repo 🙂
