How to build an end-to-end ML pipeline with Databricks & Aporia

Alon Gubkin
4 min read Jul 25, 2023

This tutorial will show you how to build a robust end-to-end ML pipeline with Databricks and Aporia. Here’s what you’ll achieve:

  1. Train and deploy models using Databricks and MLflow.
  2. Store inference data (the inputs and outputs of your production models) in the Databricks Lakehouse.
  3. Connect Aporia to this inference data to track key metrics and monitor for data drift and performance degradation at scale.
  4. Allow non-technical stakeholders to create dashboards correlating ML models to business metrics.

Train and Deploy Models on Databricks using MLflow

Your journey begins with training your models and deploying them to production using Databricks and MLflow.

Step 1: Model Training

For this, we highly recommend browsing through the Databricks Solution Accelerators, whose notebooks include examples for a variety of use cases.

In each notebook, you’ll find step-by-step instructions on how to train the model using Databricks.
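As a minimal sketch of this step, here’s what a training run tracked with MLflow might look like on Databricks. Scikit-learn and the Iris toy dataset stand in for your own framework and data:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy dataset as a stand-in for your own training data
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run() as run:
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Track parameters and metrics alongside the run
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))

    # Package the fitted model as an MLflow artifact for later deployment
    mlflow.sklearn.log_model(model, artifact_path="model")
    print(f"Run ID: {run.info.run_id}")
```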

Step 2: Model Deployment

Once you’ve successfully trained your models, you can use MLflow to package them for deployment. MLflow helps package the model in a format that can be used for inference, regardless of how or where it was initially trained. 
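Continuing the sketch above, the logged model can be registered and loaded back through MLflow’s flavor-agnostic pyfunc interface. The run ID and the model name iris-classifier are placeholders:

```python
import mlflow
import mlflow.pyfunc
import pandas as pd

run_id = "<your-training-run-id>"  # e.g. run.info.run_id from the training run

# Register the logged model; "iris-classifier" is an example name
mlflow.register_model(model_uri=f"runs:/{run_id}/model", name="iris-classifier")

# Load it back through pyfunc, regardless of how it was trained
model = mlflow.pyfunc.load_model("models:/iris-classifier/1")

# Score a batch of features (column names must match the training schema)
input_df = pd.DataFrame([{
    "sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4, "petal width (cm)": 0.2,
}])
print(model.predict(input_df))
```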

For batch models, you can create a scheduled job on Databricks to run the model on an hourly/daily/weekly/monthly basis.
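One common pattern for batch scoring, sketched below with hypothetical table names, is to wrap the registered model as a Spark UDF and schedule the notebook as a Databricks Job:

```python
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in Databricks notebooks

# Wrap the registered model as a Spark UDF for distributed batch scoring
predict_udf = mlflow.pyfunc.spark_udf(spark, model_uri="models:/iris-classifier/1")

# "prod.features" is a hypothetical Delta table of model inputs
features = spark.table("prod.features")
scored = features.withColumn("prediction", predict_udf(*features.columns))

# Persist the scored batch; schedule this as an hourly/daily Databricks Job
scored.write.format("delta").mode("append").saveAsTable("prod.predictions")
```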

For online models, you have two main options:

  1. Use Databricks Model Serving to launch a REST endpoint (see the sketch after this list).
  2. Use your own web service to call the model externally.
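For the first option, here’s a minimal sketch of calling a Model Serving endpoint over REST. The workspace URL, endpoint name, and token are placeholders:

```python
import os
import requests

# Placeholders: your workspace URL, endpoint name, and a Databricks token
url = "https://<your-workspace>.cloud.databricks.com/serving-endpoints/iris-classifier/invocations"
headers = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

payload = {"dataframe_records": [{
    "sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4, "petal width (cm)": 0.2,
}]}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())
```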

Store inference data in Databricks Lakehouse

In production, your models will be making predictions on real-world data. The inputs and outputs of these models are known as inference data. It’s important to store this data for future reference, debugging, and model improvement.

By configuring your deployed models to log their inference data to the Databricks Lakehouse, you gain not only safe storage but also a rich source of data for retraining your models and enhancing their performance over time.
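As a sketch of what that logging might look like for the batch job above (table names and columns are illustrative), each scoring run can append its inputs and outputs, plus useful metadata, to a Delta table:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# "prod.predictions" is the scored batch from the previous step
scored = spark.table("prod.predictions")

# Add metadata that makes later monitoring and debugging easier
inference_log = (
    scored
    .withColumn("prediction_timestamp", F.current_timestamp())
    .withColumn("model_version", F.lit("1"))
)

# Append to a Delta table in the Lakehouse (table name is illustrative)
inference_log.write.format("delta").mode("append").saveAsTable("prod.inference_log")
```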

Integrating Aporia for ML Observability

For the next step in the ML pipeline, we’ll connect the inference data to Aporia, an ML observability platform dedicated to monitoring ML models in production.

Aporia has a built-in integration with Databricks and does not send your data outside of the Lakehouse.

In three easy steps you can start monitoring billions of predictions and gain insights to improve model performance:

  1. Select your data source.
  2. Link your dataset.
  3. Map model schema.
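Before the schema-mapping step, it can help to confirm the column names and types of your inference table in Databricks (continuing with the hypothetical table from earlier):

```python
# Review the columns you'll map in Aporia's schema step:
# feature columns, prediction outputs, timestamp, and model version
spark.table("prod.inference_log").printSchema()
```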

The Magic of Centralized Model Management with Aporia

Managing multiple models separately can be daunting and often results in chaos and missed opportunities. Once integrated, Aporia simplifies this process by providing a unified hub for all your models, acting as a single source of truth for AI projects. This centralized view allows you to monitor billions of predictions at once and track key metrics across different models, giving you a holistic picture of your production ML pipeline.

For each model, your AI leaders, engineers, and data scientists can customize dashboards to track performance, drift, and business metrics.

By directly connecting to your inference data from your Lakehouse, Aporia can constantly monitor the model’s performance and detect any significant changes in behavior or drift in your data.

Alerts, analysis, and insights

When drift is detected, Aporia raises an alert directly to your communication channel of choice, be it Slack, Microsoft Teams, Jira, PagerDuty, Webhook, or email. 

You can then leverage the Aporia Production IR (Investigation Room) to investigate and explore your production data collaboratively with other team members, in a notebook-like experience. 

Drift analysis reveals when the drift started, where it originated, and which drifted features most impacted model predictions.
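For intuition about what drift detection measures (a generic illustration, not Aporia’s internal method), a feature’s training and production distributions can be compared with a two-sample statistical test:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Stand-ins for a feature's training and production distributions
training_values = rng.normal(loc=0.0, scale=1.0, size=10_000)
production_values = rng.normal(loc=0.4, scale=1.2, size=10_000)  # shifted: drift

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the
# production distribution differs from the training distribution
statistic, p_value = ks_2samp(training_values, production_values)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.2e}")
if p_value < 0.01:
    print("Feature has likely drifted")
```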

Segment analysis helps you identify problematic or excelling segments, taking the segment size and comparison metrics into account. 

Closing the Loop

With the Databricks and Aporia ML pipeline, you can effortlessly train, deploy, monitor, and manage your models within the comfort of your Databricks environment. This synergy enables you to continuously improve your models, promptly address issues, and ultimately provide better value to your users. 

ML observability is the heart of successful ML products. Aporia’s integration with Databricks Lakehouse empowers ML teams to effortlessly monitor all of their models, all in one place. This ensures that every model is held to the highest standard of performance and reliability, so organizations can truly rely on their ML initiatives to drive impactful business decisions.

Want to learn more about Aporia on Databricks? Drop us a line or try it out and see how easy ML observability can be.
