
From data to ML: Building an end-to-end machine learning pipeline with Aporia on Snowflake

Alon Gubkin
5 min read · Aug 01, 2023

In this tutorial, we will build a robust end-to-end machine-learning pipeline leveraging Snowflake’s Snowpark and Aporia. We will train and deploy models, store inference data in the Snowflake Data Cloud, and integrate Aporia for ML observability, monitoring, and improving model performance in production.

Step 1: Model training with Snowpark

Snowpark is Snowflake’s developer framework: a set of runtimes and libraries for securely deploying and processing non-SQL code inside Snowflake, letting you build data pipelines with familiar programming languages and constructs. Explore the Snowpark documentation.

1.1 Writing data transformations in Snowpark

Using Snowpark, write data transformations and feature engineering steps for your ML model. You can write these using DataFrame APIs in languages like Python, Scala, or Java. We strongly recommend checking out Snowflake’s guides for a more in-depth understanding.
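
For instance, here is a minimal Snowpark (Python) sketch of a transformation step that builds features for the fraud model used later in this tutorial. The connection parameters, table names, and column names are placeholders for illustration, not part of any real pipeline.

```python
# A minimal Snowpark (Python) sketch of a feature engineering step.
# All connection parameters, table names, and column names are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark import functions as F

# In practice, pull these values from a secrets manager rather than hard-coding them.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

raw = session.table("RAW_TRANSACTIONS")

features = (
    raw
    .filter(F.col("TRANSACTION_AMOUNT") > 0)  # drop invalid rows
    .with_column("LOG_AMOUNT", F.log(10, F.col("TRANSACTION_AMOUNT")))
    .with_column("IS_FOREIGN", F.iff(F.col("COUNTRY") != F.lit("US"), F.lit(1), F.lit(0)))
    .select("ACCOUNT_ID", "LOG_AMOUNT", "IS_FOREIGN",
            "PREVIOUS_TRANSACTIONS_COUNT", "IS_FRAUD")
)

# Persist the feature table so it can be used for model training.
features.write.mode("overwrite").save_as_table("FRAUD_FEATURES")
```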

Step 2: Model deployment

Once your model is trained, you need to deploy it. You can do this through Snowflake’s External Functions, which let Snowflake call an external API that serves the model, or with a dedicated serving tool such as TensorFlow Serving.
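
If you trained with Snowpark, another option is to register the model as a Snowpark Python UDF so scoring runs inside Snowflake itself. A rough sketch, reusing the `session` and placeholder feature names from Step 1; the toy model, UDF name, and stage are illustrative assumptions:

```python
# Sketch: deploy a trained model as a Snowpark Python UDF so scoring runs inside Snowflake.
# Assumes the `session` from Step 1; the toy model and all names are illustrative.
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark.types import FloatType, IntegerType

# Stand-in for your real trained model.
model = LogisticRegression().fit([[5.0, 0, 24], [8.0, 1, 2]], [0, 1])

def predict_fraud(log_amount: float, is_foreign: int, prev_tx_count: int) -> float:
    # Return the predicted fraud probability for one transaction.
    return float(model.predict_proba([[log_amount, is_foreign, prev_tx_count]])[0][1])

session.udf.register(
    func=predict_fraud,
    name="PREDICT_FRAUD",
    return_type=FloatType(),
    input_types=[FloatType(), IntegerType(), IntegerType()],
    packages=["scikit-learn"],      # dependencies shipped with the UDF
    is_permanent=True,
    stage_location="@ml_models",    # placeholder stage
    replace=True,
)
```

Once registered, the UDF can be called from any SQL query, e.g. `SELECT PREDICT_FRAUD(LOG_AMOUNT, IS_FOREIGN, PREVIOUS_TRANSACTIONS_COUNT) FROM FRAUD_FEATURES`.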

For batch models, you can create a scheduled job on Snowflake to run the model on an hourly/daily/weekly/monthly basis.
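
One way to set up that schedule is a Snowflake task that runs the scoring query on a cron schedule. A hedged sketch, reusing the placeholder UDF and table names from the previous steps:

```python
# Sketch: schedule daily batch scoring with a Snowflake task.
# Assumes the PREDICT_FRAUD UDF and placeholder tables from the earlier steps.
session.sql("""
    CREATE OR REPLACE TASK SCORE_FRAUD_DAILY
      WAREHOUSE = MY_WH
      SCHEDULE = 'USING CRON 0 2 * * * UTC'   -- every day at 02:00 UTC
    AS
      INSERT INTO FRAUD_MODEL_INFERENCE (
        TRANSACTION_ID, TRANSACTION_TIMESTAMP, ACCOUNT_ID, TRANSACTION_AMOUNT,
        MERCHANT_CATEGORY, COUNTRY, PREVIOUS_TRANSACTIONS_COUNT, MODEL_PREDICTION
      )
      SELECT
        TRANSACTION_ID,
        TRANSACTION_TIMESTAMP,
        ACCOUNT_ID,
        TRANSACTION_AMOUNT,
        MERCHANT_CATEGORY,
        COUNTRY,
        PREVIOUS_TRANSACTIONS_COUNT,
        PREDICT_FRAUD(LOG_AMOUNT, IS_FOREIGN, PREVIOUS_TRANSACTIONS_COUNT)
      FROM NEW_TRANSACTIONS
""").collect()

# Tasks are created in a suspended state; resume to activate the schedule.
session.sql("ALTER TASK SCORE_FRAUD_DAILY RESUME").collect()
```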

Step 3: Storing inference data in the Snowflake Data Cloud

When your models are operational in production, they will generate predictions based on real-world data. This data, comprising the inputs and the predictions made by the models, is referred to as inference data. Storing this data is crucial for future analysis, monitoring, troubleshooting, and refining the models.

By setting up your models to record inference data within Snowflake’s Data Cloud, you are equipped with not just a robust and scalable storage environment, but also a treasure trove of data that can be harnessed to continually improve and optimize your models over time.

Now, configure your deployed model to log inference data (inputs and outputs) in the Snowflake Data Cloud. 

For example, create a table to store inference data. This example uses a Fraud model:

| transaction_id | transaction_timestamp | account_id | transaction_amount | merchant_category | country | previous_transactions_count | model_prediction | is_fraud |
|---|---|---|---|---|---|---|---|---|
| 1 | 2023-06-11 09:45:00 | 3425 | 450.00 | Electronics | US | 24 | 0.15 | No |
| 2 | 2023-06-11 10:32:00 | 1298 | 3200.00 | Jewelry | FR | 10 | 0.89 | Yes |
| 3 | 2023-06-11 11:17:00 | 8742 | 95.00 | Food | US | 32 | 0.03 | No |
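
If the model is served outside Snowflake, its online predictions can be appended to this table with Snowpark. A minimal sketch, assuming the table above is named FRAUD_MODEL_INFERENCE, reusing the `session` from Step 1, and using made-up values for the sample row:

```python
# Sketch: append an online inference record to the inference table above.
# Assumes the `session` from Step 1 and a table named FRAUD_MODEL_INFERENCE.
import pandas as pd

record = pd.DataFrame([{
    "TRANSACTION_ID": 4,
    "TRANSACTION_TIMESTAMP": "2023-06-11 12:05:00",
    "ACCOUNT_ID": 5561,
    "TRANSACTION_AMOUNT": 129.99,
    "MERCHANT_CATEGORY": "Electronics",
    "COUNTRY": "US",
    "PREVIOUS_TRANSACTIONS_COUNT": 7,
    "MODEL_PREDICTION": 0.22,
    "IS_FRAUD": None,   # ground truth is typically backfilled later
}])

# Creates the table on first write if it does not exist yet.
session.write_pandas(record, "FRAUD_MODEL_INFERENCE", auto_create_table=True)
```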

Step 4: Integrating Aporia for ML observability

Connect your Snowflake inference tables to Aporia for model monitoring and visibility. This is crucial for understanding and improving model performance, and identifying data drift quickly. 

Aporia can be seamlessly integrated with Snowflake, ensuring that your data remains securely within the Data Cloud. This is crucial for data governance and security, as it ensures that your data is not exposed outside of your managed environment.

4.1 Connecting Aporia to Snowflake

Connect Aporia to your Snowflake account by selecting Snowflake as the data source and providing the necessary credentials.
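
The connection itself is configured in the Aporia UI, but on the Snowflake side a common pattern is to provision a dedicated read-only role and user for the integration to use. A hedged sketch of the Snowflake grants; all object names are placeholders, not Aporia requirements:

```python
# Sketch: a dedicated read-only Snowflake role/user for the monitoring integration.
# All object names and the grant scope are placeholders; follow your own governance policy.
for stmt in [
    "CREATE ROLE IF NOT EXISTS APORIA_READER",
    "GRANT USAGE ON WAREHOUSE MY_WH TO ROLE APORIA_READER",
    "GRANT USAGE ON DATABASE ML_DB TO ROLE APORIA_READER",
    "GRANT USAGE ON SCHEMA ML_DB.PUBLIC TO ROLE APORIA_READER",
    "GRANT SELECT ON TABLE ML_DB.PUBLIC.FRAUD_MODEL_INFERENCE TO ROLE APORIA_READER",
    "CREATE USER IF NOT EXISTS APORIA_SVC PASSWORD = '<strong-password>' DEFAULT_ROLE = APORIA_READER",
    "GRANT ROLE APORIA_READER TO USER APORIA_SVC",
]:
    session.sql(stmt).collect()
```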

4.2 Linking datasets and mapping model schema

Next, it’s time to link your dataset and map model schema. Choose the dataset (table) where your inference data is stored. 

Map your model schema, and within minutes you can start looking at your models in a whole new way. That’s it! Aporia is now up and running—time to dive into observability. 

4.3 Centralize model management

Get a unified view of all your models under a single hub, and see an overview of data integrity, behavior, and model performance. 

4.4 Monitor model performance

With Aporia, easily monitor billions of predictions at once directly from your Snowflake data without duplicating or moving any data. Build monitors to detect drift, bias, degradation, and data integrity issues at scale. Get live alerts directly to your preferred communication channels (Slack, Teams, PagerDuty, Jira, email, Webhooks) and investigate them with Aporia’s Production IR. 

4.5 Track performance metrics

Easily track metrics like accuracy, precision, recall, and data drift, or create custom metrics to view what matters to you and ensure data science and business goals are aligned. Become a widget wizard, and tailor model visibility with Aporia’s ML and business dashboards. 

4.6 Investigate alerts and gain insights

Once your monitors fire an alert directly to your Slack, MS Teams, or email, you can easily drill down into your production data and investigate the root cause. 

Now it’s time to use Aporia Production IR to pinpoint the root cause of your alert. Analyze and explore your data by drift, segment, and distribution in a collaborative, notebook-like experience to gain deep insights into your model’s behavior and improve model performance. 

For NLP, LLM, and CV models use Aporia’s Embedding Projector to visualize your unstructured data and find problematic patterns and clusters. 

As part of your investigation process, you’ll want to learn more about your model’s decision making process, and Aporia’s Explainable AI toolkit allows you to uncover feature impact, communicate prediction logic to key stakeholders, and simulate ‘what if’ scenarios. 

Conclusion

By integrating Snowflake and Aporia, you can efficiently train and deploy models with Snowpark, and monitor and manage them in production with Aporia. This combination enables continuous model improvement, prompt issue resolution, and valuable insights to perfect your ML pipeline.

Want to learn more about Aporia on Snowflake? Feel free to reach out to us.
