
From data to ML: Building an end-to-end machine learning pipeline with Aporia on Snowflake

Alon Gubkin
5 min read Aug 01, 2023


    In this tutorial, we will build a robust end-to-end machine-learning pipeline leveraging Snowflake’s Snowpark and Aporia. We will train and deploy models, store inference data in the Snowflake Data Cloud, and integrate Aporia for ML observability, monitoring, and improving model performance in production.

    Step 1: Model training with Snowpark

    Snowpark is Snowflake's developer framework: a set of runtimes and libraries for securely deploying and processing non-SQL code inside Snowflake, so you can build data pipelines using familiar programming constructs and languages like Python, Scala, and Java. Explore the Snowpark documentation.
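    To follow the rest of this tutorial, you'll need a Snowpark session. Here's a minimal sketch in Python, with placeholder connection parameters to replace with your own account details:

```python
# Minimal sketch: create a Snowpark session (placeholder credentials).
from snowflake.snowpark import Session

connection_parameters = {
    "account":   "<your_account_identifier>",
    "user":      "<your_user>",
    "password":  "<your_password>",
    "role":      "<your_role>",
    "warehouse": "<your_warehouse>",
    "database":  "<your_database>",
    "schema":    "<your_schema>",
}

session = Session.builder.configs(connection_parameters).create()
```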

    1.1 Writing data transformation in Snowpark

    Using Snowpark, write data transformations and feature engineering steps for your ML model. You can write these using DataFrame APIs in languages like Python, Scala, or Java. We strongly recommend checking out Snowflake’s guides for a more in-depth understanding.
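    As an illustration only (with hypothetical table and column names, using the fraud-detection example we'll carry through the rest of this tutorial), a Snowpark transformation followed by a simple training step might look like this:

```python
# Hedged sketch: Snowpark feature engineering on a hypothetical
# RAW_TRANSACTIONS table, followed by a simple scikit-learn training step.
import snowflake.snowpark.functions as F
from sklearn.ensemble import RandomForestClassifier

raw = session.table("RAW_TRANSACTIONS")

features = (
    raw
    .filter(F.col("TRANSACTION_AMOUNT") > 0)  # drop invalid rows
    .with_column(                             # example engineered feature
        "HIGH_AMOUNT",
        F.when(F.col("TRANSACTION_AMOUNT") > 1000, 1).otherwise(0),
    )
    .select(
        "TRANSACTION_AMOUNT",
        "HIGH_AMOUNT",
        "PREVIOUS_TRANSACTIONS_COUNT",
        "IS_FRAUD",
    )
)

# For modest data sizes, pull the features locally and fit a model
train_df = features.to_pandas()
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(
    train_df[["TRANSACTION_AMOUNT", "HIGH_AMOUNT", "PREVIOUS_TRANSACTIONS_COUNT"]],
    train_df["IS_FRAUD"],
)
```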

    Step 2: Model deployment

    Once your model is trained, you need to deploy it. You can do this through Snowflake's External Functions, which allow Snowflake to call external APIs, or use a model-serving tool like TensorFlow Serving.
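    Another common pattern, shown here only as a hedged sketch, is to keep inference inside Snowflake by registering the trained model as a Snowpark Python UDF (the `session` and `model` objects are assumed to carry over from Step 1, and the UDF name is hypothetical):

```python
# Hedged sketch: register the trained model as a Snowpark Python UDF so
# scoring runs inside Snowflake. PREDICT_FRAUD is a hypothetical name.
from snowflake.snowpark.types import FloatType
import snowflake.snowpark.functions as F

predict_udf = session.udf.register(
    # recompute the HIGH_AMOUNT feature inline from the raw amount
    lambda amount, prev_count: float(
        model.predict_proba([[amount, 1 if amount > 1000 else 0, prev_count]])[0, 1]
    ),
    name="PREDICT_FRAUD",
    return_type=FloatType(),
    input_types=[FloatType(), FloatType()],
    packages=["scikit-learn"],  # make sklearn available in the UDF runtime
    replace=True,
)

# Score new transactions without moving data out of Snowflake
scored = session.table("NEW_TRANSACTIONS").with_column(
    "MODEL_PREDICTION",
    predict_udf(F.col("TRANSACTION_AMOUNT"), F.col("PREVIOUS_TRANSACTIONS_COUNT")),
)
```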

    For batch models, you can create a scheduled job on Snowflake to run the model on an hourly/daily/weekly/monthly basis.
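    For example (a sketch with hypothetical object names, assuming the scoring UDF above has been registered as a permanent function), a Snowflake task can score new transactions on a schedule:

```python
# Hedged sketch: a Snowflake task that scores new transactions hourly.
# SCORE_TRANSACTIONS_HOURLY, FRAUD_INFERENCE, NEW_TRANSACTIONS, and ML_WH
# are hypothetical names.
session.sql("""
    CREATE OR REPLACE TASK SCORE_TRANSACTIONS_HOURLY
      WAREHOUSE = ML_WH
      SCHEDULE = '60 MINUTE'
    AS
      INSERT INTO FRAUD_INFERENCE
      SELECT
          t.*,
          PREDICT_FRAUD(t.TRANSACTION_AMOUNT, t.PREVIOUS_TRANSACTIONS_COUNT)
      FROM NEW_TRANSACTIONS AS t
""").collect()

# Tasks are created suspended; resume to start the schedule
session.sql("ALTER TASK SCORE_TRANSACTIONS_HOURLY RESUME").collect()
```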

    Step 3: Storing inference data in the Snowflake Data Cloud

    When your models are operational in production, they will generate predictions based on real-world data. This data, comprising the inputs and the predictions made by the models, is referred to as inference data. Storing this data is crucial for future analysis, monitoring, troubleshooting, and refining the models.

    By setting up your models to record inference data within Snowflake’s Data Cloud, you are equipped with not just a robust and scalable storage environment, but also a treasure trove of data that can be harnessed to continually improve and optimize your models over time.

    Now, configure your deployed model to log inference data (inputs and outputs) in the Snowflake Data Cloud. 

    For example, create a table to store inference data. This example uses a fraud detection model:

    | transaction_id | transaction_timestamp | account_id | transaction_amount | merchant_category | country | previous_transactions_count | model_prediction | is_fraud |
    |---|---|---|---|---|---|---|---|---|
    | 1 | 2023-06-11 09:45:00 | 3425 | 450.00 | Electronics | US | 24 | 0.15 | No |
    | 2 | 2023-06-11 10:32:00 | 1298 | 3200.00 | Jewelry | FR | 10 | 0.89 | Yes |
    | 3 | 2023-06-11 11:17:00 | 8742 | 95.00 | Food | US | 32 | 0.03 | No |
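    A hedged sketch of how this table might be created and populated from Snowpark (the FRAUD_INFERENCE name and column types are assumptions, not the tutorial's exact DDL):

```python
# Hedged sketch: create an inference table and append scored predictions.
session.sql("""
    CREATE TABLE IF NOT EXISTS FRAUD_INFERENCE (
        TRANSACTION_ID              INTEGER,
        TRANSACTION_TIMESTAMP       TIMESTAMP,
        ACCOUNT_ID                  INTEGER,
        TRANSACTION_AMOUNT          NUMBER(12, 2),
        MERCHANT_CATEGORY           VARCHAR,
        COUNTRY                     VARCHAR,
        PREVIOUS_TRANSACTIONS_COUNT INTEGER,
        MODEL_PREDICTION            FLOAT,
        IS_FRAUD                    VARCHAR  -- ground-truth label, filled in later
    )
""").collect()

# Append each scored batch (e.g. the `scored` DataFrame from Step 2),
# assuming its columns line up with the table schema
scored.write.mode("append").save_as_table("FRAUD_INFERENCE")
```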

    Step 4: Integrating Aporia for ML observability

    Connect your Snowflake inference tables to Aporia for model monitoring and visibility. This is crucial for understanding and improving model performance, and identifying data drift quickly. 

    Aporia can be seamlessly integrated with Snowflake, ensuring that your data remains securely within the Data Cloud. This is crucial for data governance and security, as it ensures that your data is not exposed outside of your managed environment.

    4.1 Connecting Aporia to Snowflake

    Connect Aporia to your Snowflake account by selecting Snowflake as the data source and providing the necessary credentials.


    4.2 Linking datasets and mapping model schema

    Next, it’s time to link your dataset and map model schema. Choose the dataset (table) where your inference data is stored. 


    Map your model schema, and within minutes you can start looking at your models in a whole new way. That’s it! Aporia is now up and running—time to dive into observability. 


    4.3 Centralize model management

    Get a unified view of all your models under a single hub, and see an overview of data integrity, behavior, and model performance. 


    4.4 Monitor model performance

    With Aporia, easily monitor billions of predictions at once directly from your Snowflake data without duplicating or moving any data. Build monitors to detect drift, bias, degradation, and data integrity issues at scale. Get live alerts directly to your preferred communication channels (Slack, Teams, PagerDuty, Jira, email, Webhooks) and investigate them with Aporia’s Production IR. 


    4.5 Track performance metrics

    Easily track metrics like accuracy, precision, recall, and data drift, or create custom metrics to view what matters to you and ensure data science and business goals are aligned. Become a widget wizard, and tailor model visibility with Aporia’s ML and business dashboards. 


    4.6 Investigate alerts and gain insights

    Once your monitors fire an alert directly to your Slack, MS Teams, or email, you can easily drill down into your production data and investigate the root cause. 


    Now it’s time to use Aporia Production IR to pinpoint the root cause of your alert. Analyze and explore your data by drift, segment, and distribution in a collaborative, notebook-like experience to gain deep insights into your model’s behavior and improve model performance. 


    For NLP, LLM, and CV models, use Aporia's Embedding Projector to visualize your unstructured data and find problematic patterns and clusters.


    As part of your investigation process, you'll want to learn more about your model's decision-making process. Aporia's Explainable AI toolkit allows you to uncover feature impact, communicate prediction logic to key stakeholders, and simulate 'what if' scenarios.

    Conclusion

    By integrating Snowflake and Aporia, you can efficiently train and deploy models with Snowpark, and monitor and manage production models with Aporia. This combination enables continuous model improvement, prompt issue resolution, and valuable insights to perfect your ML pipeline.

    Want to learn more about Aporia on Snowflake? Feel free to reach out to us, or try out our ML observability platform for yourself.
