DDC – Direct Data Connectors: Monitoring ML Models at Scale

Alon Gubkin | 6 min read | Jan 05, 2023

Intro

We are excited to announce Direct Data Connectors (DDC), a novel way to monitor your Machine Learning models in production by connecting directly to your training and inference datasets. DDC allows you to monitor without duplicating any of your data. You can now monitor billions of predictions without data sampling, production code changes, or hidden cloud costs.

By simply connecting Aporia to a database where you already store your model predictions, you immediately get fully-customizable ML dashboards tailored to your use case, customizable drift detection, live alerting, XAI, and root-cause analysis tools at your fingertips. Getting started with ML Monitoring has never been easier, and we are releasing this new capability with support for BigQuery, Amazon S3, Athena, Glue Data Catalog, Delta Lake, Postgres, Redshift, Snowflake, Azure Data Lake Storage, and Databricks. We are continuously adding more connectors.

The Current State of Monitoring & Why DDC is Essential

Looking across the ML monitoring market, we see a gap between the flexibility, efficiency, and security that organizations expect from their monitoring solutions and what they actually get from solutions that act solely as inference stores.

This gap and the following challenges are why DDC is essential to getting the most out of your production models: 

1. Monitoring models with an SDK is cumbersome – With traditional ML Monitoring solutions, integrating a new model requires importing an SDK into your production code. This means you’re required to prioritize this integration task as part of a development sprint and take it through staging, testing, and production. This process is cumbersome, requires the assistance of Software Engineers, and takes a very long time just to integrate a single model: the average integration time per model is 3 weeks.
Integrating Model Monitoring with SDK

2. Outrageous cloud costs – ML Monitoring solutions that are based on databases like Apache Druid, Elasticsearch, or ClickHouse can quickly become extremely expensive, reaching $10,000+ monthly in cloud costs, in addition to the monthly maintenance fees that accompany these databases.

How cloud costs rise when duplicating production data

3. Data sampling comes with a distorted view — Many ML use cases require processing billions of predictions – common examples include recommendation systems, search ranking models, large fraud detection models, and some types of demand forecasting models.

As a result, many of the companies we spoke with were forced to monitor only a small random sample of their data in production. Unfortunately, with small samples of data, ML monitoring becomes highly inaccurate – issues go unnoticed, false positive alerts are common, and monitoring drift, bias, or fairness issues becomes ineffective.

4. Production data duplication — When implementing a monitoring solution that uses an SDK / Importer for reporting data, these systems often store a copy of your data in their own proprietary format in their database.

This results in the following:

  • Vendor lock-in: ML teams who rely on the ML Monitoring solution as their production inference DB find themselves at a huge risk of losing all their production data if they ever wish to switch to another ML Monitoring solution.
  • No single source of truth: Duplicating production data across two databases – an internal one, as well as the monitoring solution – could end in data discrepancy as there is no guarantee that both databases will get updated and synchronized correctly. As a result, it might be difficult to know which data is reliable.
  • Doubling the cloud costs: As data is being duplicated, you might find yourself paying twice for storing and processing the same data. When dealing with a large scale of petabytes, this could result in pricey cloud invoices.
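
To put rough numbers on the sampling problem from point 3, here is a minimal Python sketch. It is a back-of-the-envelope calculation and not part of Aporia: the function name and every figure in it (one billion predictions, a segment that makes up 0.1% of traffic, a 5% positive rate, a 0.01% sample) are assumptions chosen purely for illustration. It relies only on the standard binomial formula for the standard error of a proportion.

```python
import math


def sampled_segment_stats(total_predictions, sample_fraction,
                          segment_fraction, positive_rate):
    """Return (expected rows for a segment, standard error of its positive rate).

    All inputs are illustrative assumptions, not measurements.
    """
    # Expected number of rows from this segment that survive the sampling.
    expected_rows = total_predictions * sample_fraction * segment_fraction
    # Binomial standard error of a proportion estimated from that many rows.
    std_error = math.sqrt(positive_rate * (1 - positive_rate) / expected_rows)
    return expected_rows, std_error


# Monitoring all of the data (no sampling).
full_rows, full_se = sampled_segment_stats(1_000_000_000, 1.0, 0.001, 0.05)

# Monitoring a 0.01% random sample of the same data.
sample_rows, sample_se = sampled_segment_stats(1_000_000_000, 0.0001, 0.001, 0.05)

print(f"full data: {full_rows:,.0f} rows, standard error {full_se:.4f}")
print(f"sample:    {sample_rows:,.0f} rows, standard error {sample_se:.4f}")
```

Under these assumed numbers, the sample keeps only about 100 rows for the segment, and the standard error on its positive rate is around two percentage points, so a move from 5% to 7% would be hard to tell apart from noise. On the full data, the same segment has about a million rows and the error is roughly 100 times smaller.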

Introducing DDC: A Revolutionary Path to Secure, Confident, & Accurate Monitoring 

DDC is a transformative technology that empowers ML teams to monitor and track their ML models by seamlessly integrating Aporia with their production database. Because Aporia reads directly from your existing data lake, you can effortlessly monitor billions of predictions at minimal cloud costs.

The Head of Data Science from a known e-commerce platform in the US, managing billions of dollars in transactions annually, shared their experience with DDC – “Integrating Aporia’s DDC directly to our BigQuery was easier than expected, and we were able to onboard a dozen models in less than a day.”


With Aporia’s DDC, model monitoring is made easy, helping your ML teams shine in production, and check off necessary tasks that benefit the entire organization:

✅ Monitoring models with DDC is easy – ~7 minutes for model integration
✅ Clear and low cloud costs
✅ Monitor ALL your data at once
✅ Your data stays yours
✅ No vendor lock-in
✅ A single source of truth 


We see more and more ML teams who create a centralized store for their production inference data. By doing so, they can audit and investigate historical data, have more quality data for training, and monitor their models in a matter of minutes. If you aren’t already storing your predictions, read our quick guide on Storing Your Predictions.

By decoupling the storage of inference data from the monitoring system, your data stays yours, in your own format, in your data store. There is zero risk of losing your precious data with a vendor-proprietary database.
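
To make the idea of monitoring data where it already lives more concrete, here is a minimal sketch in Python. It is not Aporia's implementation or API: it uses an in-memory SQLite database as a stand-in for your warehouse, and the `predictions` table, its columns, the baseline mean, and the alert threshold are all hypothetical. The point is only that a basic monitoring aggregate can be computed with a single SQL query over the table you already write predictions to, without copying rows anywhere else.

```python
import sqlite3

# Stand-in for your existing data store (hypothetical schema: one row per prediction).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (day TEXT, model_version TEXT, score REAL)")
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?, ?)",
    [
        ("2023-01-01", "v1", 0.20), ("2023-01-01", "v1", 0.30),
        ("2023-01-02", "v1", 0.25), ("2023-01-02", "v1", 0.35),
        ("2023-01-03", "v1", 0.70), ("2023-01-03", "v1", 0.80),
    ],
)

BASELINE_MEAN = 0.27  # assumed mean score on the training data
THRESHOLD = 0.15      # assumed maximum acceptable shift in the mean

# The aggregation runs inside the database; only one summary row per day comes back.
DAILY_STATS_SQL = """
SELECT day, COUNT(*) AS n, AVG(score) AS mean_score
FROM predictions
GROUP BY day
ORDER BY day
"""


def daily_drift(connection, baseline_mean, threshold):
    """Flag days whose mean prediction score moved away from the baseline."""
    report = []
    for day, n, mean_score in connection.execute(DAILY_STATS_SQL):
        shift = abs(mean_score - baseline_mean)
        report.append(
            {"day": day, "n": n, "mean_score": mean_score, "drifted": shift > threshold}
        )
    return report


report = daily_drift(conn, BASELINE_MEAN, THRESHOLD)
for row in report:
    print(row)
```

In this toy data, only the third day is flagged, because its mean score moves far from the assumed baseline. A mean shift is a deliberately simple check; a real drift metric would compare full distributions, but the pattern of pushing the aggregation down to the data store stays the same.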

How to Use DDC in Aporia

With DDC enabled, integrating Aporia to your data source is accomplished in only three simple steps:

1. Connect your data source (not limited to the databases displayed):


2. Link your Dataset:


3. Define your model schema and start monitoring your production predictions:

That was simple. In just a few short clicks, monitoring is made easy, secure, and cost-saving. Now, just wait for insights to pour in and start showcasing the value of your predictions.


Conclusion

With Aporia’s DDC, model integration is as easy as writing an SQL query, and can be completed in minutes.  

If you’d like to learn more about Direct Data Connectors and see how they can benefit your organization, please reach out to us.
