Data Observability

The Ultimate Data Observability Checklist

Stephen Guerguy

Stephen is on Monte Carlo's growth team.

Your data team is investing in data observability to tackle your company’s new data quality initiative. Great! But what does this mean for you? 

Here’s everything you should keep in mind when building or buying a data observability platform. 

As data becomes an increasingly important part of decision making and product development, data reliability and quality are top of mind for data engineering and analytics teams.

To address this need, teams are becoming increasingly bullish on data observability, a new layer of the modern data stack that allows teams to keep tabs on the health of their most critical analytics and build organizational trust in data at scale. Just as robust observability and monitoring help engineering teams prevent application downtime, data observability tools provide visibility, automation, and alerting that help data teams minimize and prevent data downtime.

Investing in data observability is becoming increasingly important as companies collect more and more (often third-party) data. Introducing new data sources and expanding access to new data consumers also naturally leads to more complex pipelines—which increases the opportunities for missing, stale, or otherwise inaccurate data to negatively affect your business.

With data observability, teams can answer common questions like:

  • Is my data reliable? 
  • Where does my data live, and when will it be updated?
  • How do I know when changes have been made to this data?
  • Which data assets are most important to my business? How are they being used? Are they fresh and accurate? 

As the need for data observability increases, some teams are building their own tools, while others look to existing solutions. While we couldn’t be more excited about this growing demand, we also understand that it can be challenging to cut through the noise and understand exactly what your data observability platform needs to deliver. 

The following checklist will cover specific outcomes that a robust data observability platform should deliver, organized into four key categories: 

  • Knowing about data downtime
  • Troubleshooting data issues
  • Preventing data downtime in the first place
  • Protecting your data with a security-first approach

Let’s dive in.

Knowing about data downtime

At a fundamental level, a data observability platform should monitor your data’s health and alert your team when pipelines break or jobs don’t run, well before your downstream data consumers notice the issue in their dashboards or queries. Knowing which issues to monitor for can be overwhelming, which is where an ML-driven approach is critical. Instead of relying on manual tests to check for data quality issues, your data observability platform should automatically surface anomalies in freshness, volume, and schema.

The following are two examples of common data issues your data observability platform should automatically detect. 

In the first example, we see a significant, unexpected drop in data volume within a table, indicating a deletion of records that may or may not have been planned.

Anomalous data volume detection
Image courtesy of Monte Carlo.
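To make the volume check concrete, here’s a minimal sketch of the idea using a simple z-score over hypothetical daily row counts. A production platform would learn seasonality and trends automatically rather than relying on a static baseline like this:

```python
from statistics import mean, stdev

def detect_volume_anomaly(history: list[int], latest: int, z_threshold: float = 3.0) -> bool:
    """Flag the latest row count if it deviates sharply from recent history."""
    if len(history) < 7:
        return False  # not enough history to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold

# Daily row counts for a table over the past two weeks (hypothetical values).
history = [10_120, 10_340, 9_980, 10_210, 10_455, 10_060, 10_300,
           10_150, 10_400, 10_290, 10_080, 10_360, 10_240, 10_190]
print(detect_volume_anomaly(history, latest=2_450))  # True: likely record deletion
```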

In the second example, we see a schema change to JSON data stored within a table field, which similarly may or may not be planned, but will absolutely require downstream changes when parsing the JSON values.

Data observability schema changes
Your data observability solution should leverage machine learning to proactively identify when data freshness, volume, schema, distribution, or lineage doesn’t match expectations. Did 50 tables suddenly turn into 5,000? Data observability should ensure you’re the first to know about and resolve this incident. Image courtesy of Monte Carlo.
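The schema case can be sketched in a similar spirit. The toy function below (all payloads hypothetical) diffs the top-level keys and value types of two JSON records stored in a table field, surfacing exactly the kind of drift that would break downstream parsing:

```python
import json

def json_schema_drift(baseline_record: str, new_record: str) -> dict:
    """Compare top-level keys and value types between two JSON payloads."""
    old = {k: type(v).__name__ for k, v in json.loads(baseline_record).items()}
    new = {k: type(v).__name__ for k, v in json.loads(new_record).items()}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }

print(json_schema_drift(
    '{"user_id": 42, "plan": "pro"}',
    '{"user_id": "42", "plan": "pro", "seats": 5}',
))
# {'added': ['seats'], 'removed': [], 'retyped': ['user_id']}
```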
Proactively monitor and alert for data quality issues

This is table stakes for any data observability platform. Just as software engineers rely on monitoring to alert them to outages or downtime, data teams should be able to rely on monitoring to automatically notify them of anomalies in data health rather than waiting for an internal consumer or customer to notice something’s gone wrong (which happens to a staggering 57% of businesses).

Alerts routed intelligently based on data set assignments and incident type

Alerts are only helpful if the proper team (i.e., the data owner) receives them. Your data observability platform should enable routing based on assignments and incident types.

Centralize, standardize, and automate data quality monitoring processes

Data observability should apply across your entire organization, standardizing and centralizing data quality monitoring to ensure every team is speaking the same language and operating from the same playbook. 

Automatically generate rules based on historical data

With machine learning, your data observability platform should be able to analyze trends in your data to automatically develop hundreds of rules without anyone in your organization needing to write a single line of code—identifying anomalies, null values, or missing data, and alerting your team accordingly for further investigation.  
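As a rough illustration of how such rules might be derived, consider learning a null-rate ceiling for a field from its own history. The values and the mean-plus-k-sigma heuristic below are hypothetical; a real platform would fit far richer models across hundreds of fields:

```python
from statistics import mean, stdev

def learn_null_rate_rule(historical_null_rates: list[float], k: float = 3.0) -> float:
    """Derive a per-field null-rate ceiling from its own history (mean + k * sigma)."""
    mu = mean(historical_null_rates)
    sigma = stdev(historical_null_rates) if len(historical_null_rates) > 1 else 0.0
    return min(1.0, mu + k * sigma)

# Hypothetical daily null rates observed for a field over two weeks.
history = [0.011, 0.009, 0.012, 0.010, 0.013, 0.008, 0.011,
           0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.012]
threshold = learn_null_rate_rule(history)
today = 0.087
if today > threshold:
    print(f"ALERT: null rate {today:.1%} exceeds learned ceiling {threshold:.1%}")
```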

Provide the ability for data engineers to set their own rules and thresholds based on business-specific requirements

While automation saves manual labor, your data organization knows your data assets best. As a result, your data observability platform should include the flexibility for data engineers, data analysts, and data scientists to set their own thresholds to monitor data according to business needs or as new pipelines and projects are introduced.  

The aforementioned ML-based anomaly detection should function at the field level, not just the table level, based on historical metrics and statistics about your data.
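One way to picture these user-defined rules is as a small declarative layer on top of the automated monitors. The sketch below is purely illustrative; the tables, fields, and thresholds are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FieldRule:
    table: str
    field: str
    description: str
    check: Callable[[float], bool]  # returns True when the metric is healthy

# Business-specific thresholds set by the data team (hypothetical values).
rules = [
    FieldRule("orders", "amount_usd", "mean order value stays within plausible range",
              lambda mean_value: 5.0 <= mean_value <= 500.0),
    FieldRule("orders", "customer_id", "null rate stays under 0.1%",
              lambda null_rate: null_rate < 0.001),
]

# Field-level metrics gathered by the monitoring layer (hypothetical).
observed = {("orders", "amount_usd"): 12.40, ("orders", "customer_id"): 0.004}
for rule in rules:
    value = observed[(rule.table, rule.field)]
    status = "ok" if rule.check(value) else "BREACH"
    print(f"{rule.table}.{rule.field}: {status} ({rule.description})")
```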

Self-service data discovery

Data observability doesn’t just cover anomaly detection. A strong data observability platform will enable users to find the data they need by indexing all assets (including BI reports and dashboards) for search and discovery. Data health metrics, augmented with business context, give your data consumers superpowers when it comes to finding the right data for any given job.

Continuous monitoring and alerting against standard data quality dimensions as well as lineage

Alerts against quality dimensions are great: they notify you when data is missing, incomplete, or otherwise inaccurate. But without lineage—tracking all data integration operations on a record-by-record basis, across all pipelines and environments—you’ll have a much harder time figuring out what went wrong. Both should be provided by your data observability platform. 

Troubleshooting data issues

The second category of outcomes — troubleshooting data issues so you can fix them, fast — should focus on making it easy to understand what data broke and why, while also giving your team the tools to track downstream and upstream dependencies before they impact the business.

Troubleshooting data issues
Your data observability solution should give you easy, up-to-date access to end-to-end lineage, down to the field level, for your most important data assets, enabling easy and effective incident resolution. Image courtesy of Monte Carlo.
Understand the business impact of issues and prioritize issue resolution

An intelligent platform will be able to assess the likely business impact of data downtime based on downstream data consumption—and prioritize issue resolution accordingly. 

Intelligently route alerts to the appropriate team(s) via their preferred channel(s)

Alerts should be meaningful communications, not white noise. Your data observability platform should not only route alerts intelligently to the right team members, but also deliver those alerts via their preferred channels (think Slack, email, or SMS).
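A minimal sketch of what such routing logic might look like is shown below. The datasets, teams, and channels are hypothetical, and a real platform would call the actual Slack, email, or paging APIs rather than printing:

```python
# Hypothetical routing table: owner assignments and preferred channels per dataset.
ROUTES = {
    "marketing_attribution": {"owner": "growth-data", "channel": "slack:#growth-data-alerts"},
    "finance_daily_revenue": {"owner": "analytics-eng", "channel": "email:analytics-eng@example.com"},
}

# Critical incidents page the on-call rotation regardless of dataset preferences.
SEVERITY_OVERRIDES = {"critical": "pagerduty:data-oncall"}

def route_alert(dataset: str, severity: str, message: str) -> None:
    route = ROUTES.get(dataset, {"owner": "data-platform", "channel": "slack:#data-alerts"})
    channel = SEVERITY_OVERRIDES.get(severity, route["channel"])
    # In practice this would invoke the channel's API; here we just log the decision.
    print(f"[{severity}] -> {route['owner']} via {channel}: {message}")

route_alert("finance_daily_revenue", "critical", "Table not refreshed in 26h")
```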

Search for and explore data assets with a simple UI

When it comes to addressing broken data, speed and accessibility matter. A clean, simple interface helps your team search for and review data assets swiftly, using Catalog and Lineage functionality to identify the root cause of incidents.

Collect and display information required for investigating and resolving issues

Your platform should deliver all the relevant information required to resolve issues at the alert or case level, reducing the need for manual investigation and speeding up the time to resolution. 

Root cause analysis via end-to-end lineage

When your platform provides full visibility into your end-to-end lineage, you can take the guesswork out of root cause analysis by immediately accessing the full lifecycle of your data. 
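Conceptually, root cause analysis over lineage is a walk upstream from the broken asset. Here’s a toy illustration over a hypothetical lineage graph, ordering the upstream tables a responder might inspect first:

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to its direct upstream dependencies.
UPSTREAM = {
    "exec_dashboard": ["revenue_summary"],
    "revenue_summary": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}

def upstream_candidates(broken_asset: str) -> list[str]:
    """Breadth-first walk upstream: the assets to inspect first during root cause analysis."""
    seen, order, queue = set(), [], deque(UPSTREAM.get(broken_asset, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            order.append(node)
            queue.extend(UPSTREAM.get(node, []))
    return order

print(upstream_candidates("exec_dashboard"))
# ['revenue_summary', 'orders_clean', 'fx_rates', 'orders_raw']
```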

Collaborate to resolve incidents and mark them as notified, resolved, no action required, or false positive

Your platform should make life easier by keeping everyone affected by a given incident on the same page with visible status reports on every issue.

Access incident alert logs with complete incident information for analysis and reporting

Comprehensive logs are a must: they help every team member stay in the loop, reduce time-to-resolution for common issue types, and enable data leaders to analyze and identify recurring issues.

API integrations to ensure that monitoring and alerting about key elements of your pipeline can be routed to the appropriate party

A modern data observability platform should offer thorough API access to all attributes so you can customize workflows that fit into your existing tech stack and empower your team to save time, avoid duplicative efforts, and solve problems swiftly. 

Preventing data downtime

No matter how great your data observability platform is at identifying and alerting you to data downtime, your best efforts at maintaining reliable data pipelines will be all for naught if you can’t be proactive about incident prevention. A true end-to-end data observability solution will take data quality monitoring a step further by leveraging machine learning to understand historical patterns in your data through intelligent catalog and lineage capabilities. 

Data monitoring dashboard
A strong data observability platform will support data discovery through a data catalog and data monitoring dashboard that keeps users up-to-date on important elements of data health. Image courtesy of Monte Carlo.
Data quality and health information for all assets

Apart from alerting when something goes wrong, data observability tooling should provide a holistic view of data quality and health for all assets at all times based on both historical patterns and the current state of data affairs. 

Understand fields and field types at a glance through lineage

When a data observability platform includes lineage, you should gain a “catalog”-level view into data fields and field types that provides an at-a-glance overview of the relationships between data assets and overall data health. 

Understand critical reports and dashboards tied to table and field lineage

Data observability should enable data teams to understand which BI reports use which fields—helping ensure schema changes don’t unintentionally break BI reports or frustrate downstream data consumers. 

Analyze field-level metrics/statistics and how they’ve changed over time

Your data observability platform should be able to understand how the data is changing and evolving over time, so you can track important metrics at a granular level.

Dynamically update schema metadata and information

With a data observability platform, your metadata should update dynamically, without requiring manual changes. 

Identify what your key assets are and how they are being used

Your data observability platform should answer pertinent questions, including: who is using this data, what are they using it for, when are they accessing it, and how often does it go “down” (i.e., become stale, missing, etc.)? These insights into how key data assets are being used help enable version control and allow you to identify and deprecate outdated assets, ensuring you’re not using the wrong data for important business decisions or to power digital products.

Assess timeliness and completeness by analyzing table load pattern metrics

Your data platform should help you verify assumptions about your data’s freshness and volume, ensuring that your expectations meet reality.  
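For intuition, a freshness check of this kind might compare the time since a table’s last load against its typical load cadence. The sketch below uses hypothetical timestamps and a simple median-interval heuristic; a real platform would learn the cadence statistically:

```python
from datetime import datetime, timedelta, timezone

def freshness_breach(load_timestamps: list[datetime], now: datetime, k: float = 2.0) -> bool:
    """Compare time since the last load against the table's typical update cadence."""
    if len(load_timestamps) < 3:
        return False  # not enough history to infer a cadence
    ts = sorted(load_timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ts, ts[1:])]
    typical_gap = sorted(gaps)[len(gaps) // 2]  # median interval between loads
    return (now - ts[-1]).total_seconds() > k * typical_gap

# A table that normally loads every 24 hours (hypothetical timestamps).
base = datetime(2023, 5, 1, 6, 0, tzinfo=timezone.utc)
loads = [base + timedelta(days=i) for i in range(7)]
print(freshness_breach(loads, now=loads[-1] + timedelta(hours=60)))  # True: overdue
```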

Protecting your data

Protecting your data
When you’re choosing a data observability platform, security should always be top of mind. Consider options that rely on read-only access to your data and prioritize SOC 2 certification. Image courtesy of wk100mike at Shutterstock, available with author’s purchase of a Standard License.

Data observability platforms should never require data to leave your environment. Full stop. The best approaches will not require storing or accessing individual records, PII, or any other sensitive information. Instead, data observability should extract only query logs, metadata, and aggregated statistics about data usage to ensure that your most critical data assets are as trustworthy and reliable as possible.

Monitoring data at rest

Your data observability platform should be designed with a security-first approach that monitors data at rest, without requiring data to be extracted from your warehouse.

SOC 2-certified

Your data observability platform should be SOC 2-certified; SOC 2 audits cover controls including security, availability, processing integrity, confidentiality, and privacy.

Identify breaking changes quickly from query logs and schema change logs

Your platform should use metadata such as query logs and schema change logs to automatically alert when changes break your data pipelines. By using this metadata instead of the data itself, your platform provides an additional layer of security.
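To illustrate the idea, the sketch below cross-references a (hypothetical) schema change log against column usage derived from parsed query logs, flagging dropped columns that downstream consumers still reference:

```python
# Hypothetical metadata extracted from schema change logs.
schema_changes = [
    {"table": "orders", "change": "drop_column", "column": "discount_pct"},
    {"table": "orders", "change": "add_column", "column": "promo_code"},
]
# Columns each downstream consumer references, derived from parsed query logs.
query_log_usage = {
    "orders.discount_pct": ["revenue_summary", "exec_dashboard"],
    "orders.total_usd": ["revenue_summary"],
}

for change in schema_changes:
    if change["change"] == "drop_column":
        key = f"{change['table']}.{change['column']}"
        impacted = query_log_usage.get(key, [])
        if impacted:
            print(f"Breaking change: {key} dropped; impacts {', '.join(impacted)}")
# Breaking change: orders.discount_pct dropped; impacts revenue_summary, exec_dashboard
```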

Getting started with data observability

Ultimately, your data observability platform should enable everyone who interacts with your data, from an engineer to an end consumer, to answer one question with confidence: “Is this data trustworthy?”

Whether you build your own solution or buy one from a vendor, true data observability will bolster trust throughout your organization and open up new opportunities for teams to leverage data. At the same time, your data engineering team should get valuable time back from constant fire drills and manual investigations, allowing them to focus on innovating, solving problems, and helping your business grow. 

Interested in learning more about what to look for in a data observability platform? Reach out to Stephen and the rest of the Monte Carlo team.