Data Observability Tools: Data Engineering’s Next Frontier
To keep pace with the speed of innovation in data, data engineers need to invest not only in the latest data modeling and analytics tools, but also in technologies that can increase data accuracy and prevent broken ETL pipelines.
The solution? Data observability tools, the next frontier of data engineering and a pillar of the emerging data reliability category.
Data observability tools are technologies that use machine learning for automated data anomaly monitoring and alerting as well as data lineage to accelerate data incident resolution. They help an organization fully understand the health of the data in their system.
As companies become increasingly data-driven, the technologies underlying these rich insights have grown more and more nuanced and complex. While our ability to collect, store, aggregate, and visualize this data has largely kept up with the needs of modern data teams (think: domain-oriented data meshes, cloud warehouses, data visualization tools, and data modeling solutions), the mechanics behind data quality and data integrity have lagged.
No matter how advanced your data analytics dashboard is or how heavily you invest in the latest data lake, data warehouse, or lakehouse, your best-laid plans are all for naught if the data your stack ingests, transforms, and pushes downstream isn’t reliable. In other words, “garbage in” is “garbage out.”
Before we address what data reliability and a modern data observability platform look like, let’s address how unreliable, “garbage” data is created in the first place.
In this post:
- How good data breaks bad
- Data observability platform criteria and pillars
- Observability vs Data Observability vs ML Observability
- Open-source data testing tools vs data observability tools
- What’s next for data observability tools?
How good data breaks bad
After speaking with thousands of data engineering teams, I’ve noticed there are three primary reasons why good data turns bad:
- A growing number of data sources in a single data ecosystem,
- The increasing complexity of data pipelines, and
- Bigger, more specialized data teams.
More and more data sources
Nowadays, companies use anywhere from dozens to hundreds of internal and external data sources to produce data analytics and machine learning models. Any one of these sources can change in unexpected ways and without notice, compromising the data the company uses to make decisions.
For example, an engineering team might make a change to the company’s website, thereby modifying output of a data set (an unexpected schema change) that is key to marketing analytics. As a result, key marketing metrics may be wrong, leading the company to make poor decisions about ad campaigns, sales targets, and other important, revenue-driving projects.
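To make the schema change scenario concrete, here is a minimal sketch of how such a change could be caught by comparing two snapshots of a table's columns. The table, the column names, and the `diff_schema` helper are all hypothetical and chosen for illustration; a real tool would read the schema from the warehouse's metadata rather than from hard-coded dictionaries.

```python
from typing import Dict, List


def diff_schema(expected: Dict[str, str], observed: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {column_name: type_name} mappings and report the differences."""
    removed = sorted(set(expected) - set(observed))
    added = sorted(set(observed) - set(expected))
    retyped = sorted(
        col for col in set(expected) & set(observed)
        if expected[col] != observed[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical snapshots of a web events table before and after a website change.
yesterday = {
    "session_id": "string",
    "page_url": "string",
    "utm_source": "string",
    "duration_ms": "int",
}
today = {
    "session_id": "string",
    "page_url": "string",
    "campaign_source": "string",  # renamed upstream without notice
    "duration_ms": "float",       # type changed upstream without notice
}

changes = diff_schema(yesterday, today)
if any(changes.values()):
    print("Schema change detected:", changes)
```

In this made-up case, a marketing query that still selects `utm_source` would fail or return nothing, which is the kind of silent breakage described above.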
Increasingly complex data pipelines
Data pipelines are increasingly complex with multiple stages of processing and non-trivial dependencies between various data assets. With little visibility into these dependencies, any change made to one data set can have unintended consequences impacting the correctness of dependent data assets.
Something as simple as a change of units in one system can seriously impact the correctness of another system, as in the case of the Mars Climate Orbiter. A NASA space probe, the Mars Climate Orbiter was lost after one piece of software produced outputs in non-SI units while another expected SI units, bringing the spacecraft too close to the planet.
Like spacecraft, analytic pipelines can be extremely vulnerable to the most innocent changes at any stage of the process.
Bigger, more specialized data teams
As companies increasingly rely on data to drive smart decision making, they are hiring more and more data analysts, data scientists, and data engineers to build and maintain the data pipelines, analytics, and machine learning models that power their services and products, as well as their business operations.
Miscommunication or insufficient coordination is inevitable, and will cause these complex systems to break as changes are made. For example, a new field added to a data table by one team may cause another team’s pipeline to fail, resulting in missing or partial data. Downstream, this bad data can lead to millions of dollars in lost revenue, erosion of customer trust, and even compliance risk.
The good news about bad data? Data engineering is going through its own renaissance and we owe a big thank you to our counterparts in DevOps for some of the key concepts and principles guiding us towards this next frontier of data observability tools.
Data observability platform criteria and pillars
An easy way to frame the effect of “garbage data” is through the lens of software application reliability.
For the past decade or so, software engineers have leveraged targeted solutions like New Relic and Datadog to ensure high application uptime (in other words, working, performant software) while keeping downtime (outages and laggy software) to a minimum.
In data, we call this phenomenon Data Downtime. Data Downtime refers to periods of time when data is partial, erroneous, missing, or otherwise inaccurate, and it only multiplies as data systems become increasingly complex, supporting an endless ecosystem of sources and consumers.
Barr Moses introducing data downtime for the first time and how data observability tools can help reduce it.
By applying the same principles of software application observability and reliability to data, these issues can be identified, resolved and even prevented, giving data teams confidence in their data to deliver valuable insights.
Below, we walk through the five pillars of data observability. Each pillar encapsulates a series of questions which, in aggregate, provide a holistic view of data health. Maybe they’ll look familiar to you?
- Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
- Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
- Volume: has all the data arrived?
- Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
- Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision making?
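As a rough illustration of the first three pillars, the questions above can be turned into small checks. This is a sketch under stated assumptions, not how any particular platform implements them: the function names, the thresholds, and the sample values are all hypothetical, and a real system would compute the inputs from warehouse metadata and query results.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def is_fresh(last_loaded_at: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: was the table updated recently enough?"""
    return now - last_loaded_at <= max_age


def volume_ok(row_count: int, expected: int, tolerance: float) -> bool:
    """Volume: is the row count within a relative tolerance of the expected count?"""
    return abs(row_count - expected) <= tolerance * expected


def null_rate(values: List[Optional[float]]) -> float:
    """Distribution (completeness): fraction of missing values in a column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def out_of_range(values: List[Optional[float]], low: float, high: float) -> List[float]:
    """Distribution (accepted ranges): non-null values outside [low, high]."""
    return [v for v in values if v is not None and not (low <= v <= high)]


# Hypothetical observations for an orders table.
now = datetime(2024, 1, 2, 9, 0)
last_loaded_at = datetime(2024, 1, 1, 6, 0)   # 27 hours before `now`
order_totals = [19.99, None, 250.0, -5.0, 42.5]

fresh = is_fresh(last_loaded_at, now, max_age=timedelta(hours=24))
volume = volume_ok(row_count=9200, expected=10000, tolerance=0.10)
missing = null_rate(order_totals)
bad_values = out_of_range(order_totals, low=0.0, high=10000.0)

print(fresh, volume, missing, bad_values)
```

Fixed thresholds like these are only a starting point; schema and lineage need metadata about the table and its dependencies rather than the values themselves.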
A robust and holistic approach to data observability requires the consistent and reliable monitoring of these five pillars through a centralized interface in a data observability platform that serves as a central source of truth about the health of your data.
An effective, proactive data observability platform will connect to your existing stack quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies.
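To show what tracking downstream dependencies can look like, here is a minimal sketch that walks a lineage graph breadth-first. The asset names and the `downstream_assets` helper are hypothetical; in practice, lineage is usually derived from query logs or orchestration metadata rather than written by hand.

```python
from collections import deque
from typing import Dict, List


def downstream_assets(lineage: Dict[str, List[str]], start: str) -> List[str]:
    """Return every asset that depends on `start`, directly or transitively."""
    seen = {start}
    queue = deque([start])
    impacted: List[str] = []
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Hypothetical lineage: each key feeds the assets listed as its value.
lineage = {
    "raw.web_events": ["staging.sessions"],
    "staging.sessions": [
        "analytics.marketing_attribution",
        "analytics.daily_active_users",
    ],
    "analytics.marketing_attribution": ["dashboard.campaign_roi"],
    "analytics.daily_active_users": [],
}

impacted = downstream_assets(lineage, "raw.web_events")
print(impacted)
```

When an upstream table breaks, a list like `impacted` tells you which tables and dashboards to check and which stakeholders to notify.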
Additionally, it will automatically monitor your data-at-rest without requiring the extraction of data from your data store. This approach ensures that you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.
Data observability tools use machine learning models to automatically learn your environment and your data, and leverage anomaly detection techniques to let you know when data systems break. They also minimize false positives by taking into account not just individual metrics, but a holistic view of your data and the potential impact from any particular issue.
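As a simple stand-in for the learned models described above, the sketch below flags an unusual daily row count using a robust z-score based on the median and the median absolute deviation. This is a basic statistical technique, not the machine learning a commercial platform would use, and the row counts, function names, and the 3.5 threshold are illustrative assumptions.

```python
from statistics import median
from typing import List


def robust_z_score(history: List[float], latest: float) -> float:
    """Modified z-score of `latest` relative to the median of `history`."""
    med = median(history)
    mad = median([abs(x - med) for x in history])
    if mad == 0:
        # No spread in the history: any deviation at all is treated as extreme.
        return 0.0 if latest == med else float("inf")
    return 0.6745 * (latest - med) / mad


def is_anomalous(history: List[float], latest: float, threshold: float = 3.5) -> bool:
    """Flag `latest` when its modified z-score exceeds the threshold in magnitude."""
    return abs(robust_z_score(history, latest)) > threshold


# Hypothetical daily row counts for one table over the past week.
daily_rows = [10120, 9980, 10050, 10210, 9890, 10075, 10010]

print(is_anomalous(daily_rows, 10100))  # a typical day
print(is_anomalous(daily_rows, 4300))   # a day where most rows never arrived
```

The baseline comes from the table's own history, so nobody has to guess a threshold for each table by hand.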
This approach provides rich context that enables rapid triage and troubleshooting and effective communication with stakeholders impacted by data reliability issues. Unlike ad hoc queries or simple SQL wrappers, such monitoring doesn’t stop at “field X in table Y has values lower than Z today.”
Perhaps most importantly, such a data observability platform prevents data downtime incidents from happening in the first place by exposing rich information about data assets across these five pillars so that changes and modifications can be made responsibly and proactively.
Observability vs Data Observability vs ML Observability
It’s important to note that observability and ML observability tools belong to a completely different market and category than data observability solutions. There is considerable market confusion as a result of the similar naming conventions and the nascency of the space.
Observability tools are leveraged by software engineers to ensure high application uptime (in other words, working, performant software) while keeping downtime (outages and laggy software) to a minimum. Data observability tools, on the other hand, are leveraged by data engineers or other data professionals to ensure data pipeline uptime while keeping instances of bad data and broken dashboards to a minimum.
| |Observability|Data Observability|ML Observability|
|Built for|IT, software engineering|Data teams, data engineering|Data scientists|
|Purpose|Reduce app downtime|Reduce data downtime|Reduce drift of ML models|
|Monitors|The logs, metrics, and traces an application produces|Data volume, schema, distribution, and freshness|Accuracy, recall, precision, F1 score, MAE, RMSE|
|Visualizing dependencies & root cause analysis|Service maps|Data lineage|Performance tracing|
| |IaaS platforms like AWS, Azure, and Google Cloud Platform|Data storage, orchestration, transformation, and BI tiers|ML platforms like AWS SageMaker or TensorFlow|
| | |Implementing data mesh or data products|Root cause analysis of ML model failures; identifying deviations in data and performance from baselines|
| |Splunk Observability Suite|Monte Carlo|Arize|
Open-source data testing tools vs data observability tools
Similar to how software engineers use unit tests to identify buggy code before it’s pushed to production, data engineers often leverage tests to detect and prevent potential data quality issues from moving further downstream.
There are many open-source data testing solutions, with dbt and Great Expectations likely being the most used for this purpose.
Even the best testing processes are insufficient, however, because there are two types of data quality issues: those you can predict (known unknowns) and those you can’t (unknown unknowns).
Some teams will have hundreds(!) of tests in place to cover most known unknowns but they don’t have an effective way to cover unknown unknowns.
In a Medium article, Vimeo Senior Data Engineer Gilboa Reif describes how using data observability and dimension monitors at scale helps address the unknown unknowns gap that open-source and transformation tools leave open.
“For example, if the null percentage on a certain column is anomalous, this might be a proxy of a deeper issue that is more difficult to anticipate and test.”
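The difference between a test and a monitor can be sketched with the null percentage case from the quote. The hand-written test encodes a limit someone thought of in advance, while the monitor compares today's value against the column's own history. The numbers, function names, and the three-sigma rule here are illustrative assumptions, not a description of how Vimeo or any vendor implements this.

```python
from statistics import mean, stdev
from typing import List


def passes_null_rate_test(rate: float, limit: float = 0.05) -> bool:
    """Known unknown: an explicit rule written in advance (at most 5% nulls)."""
    return rate <= limit


def null_rate_is_anomalous(history: List[float], latest: float, sigmas: float = 3.0) -> bool:
    """Unknown unknown: flag a rate that deviates sharply from its own history."""
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return latest != mu
    return abs(latest - mu) > sigmas * sd


# Hypothetical daily null rates for one column, followed by today's value.
past_null_rates = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.011]
todays_null_rate = 0.04

print(passes_null_rate_test(todays_null_rate))                    # the test still passes
print(null_rate_is_anomalous(past_null_rates, todays_null_rate))  # the monitor flags it
```

Here the null rate roughly quadruples yet stays under the hard-coded 5% limit, so the test stays green while the monitor surfaces a change worth investigating.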
Choozle CTO Adam Woods says data observability gives his team a deeper insight than manual testing or monitoring could provide.
“Without a [data observability tool], we might have monitoring coverage on final resulting tables, but that can hide a lot of issues. You might not see something pertaining to a small fraction of the tens of thousands campaigns in that table, but the [customer] running that campaign is going to see it. With [data observability] we are at a level where we don’t have to compromise. We can have alerting on all of our 3,500 tables.”
To summarize, data observability is different from, and more effective than, open-source data testing tools alone because it provides end-to-end coverage, is scalable, and has lineage that helps with impact analysis.
What’s next for data observability tools?
Personally, I couldn’t be more excited for this new frontier of data engineering. As data leaders increasingly invest in data reliability solutions that leverage data observability, I anticipate that this field will continue to intersect with some of the other major trends in data engineering, including data mesh, machine learning, cloud data architectures, and the platformatization of data products.