What is Data Observability? 5 Pillars You Need To Know
Editor’s Note: So much has happened since we first published this post and created the data observability category. We have updated this post to reflect this rapidly maturing space.
As the former VP of Customer Success Operations at Gainsight, I was responsible for leading a team that compiled a weekly report to our CEO outlining customer data and analytics.
Time and again, we’d deliver a report, only to be notified minutes later about issues with our data. It didn’t matter how strong our ETL pipelines were or how many times we reviewed our SQL: our data just wasn’t reliable.
Unfortunately, this problem wasn’t unique to Gainsight. After speaking with hundreds of data leaders about their biggest pain points, I learned that data downtime tops the list.
Data downtime — periods of time when data is partial, erroneous, missing, or otherwise inaccurate — only multiplies as data systems become increasingly complex, supporting an endless ecosystem of sources and consumers.
For data engineers and developers, data downtime means wasted time and resources; for data consumers, it erodes confidence in your decision making. Like me, the leaders I talked to couldn’t trust their data, and that was a serious problem.
Instead of putting together a holistic approach to address data downtime, teams often tackle data quality and lineage problems on an ad hoc basis.
Much in the same way DevOps applies observability to software, I thought it was time data teams applied the same blanket of diligence to their pipelines, so I began creating the category of data observability as a more holistic way to approach data quality.
In this post, I’ll cover:
- Data observability is as essential to DataOps as observability is to DevOps
- What is data observability? The five pillars
- The key features of data observability tools
- Data observability vs. testing
- Data observability vs. monitoring
- Data observability vs. data quality
- Data observability vs. data reliability engineering
- Data observability vs. data governance
- Signs you need a data observability platform
- The future of data observability
Data observability is as essential to DataOps as observability is to DevOps
Observability is no longer just for software engineering. With the rise of data downtime and the increasing complexity of the data stack, observability has emerged as a critical concern for data teams, too.
Development and operations (lovingly abbreviated to DevOps) teams have become an integral component of most engineering organizations. DevOps teams remove silos between software developers and IT, facilitating the seamless and reliable release of software to production.
As organizations grow and the underlying tech stacks powering them become more complicated (think: moving from a monolith to a microservice architecture), it’s important for DevOps teams to maintain a constant pulse on the health of their systems.
Observability, a more recent addition to the engineering lexicon, speaks to this need, and refers to the monitoring, tracking, and triaging of incidents to prevent downtime.
As a result of this industry-wide shift to distributed systems, observability engineering has emerged as a fast-growing engineering discipline. At its core, there are three pillars of observability data:
- Metrics refer to a numeric representation of data measured over time.
- Logs, a record of an event that took place at a given timestamp, also provide valuable context regarding when a specific event occurred.
- Traces represent causally related events in a distributed environment.
(For a more detailed description of these, I highly recommend reading Cindy Sridharan’s landmark post, Monitoring and Observability).
Taken together, these three pillars give DevOps teams valuable awareness and insights to predict future behavior, and in turn, trust their system to meet SLAs. Abstracted to your bottom line, reliable software means reliable products, which leads to happy users.
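The three pillars can be made concrete with a minimal sketch. The snippet below uses plain Python stand-ins rather than a real telemetry library; the field names and metric name are illustrative, not any particular tool's API.

```python
import time
import uuid

def emit_telemetry(operation):
    """Run an operation and emit toy versions of the three pillars."""
    trace_id = uuid.uuid4().hex  # trace: an ID tying causally related events together
    start = time.time()
    operation()
    duration_ms = (time.time() - start) * 1000

    # metric: a numeric measurement over time
    metric = {"name": "op.duration_ms", "value": duration_ms}
    # log: a record of an event at a given timestamp, with context
    log = {"ts": start, "trace_id": trace_id, "event": "operation_completed"}
    return metric, log, trace_id

metric, log, trace_id = emit_telemetry(lambda: sum(range(1000)))
```

In a distributed system, the shared `trace_id` is what lets you stitch logs from many services into one causal story.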
Just like DevOps takes a continuous integration and continuous delivery (CI/CD) approach to the development and operations of software, DataOps emphasizes a similar approach for how data engineering and data science teams can work together to add value to the business.
What is Data Observability? The five pillars
My data observability definition has not changed since I first coined it in 2019.
Data observability is an organization’s ability to fully understand the health of the data in their systems. It eliminates data downtime by applying best practices learned from DevOps to data pipeline observability.
Data observability tools use automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues. This leads to healthier pipelines, more productive teams, and happier customers.
The five pillars of data observability are freshness, distribution, volume, schema, and lineage. Together, these components provide valuable insight into the quality and reliability of your data. Let’s take a deeper dive.
- Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated. Freshness is particularly important when it comes to decision making; after all, stale data is basically synonymous with wasted time and money.
- Distribution: Distribution, in other words, a function of your data’s possible values, tells you if your data is within an accepted range. Data distribution gives you insight into whether or not your tables can be trusted based on what can be expected from your data.
- Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources. If 200 million rows suddenly turn into 5 million, you should know.
- Schema: Changes in the organization of your data, in other words, schema, often indicate broken data. Monitoring who makes changes to these tables and when is foundational to understanding the health of your data ecosystem.
- Lineage: When data breaks, the first question is always “where?” Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it. Good lineage also collects information about the data (also referred to as metadata) that speaks to governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.
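To make the pillars concrete, here is a minimal sketch of freshness, volume, and schema checks over hypothetical table snapshots. The snapshot fields (`last_updated`, `row_count`, `columns`) and the 50% volume threshold are illustrative assumptions, not any platform's actual metadata API.

```python
from datetime import datetime, timedelta

def check_pillars(snapshot, baseline, max_staleness=timedelta(hours=24)):
    """Compare a table snapshot against a baseline and report pillar violations."""
    issues = []
    # Freshness: has the table been updated recently enough?
    if datetime.utcnow() - snapshot["last_updated"] > max_staleness:
        issues.append("freshness: table is stale")
    # Volume: did the row count drop sharply versus the baseline?
    if snapshot["row_count"] < 0.5 * baseline["row_count"]:
        issues.append("volume: row count dropped more than 50%")
    # Schema: did the column set change since the baseline was recorded?
    if snapshot["columns"] != baseline["columns"]:
        issues.append("schema: column set changed")
    return issues
```

A real platform would learn these thresholds automatically and cover distribution and lineage as well; the point here is only that each pillar reduces to a concrete, checkable question.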
The Key Features of Data Observability Tools
Thanks to DevOps, we have an easy lens with which to view the importance of observability as applied to data. By surfacing data downtime incidents as soon as they arise, the five pillars provide the holistic data observability framework necessary for true end-to-end reliability.
As with traditional DevOps observability tools, the best data observability solutions don’t just monitor these pillars, but prevent bad data from entering your pipelines in the first place.
A great data observability platform has the following features:
- It connects to your existing stack quickly and seamlessly and does not require modifying your data pipelines, writing new code, or using a particular programming language. This allows quick time to value and maximum testing coverage without having to make substantial investments.
- It monitors your data at-rest and does not require extracting the data from where it is currently stored. This allows the data observability solution to be performant, scalable and cost-efficient. It also ensures that you meet the highest levels of security and compliance requirements.
- It requires minimal configuration and practically no threshold-setting. Data observability tools should use machine learning models to automatically learn your environment and your data. It uses anomaly detection techniques to let you know when things break. It minimizes false positives by taking into account not just individual metrics, but a holistic view of your data and the potential impact from any particular issue. You do not need to spend resources configuring and maintaining noisy rules within your data observability platform.
- It requires no prior mapping of what needs to be monitored and in what way. It helps you identify key resources, key dependencies and key invariants so that you get broad data observability with little effort.
- It provides rich context that enables rapid triage and troubleshooting, and effective communication with stakeholders impacted by data reliability issues. Data observability tools shouldn’t stop at “field X in table Y has values lower than Z today.”
- It prevents issues from happening in the first place by exposing rich information about data assets so that changes and modifications can be made responsibly and proactively.
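The "no threshold-setting" feature above boils down to learning what normal looks like from history. Here is an intentionally simple sketch using a z-score over past observations; real platforms use far richer models, and the three-sigma threshold is just an illustrative default.

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag a value as anomalous if it sits far outside the learned normal range."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # No observed variation: anything different from the mean is suspect.
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold
```

The key contrast with rule-based checks is that nobody had to hand-pick the acceptable range; it falls out of the data itself and adapts as the data changes.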
Data observability vs. testing
Similar to how software engineers use unit tests to identify buggy code before it’s pushed to production, data engineers often leverage tests to detect and prevent potential data quality issues from moving further downstream.
This approach was (mostly) fine until companies began ingesting so much data that testing alone couldn’t plausibly anticipate every point of failure.
I’ve encountered countless data teams that suffer consistent data quality issues despite a rigorous testing regime. It’s deflating and a bad use of your engineers’ time.
The reason even the best testing processes are insufficient is because there are two types of data quality issues: those you can predict (known unknowns) and those you can’t (unknown unknowns).
Some teams will have hundreds(!) of tests in place to cover most known unknowns, but they still don’t have an effective way to cover unknown unknowns.
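A "known unknown" test looks much like a unit test: each assertion encodes a failure mode someone anticipated in advance. The column names and rules below are hypothetical, purely to show the shape of this kind of check.

```python
def run_data_tests(rows):
    """Apply hand-written data quality rules and collect any failures."""
    failures = []
    for i, row in enumerate(rows):
        # Rule 1: user_id must never be null.
        if row.get("user_id") is None:
            failures.append(f"row {i}: user_id is null")
        # Rule 2: discount_pct must fall in a sane range.
        if not (0 <= row.get("discount_pct", 0) <= 100):
            failures.append(f"row {i}: discount_pct out of range")
    return failures
```

The limitation is structural: a rule only fires for the failure it was written to catch, so every unanticipated breakage sails through untested.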
Some examples of unknown unknowns covered by data observability include:
- A Looker dashboard or report that is not updating, and the stale data goes unnoticed for several months—until a business executive goes to access it for the end of the quarter and notices the data is wrong.
- A small change to your organization’s codebase that causes an API to stop collecting data that powers a critical field in your Tableau dashboard.
- An accidental change to your JSON schema that turns 50,000 rows into 500,000 overnight.
- An unintended change to your ETL, ELT, or reverse ETL pipeline that causes some tests not to run, leading to data quality issues that go unnoticed for a few days.
- A test that has been a part of your pipelines for years but has not been updated recently to reflect the current business logic.
In a Medium article, Vimeo Senior Data Engineer Gilboa Reif describes how using data observability and dimension monitors at scale helps address the unknown unknowns gap that open source and transformation tools leave open.
“For example, if the null percentage on a certain column is anomalous, this might be a proxy of a deeper issue that is more difficult to anticipate and test.”
Choozle CTO Adam Woods says data observability gives his team a deeper insight than manual testing or monitoring could provide.
“Without a [data observability tool], we might have monitoring coverage on final resulting tables, but that can hide a lot of issues. You might not see something pertaining to a small fraction of the tens of thousands campaigns in that table, but the [customer] running that campaign is going to see it. With [data observability] we are at a level where we don’t have to compromise. We can have alerting on all of our 3,500 tables.”
To summarize, data observability is different and more effective than testing because it provides end-to-end coverage, is scalable, and has lineage that helps with impact analysis.
Data observability vs. monitoring
Data monitoring and data observability have been used interchangeably for a long time, but they are two very different things.
Data observability enables monitoring (or data quality monitoring) by alerting teams when a data asset or data set looks different from what the established metrics or parameters say it should.
For example, data monitoring would issue an alert if a value falls outside an expected range, data hasn’t updated as expected, or 100 million rows suddenly turn into 1 million. Monitoring issues alerts based on pre-defined problems, representing data in aggregates and averages.
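Rule-based monitoring of this kind can be sketched in a few lines: alerts fire only on conditions someone pre-defined. The rule names and thresholds here are illustrative assumptions.

```python
# Each rule pairs a name with a predicate over table-level metrics.
RULES = [
    ("row_count_floor", lambda m: m["row_count"] >= 1_000_000),
    ("max_hours_since_update", lambda m: m["hours_since_update"] <= 24),
]

def evaluate_rules(metrics):
    """Return the names of every pre-defined rule the metrics violate."""
    return [name for name, ok in RULES if not ok(metrics)]
```

Notice what is missing: any failure mode not encoded in `RULES` goes undetected, which is exactly the gap described next.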
You still have the same unknown unknowns gap that arises with data testing. And, before you can set up monitoring for a data ecosystem, you need visibility into all of those data assets and attributes — that’s where data observability comes in.
Another way to think about it: imagine you were told there was a problem somewhere in your house that needed to be fixed. That’s not very helpful.
When problems do arise, data observability (and specifically visibility into schema and lineage) help swiftly answer pertinent questions about what data was impacted; what changes may have been made, when, and by whom; and which downstream consumers may be impacted.
Observability accelerates your data team’s journey from the what to the why. That’s helpful.
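The lineage-driven triage described above is, at its core, a graph traversal. Here is a toy sketch: edges map each asset to its direct downstream consumers, and a breadth-first walk finds everything a broken table could have impacted. The asset names are hypothetical.

```python
from collections import deque

# Toy lineage graph: asset -> direct downstream consumers.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue", "mart.retention"],
    "mart.revenue": ["dashboard.exec_kpis"],
}

def downstream_impact(asset, graph=LINEAGE):
    """Return every asset reachable downstream of the given one."""
    seen, queue = set(), deque([asset])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)
```

With this map in hand, an on-call engineer knows immediately which dashboards and teams to notify, instead of discovering the blast radius one angry stakeholder at a time.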
Data observability vs data quality
Just like data observability enables data monitoring, it enables data quality too. Data quality is often expressed in the six dimensions of accuracy, completeness, consistency, timeliness, validity, and uniqueness.
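Several of these dimensions reduce to simple column-level calculations. As an illustration (on a made-up list of values, not any standard library for data quality), here are completeness and uniqueness:

```python
def completeness(values):
    """Share of values that are present (non-null)."""
    return sum(v is not None for v in values) / len(values)

def uniqueness(values):
    """Share of non-null values that are distinct."""
    present = [v for v in values if v is not None]
    return len(set(present)) / len(present) if present else 0.0
```

Scores like these are useful internally, but as the next paragraph argues, they rarely survive contact with business stakeholders in fractional form.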
I’ve found that in the business world, the reality is data quality is a binary metric. Your CFO doesn’t come up to you and say, “the data was accurate but out of date so I’m considering it to be of average quality.”
For your data consumers either the data quality is good or it’s bad. Just like a SaaS solution, either it’s working or it’s not. That’s why we created the metric of data downtime.
Data observability is a DataOps process that takes into account the five key pillars of data health (freshness, distribution, volume, schema, and lineage) to increase data quality and reduce the amount of data downtime.
With a data observability solution in place, data teams can ensure they have high data quality.
Data observability vs data reliability engineering
Some data observability companies have started to describe themselves or their tools in the framework of data reliability engineering.
This makes sense as data observability borrows heavily from observability and other concepts of site reliability engineering (SRE). While different solutions or tools may have significant differences in features offered, there is no real difference between data observability and data reliability engineering.
Both terms are focused on the practice of ensuring healthy, high quality data across an organization.
Data observability vs data governance
Traditionally, data governance is defined as the process of maintaining the availability, usability, provenance, and security of data.
While data governance is widely accepted as a must-have component of a healthy data strategy, it’s hard to achieve in practice, particularly given the demands of the modern data stack.
While we’ve made great advancements in areas such as self-service analytics, cloud computing, and data visualization, we’re not there yet when it comes to data governance. Many companies continue to enforce data governance through manual, outdated, and ad hoc tooling such as data catalogs.
Data governance needs to be federated across domains, accessible, and grounded in a real-time understanding of your data.
Data observability tools play a critical role in establishing and maintaining data governance.
As previously mentioned, data observability refers to an organization’s ability to fully understand the health of the data in their system, and supplements data discovery by ensuring that the data you’re surfacing is trustworthy at all stages of its life cycle.
With data observability, you can monitor changes in the provenance, integrity, and availability of your organization’s data, leading to more collaborative teams and happier stakeholders.
The data discovery capabilities in data observability platforms can replace the need for a traditional data governance platform by providing a domain-specific, dynamic understanding of your data based on how it’s being ingested, stored, aggregated, and used by a set of specific consumers.
Governance standards and tooling are federated across these domains (allowing for greater accessibility and interoperability), and a real-time understanding of the data’s current (as opposed to ideal) state is readily available.
Signs you need a data observability platform
From speaking with hundreds of customers over the years, I have identified seven telltale signs that suggest your data team should prioritize data quality.
- Your data platform has recently migrated to the cloud
- Your data stack is scaling with more data sources, more tables, and more complexity
- Your data team is growing
- Your team is spending at least 30% of their time firefighting data quality issues
- Your team has more data consumers than it did a year ago
- Your company is moving to a self-service analytics model
- Data is a key part of the customer value proposition
The future of data observability
Data observability is a rapidly maturing but still evolving space. For example, as of this writing there is not a data observability Gartner Magic Quadrant.
However, multiple companies and technologies are identifying with the term data observability. There has been tremendous investor activity, and, most importantly, customer interest and values are at all-time highs as evidenced by our 100% renewal rate in 2021.
Data observability took center stage and was recognized as an indispensable component of the modern data stack at our first ever IMPACT Summit featuring leaders such as:
- Zhamak Dehghani, Founder of the Data Mesh
- Neha Narkhede, Creator of Apache Kafka & Co-founder, Confluent
- Amit Agarwal, Chief Product Officer, Datadog
- Reynold Xin, Co-founder and Chief Architect, Databricks
- Maxime Beauchemin, Creator of Apache Airflow & Co-founder of Preset
- DJ Patil, First U.S. Chief Data Scientist
- Bob Muglia, Former CEO, Snowflake
I see a bright future for data observability as data continues its march from dusty dashboards to the boardroom, machine learning models, operational systems, customer facing products, and more.
With data observability, data quality and data engineering are finally getting a seat at the table.
If you want to learn more, reach out to Barr Moses and book a time to speak with us in the form below.