Observability is no longer just for software engineering. With the rise of data downtime and the increasing complexity of the data stack, observability has emerged as a critical concern for data teams, too.
Development and Operations (lovingly referred to as DevOps) teams have become an integral component of most engineering organizations. DevOps teams remove silos between software developers and IT, facilitating the seamless and reliable release of software to production.
As organizations grow and the underlying tech stacks powering them become more complicated (think: moving from a monolith to a microservice architecture), it’s important for DevOps teams to maintain a constant pulse on the health of their systems. Observability, a more recent addition to the engineering lexicon, speaks to this need, and refers to the monitoring, tracking, and triaging of incidents to prevent downtime.
As a result of this industry-wide shift to distributed systems, observability engineering has emerged as a fast-growing engineering discipline. At its core, observability engineering is broken into three major pillars:
- Metrics refer to a numeric representation of data measured over time.
- Logs are records of events that took place at a given timestamp, and they provide valuable context about when a specific event occurred.
- Traces represent causally related events in a distributed environment.
(For a more detailed description of these, I highly recommend reading Cindy Sridharan’s landmark post, Monitoring and Observability).
Taken together, these three pillars give DevOps teams valuable insights to predict future behavior, and in turn, trust their system to meet SLAs. Abstracted to your bottom line, reliable software means reliable products, which leads to happy users.
Even with the best-in-class observability solutions on tap, however, no amount of fancy tooling or engineering jargon can make customers happy if your data isn’t reliable.
The rise of data downtime
As a VP of Customer Success Operations at Gainsight, I was responsible for leading a team that compiled a weekly report to our CEO outlining customer data and analytics. Time and again, we’d deliver a report, only to be notified minutes later about issues with our data. It didn’t matter how strong our pipelines were or how many times we reviewed our SQL: our data just wasn’t reliable.
Unfortunately, this problem wasn’t unique to Gainsight. After speaking with hundreds of data leaders about their biggest pain points, I learned that data downtime tops the list. Data downtime (periods of time when data is partial, erroneous, missing, or otherwise inaccurate) only multiplies as data systems become increasingly complex, supporting an endless ecosystem of sources and consumers.
For data engineers and developers, data downtime means wasted time and resources; for data consumers, it erodes confidence in your decision making. Like me, the leaders I talked to couldn’t trust their data, and that was a serious problem.
Introducing: Data Observability
Instead of putting together a holistic approach to address data downtime, teams often tackle data quality and lineage problems on an ad hoc basis. Much in the same way DevOps applies observability to software, I think it’s about time we leveraged this same blanket of diligence for data.
Data Observability, an organization’s ability to fully understand the health of the data in their system, eliminates data downtime by applying best practices of DevOps Observability to data pipelines. Like its DevOps counterpart, Data Observability uses automated monitoring, alerting, and triaging to identify and evaluate data quality and discoverability issues, leading to healthier pipelines, more productive teams, and happier customers.
To make it easy, I’ve broken down Data Observability into its own five pillars: freshness, distribution, volume, schema, and lineage. Together, these components provide valuable insight into the quality and reliability of your data.
- Freshness: Freshness seeks to understand how up-to-date your data tables are, as well as the cadence at which your tables are updated. Freshness is particularly important when it comes to decision making; after all, stale data is basically synonymous with wasted time and money.
- Distribution: Distribution, a function of your data’s possible values, tells you whether your data falls within an accepted range. It gives you insight into whether your tables can be trusted based on what can be expected from your data.
- Volume: Volume refers to the completeness of your data tables and offers insights on the health of your data sources. If 200 million rows suddenly turn into 5 million, you should know.
- Schema: Changes in the organization of your data, in other words, its schema, often indicate broken data. Monitoring who makes changes to these tables and when is foundational to understanding the health of your data ecosystem.
- Lineage: When data breaks, the first question is always “where?” Data lineage provides the answer by telling you which upstream sources and downstream ingestors were impacted, as well as which teams are generating the data and who is accessing it. Good lineage also collects information about the data (also referred to as metadata) that speaks to governance, business, and technical guidelines associated with specific data tables, serving as a single source of truth for all consumers.
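To make the pillars a little more concrete, here is a minimal Python sketch of what one check per pillar could look like. It is an illustration under stated assumptions, not how any particular tool works: every function name, table name, and threshold below is hypothetical, and the inputs (timestamps, row counts, column snapshots, dependency edges) are assumed to have already been pulled from your warehouse metadata.

```python
from collections import deque
from datetime import datetime, timedelta


def is_stale(last_updated, expected_cadence, now):
    """Freshness: True when the table has gone longer than its expected cadence without an update."""
    return now - last_updated > expected_cadence


def out_of_range_share(values, low, high):
    """Distribution: fraction of non-null values that fall outside [low, high]."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(1 for v in present if not low <= v <= high) / len(present)


def volume_dropped(previous_rows, current_rows, max_drop=0.5):
    """Volume: True when the row count fell by more than max_drop (a fraction, assumed here)."""
    if previous_rows == 0:
        return False
    return (previous_rows - current_rows) / previous_rows > max_drop


def schema_changes(old, new):
    """Schema: columns added, removed, or retyped between two {column: type} snapshots."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(c for c in old.keys() & new.keys() if old[c] != new[c]),
    }


def downstream_of(table, edges):
    """Lineage: every table reachable downstream of `table`, found by breadth-first search."""
    seen, queue = set(), deque([table])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical inputs, for illustration only.
stale = is_stale(datetime(2024, 1, 1), timedelta(hours=24), datetime(2024, 1, 3))
dropped = volume_dropped(200_000_000, 5_000_000)
changes = schema_changes(
    {"id": "int", "amount": "float"},
    {"id": "int", "amount": "text", "region": "text"},
)
impacted = downstream_of(
    "raw_orders",
    {"raw_orders": ["orders_clean"], "orders_clean": ["revenue_daily", "churn_model"]},
)
```

Note that the cadence, range, and drop thresholds here are set by hand purely to keep the sketch short; the lineage edges would normally be derived from query logs or pipeline metadata rather than written out manually.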
Unlocking Data Observability at your company
Thanks to DevOps, we have an easy lens with which to view the importance of observability as applied to data. By surfacing data downtime incidents as soon as they arise, the five pillars of Data Observability provide the holistic framework necessary for true end-to-end reliability.
As with traditional DevOps Observability tools, the best Data Observability solutions will not just monitor these pillars, but prevent bad data from entering your pipelines in the first place.
We believe that a great Data Observability solution has the following characteristics:
- It connects to your existing stack quickly and seamlessly and does not require modifying your pipelines, writing new code, or using a particular programming language. This allows quick time to value and maximum testing coverage without having to make substantial investments.
- It monitors your data at rest and does not require extracting the data from where it is currently stored. This allows the solution to be performant, scalable, and cost-efficient. It also ensures that you meet the highest levels of security and compliance requirements.
- It requires minimal configuration and practically no threshold-setting. It uses ML models to automatically learn your environment and your data. It uses anomaly detection techniques to let you know when things break. It minimizes false positives by taking into account not just individual metrics, but a holistic view of your data and the potential impact from any particular issue. You do not need to spend resources configuring and maintaining noisy rules.
- It requires no prior mapping of what needs to be monitored and in what way. It helps you identify key resources, key dependencies and key invariants so that you get broad observability with little effort.
- It provides rich context that enables rapid triage and troubleshooting, and effective communication with stakeholders impacted by data reliability issues. It doesn’t stop at “field X in table Y has values lower than Z today.”
- It prevents issues from happening in the first place by exposing rich information about data assets so that changes and modifications can be made responsibly and proactively.
Moreover, when issues do arise, these tools will alert your team before anyone else. Yes, even your CEO.
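As a rough illustration of what "practically no threshold-setting" can mean, the sketch below learns what normal looks like from a metric's own history instead of relying on a hand-written rule. It uses a plain z-score as a deliberately simple stand-in for the ML models mentioned above; the function name, the sample row counts, and the default cutoff of three standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag `latest` when it sits more than z_threshold sample standard
    deviations away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough history to estimate what "normal" is
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # history is perfectly flat, so any change stands out
    return abs(latest - mu) / sigma > z_threshold


# Hypothetical daily row counts, in millions, for a single table.
daily_rows = [200, 202, 198, 201, 199]
alert_on_drop = is_anomalous(daily_rows, 5)
alert_on_normal_day = is_anomalous(daily_rows, 201)
```

A real solution would account for seasonality, trends, and the downstream impact of an issue before alerting; the point of the sketch is only that the baseline comes from the data rather than from a manually maintained rule.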
If you want to learn more, reach out to Barr Moses.