Data Observability: How Blinkist Prevents Broken Data Pipelines at Scale with Monte Carlo
Companies spend upwards of $15 million annually tackling data downtime (periods when data is missing, broken, or otherwise erroneous), and over 88 percent of U.S. businesses have lost money as a result of data quality issues.
Fortunately, there’s hope in the next frontier of data engineering: observability. Here’s how the data engineering team at Blinkist, a book-summarizing subscription service, increases cost savings, collaboration, and productivity with data observability at scale.
With over 16 million users worldwide, Blinkist helps time-strapped readers fit learning into their lives through their ebook subscription service.
Gopi Krishnamurthy, Director of Engineering, leads the team responsible for data engineering, infrastructure, cloud center-of-excellence, growth, and monetization. For Blinkist, having trustworthy and reliable data is foundational to the success of their business.
The challenge: broken data pipelines impacting growth, user experience, and reliability
As a high-growth company, Blinkist leverages paid performance marketing to fuel customer acquisition. Their 2020 strategy—with an ambitious 40 percent growth target—included a significant investment in channels like Facebook and Google, which would auto-optimize campaigns based on behavioral data shared between the Blinkist app and the channels themselves.
Of course, as it did for so many companies in 2020, the COVID-19 pandemic changed everything. Historic data no longer reflected the reality of their audience’s daily lives, and real-time data became essential, not just for determining advertising spend, but for understanding how users were interacting with the Blinkist app and content across the web.
Any inaccuracies in this data could impact decision-making, from campaign spending to updating the product roadmap. It was crucial that no opportunities to innovate were missed, from adding new features to simplifying onboarding to testing new advertisements—because a campaign around “improving your commute” just wasn’t relevant anymore.
As C-level execs and campaign managers grew increasingly dependent on real-time insights to drive marketing strategy, budget spend, and ROI, Gopi and his team were struggling with data downtime—issues with data quality, dashboard update delays, and broken pipelines.
“Every Monday, we had executive calls,” said Gopi. “And almost every Monday, I was on this call trying to answer why we are not able to scale, what were the issues, how many problems we face in terms of tracking data…trying to explain the severity of the problem and trying to boost confidence with executive stakeholders.”
Gopi estimates his team was spending 50 percent of their working hours on data fire drills, resolving data downtime issues while trying to rebuild trust with the rest of the organization. It wasn’t sustainable. Something had to change.
The solution: data observability with Monte Carlo
So in the fall of 2020, Gopi and his team regrouped and refocused. They built a plan modeled on the thoughtful execution framework popularized by Spotify, setting a clear goal to build trust in data at their company.
“At the core of this framework is data reliability engineering—that we treat data reliability as a first-class citizen, the same way engineering teams in the last decade have started to treat DevOps and site reliability engineering,” said Gopi.
Foundational to achieving data reliability is a focus on data governance, data quality, and refactoring systems.
“As we shifted to try to bring in data reliability engineering principles, Monte Carlo played a key role for us to easily adopt and meet these three expectations in a short timeframe,” said Gopi.
Outcome: Faster data incident resolution through self-service tooling & clear data reliability SLAs
With no-code onboarding, Monte Carlo was up and running in less than two weeks. It gave the team immediate visibility into the health of their data pipelines and critical assets, and sped up their incident response times considerably.
“We could immediately see what was happening,” said Gopi. “Day-to-day, we could see if there was a broken pipeline, a table that was not updated, or a table that had changed its data model because something was added or deleted on the upstream.”
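The checks Gopi describes here (a table that was not updated, or a table whose data model changed upstream) can be illustrated with a minimal Python sketch. This is not Monte Carlo’s implementation or API. The function names `is_stale` and `schema_diff`, the freshness window passed in by the caller, and the `{column: type}` snapshot format are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def is_stale(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Return True when the table's last update is older than max_age.

    max_age is a hypothetical freshness window chosen by the caller.
    """
    return now - last_updated > max_age


def schema_diff(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column: type} snapshots (an assumed format) and list changes."""
    shared = set(previous) & set(current)
    return {
        # Columns present now but not in the earlier snapshot.
        "added": sorted(set(current) - set(previous)),
        # Columns that disappeared since the earlier snapshot.
        "removed": sorted(set(previous) - set(current)),
        # Columns that kept their name but changed type.
        "retyped": sorted(c for c in shared if previous[c] != current[c]),
    }
```

In this sketch, a table last updated 30 hours ago would be flagged as stale against a 24-hour window, and comparing yesterday’s snapshot with today’s would surface any column that was added, dropped, or retyped upstream.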
As Gopi and his team worked to rebuild broken trust along with broken pipelines, they partnered with company leaders to build a shared understanding of data reliability principles and set concrete data SLAs (service-level agreements).
Data stakeholders were also granted access to Monte Carlo reporting, increasing transparency about data health across the company.
“The self-service capabilities of data observability helped build back trust in data, as users were seeing us in action: going from a red alert to a blue ‘work-in-progress’ to ‘resolved’ in green,” said Gopi. “They knew who was accountable, they knew the teams were working on it, and everything became crystal clear.”
Outcome: Time savings of 120 hours per week through automated monitoring and alerting of critical data assets
Monte Carlo detects anomalies across the Blinkist data landscape, using machine learning algorithms to generate the thresholds and rules that govern data downtime alerting. This automated monitoring, which would have been impractical to develop in-house, saves Gopi’s team up to 20 hours per engineer each week, or a cumulative 120 hours per week across the team. That time can now be spent building their product or otherwise innovating.
“Especially given the timeframe that we were working with, a data observability platform is not something we could have built,” said Gopi. “This is basically the power of AI that runs behind Monte Carlo—to build this kind of tool, you’d need to have a lot of internal knowledge to build these business rules and create these alerts.”
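To make the idea of thresholds learned from data (rather than hand-written business rules) more concrete, here is a deliberately simple Python sketch. It swaps in a basic statistical rule, mean plus or minus `k` standard deviations over recent daily row counts, as a stand-in for the machine learning the article mentions. It is not Monte Carlo’s algorithm, and `learn_bounds`, `is_anomalous`, and the default `k = 3.0` are hypothetical choices.

```python
from statistics import mean, stdev


def learn_bounds(history: list[int], k: float = 3.0) -> tuple[float, float]:
    """Derive lower and upper alert bounds from historical row counts.

    Uses mean +/- k sample standard deviations; needs at least two data points.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value: float, bounds: tuple[float, float]) -> bool:
    """Return True when value falls outside the learned bounds."""
    low, high = bounds
    return value < low or value > high


# Hypothetical daily row counts for one table.
daily_rows = [1000, 1020, 980, 1010, 990]
bounds = learn_bounds(daily_rows)
```

With bounds derived this way, a day that loads only a few hundred rows would fall outside the range and raise an alert, while ordinary day-to-day variation would not. A production system would also need to account for seasonality and trends, which this sketch ignores.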
And thanks to the aforementioned self-serve reporting and data SLAs, data observability also helps stakeholders work more efficiently.
For example, when a channel manager notices a campaign is underperforming, they can easily access data reporting and see if data reliability SLAs have been met and data pipelines are working properly. If so, they can eliminate bad data as the culprit and look at other solutions, like changing advertising creatives or adjusting the target audience—without ever requesting time or effort from their colleagues on the data team.
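The check a channel manager performs here, confirming that data reliability SLAs were met before ruling out bad data, could look something like the following Python sketch. The article does not define Blinkist’s actual SLAs, so the `Incident` record, the `sla_met` function, and the 24-hour resolution window are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    """A hypothetical data incident record."""
    table: str
    detected_at: datetime
    resolved_at: Optional[datetime]  # None while the incident is still open


def sla_met(
    incidents: list[Incident],
    now: datetime,
    max_resolution: timedelta = timedelta(hours=24),  # assumed SLA window
) -> bool:
    """Return True if no incident has exceeded the resolution window.

    Open incidents are measured against the current time.
    """
    for incident in incidents:
        end = incident.resolved_at or now
        if end - incident.detected_at > max_resolution:
            return False
    return True
```

If `sla_met` returns True for the tables feeding a campaign dashboard, the manager can reasonably look past data quality and turn to creatives or targeting instead.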
Outcome: Increased revenue by preventing broken data pipelines and dashboards
As Blinkist was able to detect and resolve data downtime more rapidly, their marketing channels thrived, leading to increased revenue.
“If we were able to identify and resolve issues within 24 hours, Facebook or Google could auto-correct and never scale down campaigns,” Gopi said.
With more accurate analytics and newly restored trust in their data, Blinkist marketers are now able to make swift decisions to optimize their ad spend for better targeting and performance.
“The scale of growth that we’ve seen this year is overwhelming,” Gopi said. “Although the data teams can’t take full credit, I definitely think the things we were able to do—in terms of data observability and bringing transparency into data operations—improved how we target our audience and channels.”
The impact of data observability at Blinkist
Data observability has helped Blinkist increase revenue, save time, and rebuild trust and transparency in data throughout the organization. With broken data pipelines under control, their data engineers are focusing on innovation and solving core business problems—not firefighting.
Among other benefits of data observability, Monte Carlo has enabled Blinkist to:
- Increase revenue by reducing time to resolution for data fire drills, which keeps marketing spend allocated appropriately and restores trust in data for vital decision making
- Save up to 20 hours per week, per engineer, by eliminating the need to troubleshoot tedious data fire drills, for cumulative savings of about 120 hours per week
- Drive greater efficiency and collaboration by gaining end-to-end visibility into the health, usage patterns, and relevancy of data assets
“Monte Carlo made life easier by automating anomaly detection in terms of freshness, data volume, and data model changes,” said Gopi. “It’s quite helpful for us to act at the right time and to make sure that our data downtime is reduced—or even prevented.”
Special thanks to Gopi and the rest of the Blinkist team!