How Hotjar Reduced Data Infrastructure Costs by 3x with Monte Carlo

For Hotjar, a global product experience insights company, data powers a wide variety of use cases, from crafting the ideal marketing campaign to creating delightful product features. To power their data-driven decision-making approach, Hotjar leverages an extensive amount of third-party data from providers such as Facebook, LinkedIn, Salesforce, and Zuora.

Hotjar’s data engineering team supports over 180 stakeholders and their data needs, from deploying models and building pipelines to keeping tabs on data health.

To ensure that their data pipelines are reliable and trustworthy, Hotjar relied on dbt for testing and transforming their data before it entered their business intelligence layer. However, this approach led to frequent issues with alerting around pipeline delays. To close this gap and supplement their testing strategy, Hotjar’s data engineering team chose to implement Monte Carlo for end-to-end data observability, monitoring, and field-level lineage.

This decision came to the forefront recently when Monte Carlo alerted the Hotjar team to unprecedented levels of traffic into Segment, their customer data platform (CDP). It was a costly bout of data downtime that would otherwise have gone undetected by their tests alone.

What happened? 

The data downtime alert that started it all… Image courtesy of Monte Carlo.

On July 1, 2021, Hotjar received an alert from Monte Carlo showing an unusual amount of traffic sent from one of their marketing tools to Segment, and they knew something was wrong.

“It was about 400,000 [events] when normally we were getting 20,000…[it] almost blew up half of our monthly tracked user (MTU) usage,” said Pablo Recio, a data engineer at Hotjar. “Monte Carlo blew the alarm so we could start digging into the issue,” he added. “Without them, there’s no way we would have uncovered this data downtime.”
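A jump from roughly 20,000 to 400,000 events is the kind of volume change a simple baseline comparison can catch. The sketch below is only an illustration of that idea, not Monte Carlo's actual detection logic: the function name, the median baseline, and the `ratio_threshold` value are all assumptions made for this example.

```python
from statistics import median

def is_volume_anomaly(history, current, ratio_threshold=5.0):
    """Return True when `current` is far above or below the typical volume.

    `history` is a list of recent daily event counts. `ratio_threshold` is a
    hypothetical tuning knob chosen for this sketch, not a vendor setting.
    """
    if not history:
        return False  # no baseline yet, so nothing to compare against
    baseline = median(history)
    if baseline == 0:
        return current > 0  # any traffic is unusual if the norm is zero
    ratio = current / baseline
    # Flag both spikes and drops relative to the baseline.
    return ratio >= ratio_threshold or ratio <= 1 / ratio_threshold

# Illustrative numbers only: a steady ~20,000 events per day, then a spike.
recent_counts = [19_500, 20_300, 20_000, 21_100, 19_800]
print(is_volume_anomaly(recent_counts, 400_000))  # spike of roughly 20x the baseline
print(is_volume_anomaly(recent_counts, 20_500))   # close to the usual volume
```

A median is used here instead of a mean so that a single earlier outlier does not distort the baseline; a production system would likely also account for seasonality.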

After doing a bit of sleuthing with Incident IQ, Monte Carlo’s all-in-one data incident management and root cause analysis platform, Pablo and his team discovered that they were almost at 80% of their MTU capacity for Segment.

With Incident IQ, Hotjar’s data team could understand the root cause, upstream and downstream dependencies, and other critical contextual information about their Segment data incident. Image courtesy of Monte Carlo.

To find out what was causing the downtime, they used Monte Carlo’s end-to-end lineage to trace the upstream and downstream dependencies of the affected data, which let them assess the impact and identify the root cause. From there, the team could correct course and notify everyone who needed to know about the incident.
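Tracing upstream and downstream dependencies amounts to walking a graph of tables in both directions. The sketch below shows the general idea with a breadth-first traversal; the table names and the `LINEAGE` mapping are hypothetical and do not reflect Hotjar's warehouse or Monte Carlo's lineage API.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the tables listed as its value.
LINEAGE = {
    "marketing_tool_events": ["segment_tracks"],
    "segment_tracks": ["stg_events"],
    "stg_events": ["fct_user_activity", "fct_campaigns"],
    "fct_user_activity": ["dashboard_product_usage"],
    "fct_campaigns": [],
    "dashboard_product_usage": [],
}

def downstream(graph, start):
    """Return every node reachable from `start` (the impact radius)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

def upstream(graph, start):
    """Return every node that feeds `start` (root cause candidates)."""
    reversed_graph = {}
    for parent, children in graph.items():
        for child in children:
            reversed_graph.setdefault(child, []).append(parent)
    return downstream(reversed_graph, start)

print(sorted(downstream(LINEAGE, "segment_tracks")))
print(sorted(upstream(LINEAGE, "stg_events")))
```

The downstream set answers "who is affected and needs to be told", while the upstream set narrows down where the problem could have started.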

After Pablo and the team were alerted to the issue, they were able to contact Segment and work out a solution.

Given that Segment only notifies you when you’ve used 80 percent of your quota, their notification didn’t come until 8 days later. Meanwhile, Monte Carlo alerted them just hours after the error occurred. 

“That [was] quite late compared to when the issue actually happened,” said Pablo. By using Monte Carlo, the team was alerted to the issue and resolved it within two hours, rather than learning about it over a week later, after the damage was done.

Staying proactive with data observability 

This isn’t the only instance where Hotjar has benefited from data observability. They’ve found the platform to be critical in their mission to achieve trustworthy, reliable data for the entire company.

“In some instances, Monte Carlo has notified us of bugs in our product because they notified us when data wasn’t being refreshed,” added Pablo. 

He recalled a recent scenario in which a survey was sent out to users, and a change in the code prevented the survey from being displayed in email campaigns. Because of this, Hotjar stopped receiving data from that survey, and Monte Carlo immediately alerted the team to the issue. Pablo and his team reached out to the stakeholder who owned the survey and flagged the potential issue so that it could be solved as quickly as possible.
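The survey incident is an example of a freshness problem: a table that normally updates on a schedule stops receiving data. A minimal sketch of such a check follows, assuming the last load time and expected update interval are known; the function name and the `grace` multiplier are illustrative choices, not part of any Monte Carlo interface.

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_loaded_at, expected_interval, now=None, grace=1.5):
    """Return True when a table has gone too long without new data.

    `grace` is a hypothetical multiplier that tolerates small delays
    before a table is treated as stale.
    """
    now = now or datetime.now(timezone.utc)
    return now - last_loaded_at > expected_interval * grace

# Illustrative timestamps: survey responses usually arrive at least hourly.
check_time = datetime(2021, 7, 1, 12, 0, tzinfo=timezone.utc)
last_response = datetime(2021, 7, 1, 6, 0, tzinfo=timezone.utc)
print(is_stale(last_response, timedelta(hours=1), now=check_time))  # six hours of silence
```

Passing `now` explicitly keeps the check deterministic, which makes it easier to test than one that always reads the system clock.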

The best part? Hotjar’s data engineering team credits data observability with ensuring they stay proactive, as opposed to reactive, when it comes to building a truly data-driven culture at their company.

Now, when events start arriving more or less frequently than expected, the data engineering team can reach out to affected teams to understand the source of the problem and course correct appropriately.

“Monte Carlo gives us the power to know what’s going on with our data at any given point in time so we can ask the right questions when data downtime strikes, for instance ‘we think something’s wrong here, did you change anything, or is this expected?’”

When it came to keeping data infrastructure costs down, Hotjar’s decision to take the proactive approach to data trust by supplementing testing with end-to-end data observability made all the difference. 

Interested in learning more about how data observability can help your team avoid unintentional overages or resolve other data issues? Reach out to the Monte Carlo team for a demo.