How the GitLab Data Team Builds a Culture of Radical Transparency
GitLab has made a name for itself as a company that prioritizes radical transparency – in other words, building an open and collaborative culture, not just internally but with the broader technical community. We sat down with Rob Parker, Senior Director of Data & Analytics, to understand how he applies this concept to his work building and scaling GitLab’s data platform with collaboration, scalability, and trust in mind.
GitLab is an open-source software development platform, and the company is famous for its open-core ethos. The organization lives by the philosophy of radical transparency, and our recent conversation with Rob was no exception.
Rob was amazingly candid in sharing how his data team tackles complex problems at GitLab. We always appreciate an honest peek behind the curtain, so here are six of the key takeaways from our conversation.
Check out Rob’s full conversation on radical transparency in data engineering from our IMPACT 2022 conference, available on demand.
What radical transparency means at GitLab
Rob is the first to admit that radical transparency is not easy. It takes work to create and maintain—and at GitLab, radical transparency means sharing almost everything. Internally and externally, from organizational structures to first drafts to self-serve data, transparency is the name of the game.
Sharing the inner workings of GitLab with the world
Nearly all of the GitLab processes, frameworks, codebases, and other technical materials—along with team and organizational structures—are shared freely in their public handbook. There are over 2,000 pages of content that anyone can peruse right now on the inner workings of GitLab. And the data team’s work is no exception.
“We share our dbt code, our data models, our operational content, how we run our data program—all of it is available in our public-facing handbook,” said Rob.
And the public can weigh in, giving feedback on ideas or code, and even correcting typos. “The GitLab code base is open-core, so people can review, download, change, or modify the code base as well,” said Rob. “And you can also submit your request for changes back into GitLab. We actually have a community team within GitLab who is actively responsible for receiving, reviewing, and integrating those changes into our code base.”
Making works-in-progress available internally
Within GitLab, radical transparency also means sharing work well before a project is finalized and polished. That’s because most work is internally available to everyone else in the company—even in draft form. This leads to early, frequent feedback.
“If I’m working on a document or a slide, it may be incomplete, it may look bad, it may be in draft mode, but it’s available to everyone,” said Rob. “And that’s incredibly liberating and useful because it gives people an early view into what we’re working on, and opens up the doors to provide feedback and improve in a very iterative fashion.”
Making data radically transparent
These open lines of communication extend to the data team, and Rob acknowledges how unusual that can be.
“I think many of us have come from a situation where it’s hard to drive alignment around what the data programs are, or where a particular dashboard is, or on a particular analytics research project,” he said. “Well, having all of that content centralized and made available drives alignment.”
When Rob and his team field questions about tables or reporting, they can point their colleagues to the data team section of the handbook, where they can self-serve and find the content themselves. This visibility drives greater alignment between the data team and the rest of the company.
Giving back to the data community (and attracting the right talent)
Outside of GitLab, Rob’s team is actively involved in the larger data community—giving talks (like ours), taking part in LinkedIn conversations, and sharing ideas through their handbook. Their open-book approach stems from the company’s roots in open-core software.
One benefit of radical transparency? It makes attracting new talent much easier. Potential applicants already know what kinds of technologies and frameworks the team uses, as well as how roles and responsibilities are structured on the team.
“People know what they’re getting involved with on the data side when they’re coming into GitLab, because we’ve made so much of that content public,” said Rob.
How GitLab structures its data team & data stack
True to form, Rob gave us an inside look at how GitLab has architected its data tech stack and how the data team is organized to best meet its business needs.
The GitLab data stack
Using a cloud-based and modular data stack makes it easy for the data team to scale while serving distributed stakeholders. Their data platform includes:
- Pipelines – Fivetran and Stitch, plus custom code for more complicated data sources without a robust API
- Data warehouse – Snowflake
- Modeling – dbt
- Data visualization – Sisense
- Data observability – Monte Carlo
The GitLab data team structure—and its mission
Rob leads a central data infrastructure team that’s responsible for data tools and data models, and focuses primarily on data engineering, analytics engineering, data science, and some analytics.
Within the larger business, smaller data subteams are embedded across marketing, sales, customer success, and other departments. These teams are responsible for understanding their subject areas and leveraging the data infrastructure provided by the central infrastructure group.
Since the central data team isn’t working directly with internal or external data consumers, Rob places a significant emphasis on staying focused on the end needs of the GitLab customer. Rob refers to the external customer as the north star, and to internal stakeholders as business partners or data champions. This helps reinforce the concept that the end result for the customer is the ultimate goal of his team’s work—not merely meeting the specs of a request.
“It’s very easy as a team embedded in the business that’s not directly impacting or interfacing with our customers to lose sight of what our work means to our customers,” said Rob. “So we try to think about our customer’s customer. We’ve been able to move away from being the typical order taker into being a trusted business partner in the journey of building scalable and reliable solutions for the business.”
How does Rob know this customer-centric approach is working? He looks to the data, of course. Understanding how their work impacts the business and the end consumer has demonstrably improved data team members’ job satisfaction and retention rates.
How the data capability model prepped GitLab for IPO
Readying a company for an IPO is usually a hush-hush affair. But GitLab will tell you (nearly) everything they learned—including how the data team needed to prepare.
When Rob joined the company, his goal was to ready GitLab’s data program for going public (something he had direct experience with during his tenure at Docusign). To start, he used a data capability model to capture and improve data maturity.
Rob used existing data maturity models as a starting point, specifically calling out Gartner’s models and those published in Tom Davenport’s Competing on Analytics.
“We built out a list of capabilities, types of dashboards, and the types of analytics that we would want to achieve at each of those levels leading up to having trusted, reliable data models so we could be a public company,” said Rob. “So that our data models could stand up to audit, and so that we could support our CFO team and our quarterly earnings team with these trusted metrics.”
Using the data capability model has helped the GitLab team paint a picture of where they were and where they wanted to be, and helped them along the data development journey. Regardless of IPO status, Rob believes this approach can help every company—if it’s tailored to meet their needs.
“I think these models do have to be customized for every business, because not every data team is responsible for the same set of capabilities,” said Rob. “Some are responsible for public-facing metrics, while others aren’t. But the general framework of having a capability maturity model of some sort—measuring yourself versus that model and then holding yourself accountable to meet those different levels over time—I’ve found is incredibly valuable.”
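To make the idea of "measuring yourself versus that model" concrete, here is a minimal sketch in Python. It is not GitLab's actual model: the capability names, the numeric levels, and the `maturity_gaps` helper are all hypothetical, chosen only to show how a team might record a self-assessment and rank the gaps it still needs to close.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    current_level: int  # where the team assesses itself today
    target_level: int   # where it needs to be (hypothetical scale)


def maturity_gaps(capabilities):
    """Return (name, gap) pairs for capabilities below target, largest gap first."""
    gaps = [
        (c.name, c.target_level - c.current_level)
        for c in capabilities
        if c.current_level < c.target_level
    ]
    return sorted(gaps, key=lambda pair: pair[1], reverse=True)


# Hypothetical self-assessment; names and levels are illustrative only.
assessment = [
    Capability("trusted financial metrics", current_level=2, target_level=4),
    Capability("self-serve dashboards", current_level=3, target_level=4),
    Capability("data model audit trail", current_level=4, target_level=4),
]

for name, gap in maturity_gaps(assessment):
    print(f"{name}: {gap} level(s) below target")
```

Re-running the same assessment each quarter would give a simple way to hold a team accountable to the levels it set, which is the habit Rob describes.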
How GitLab approaches build vs buy
One of the hottest topics for data teams is whether to build their own infrastructure tooling or buy an off-the-shelf solution. In true GitLab fashion, their decision-making framework is documented and published for all to see in their handbook: a Proof of Value guide.
Rob’s team makes their build-or-buy decisions based on time to value, overall cost, and urgency of the need. More often than not, that means there’s no reason to build a tool in-house if commercially available solutions already have baked-in, production-ready capabilities.
“Generally, we’ll look to buy something because the time to analytics delivery or the time to that really critical business decision is vital at our stage,” said Rob.
Exceptions do crop up, including some custom data integrations into Snowflake, when extra flexibility is needed or no suitable solution exists in the market (yet).
How GitLab solves for data reliability
For a long time, GitLab used a homegrown system in an attempt to handle data reliability. With over 35 data pipelines running within a sizeable system with thousands of data consumers, Rob knew his team needed to deliver data that could be trusted.
“I think any data leader will tell you one of the things that keep them up at night is whether or not their data’s ready for the morning’s reports or the next day’s analytics,” said Rob.
Rob’s team built manual tests within dbt and throughout their data stack to try and check for accuracy, freshness, and other attributes of data quality. “But it’s very time-consuming to do that,” said Rob. “You can imagine onboarding a new data set that has a dozen tables, building row count tests, data volume tests—it’s just incredibly, incredibly time intensive.”
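For a sense of what hand-built checks like these involve, here is a minimal sketch of a row count test and a freshness test. It is an illustration under stated assumptions, not GitLab's code: the team wrote its tests in dbt against Snowflake, while this example uses Python's built-in `sqlite3` as a stand-in warehouse, and the `orders` table, the `loaded_at` column, and both helper functions are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def row_count_ok(conn, table, min_rows):
    """Volume check: the table holds at least `min_rows` rows."""
    # Table names cannot be bound as SQL parameters; pass trusted names only.
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return count >= min_rows


def is_fresh(conn, table, ts_column, max_age, now):
    """Freshness check: the newest row is no older than `max_age`."""
    (latest,) = conn.execute(f"SELECT MAX({ts_column}) FROM {table}").fetchone()
    if latest is None:  # an empty table is never fresh
        return False
    return now - datetime.fromisoformat(latest) <= max_age


# Hypothetical table loaded with three rows, the newest one hour old.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, loaded_at TEXT)")
now = datetime(2022, 11, 1, 6, 0, tzinfo=timezone.utc)
rows = [(i, (now - timedelta(hours=i)).isoformat()) for i in range(1, 4)]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

print(row_count_ok(conn, "orders", min_rows=3))
print(is_fresh(conn, "orders", "loaded_at", timedelta(hours=2), now))
```

Even in this toy form, every new table needs its own thresholds and timestamp column wired in by hand, which hints at why onboarding a dozen tables this way gets expensive.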
Switching from homegrown testing to data observability
So when Monte Carlo’s data observability platform came to the market—offering automated monitoring, alerting, and lineage—things changed.
“We took Monte Carlo through our Proof of Value process,” said Rob. “We wanted to make sure we were making the right decision because once you onboard a new piece of technology in your stack, it’s there for the long term.”
Monte Carlo met the GitLab requirements: it helped Rob’s team deliver more trusted, reliable business data. And Rob was ready to make the switch. “I’ve been looking for something like a Monte Carlo my entire career,” he said.
Automating data observability with Monte Carlo
Monte Carlo works within the GitLab data stack to automatically monitor for data incidents and route alerts to the teams best placed to troubleshoot and resolve them.
“We have quite a bit of data to manage, and it’s helped us measure the overall impact and reliability of the system,” said Rob. “We have embedded Monte Carlo as part of what we call our data daily triage process. Every single day someone gets to carry the daily beeper and be responsible for making sure that all of our data pipelines have been refreshed, and that our models are ready for the business. And Monte Carlo has become an integral part of that process.”
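A daily triage rotation like the one Rob describes could be sketched roughly as follows. This is an assumption-laden illustration rather than GitLab's process or any Monte Carlo API: the team roster, the pipeline names, and the `on_call` and `stale_pipelines` helpers are all hypothetical.

```python
from datetime import date


def on_call(team, day, start):
    """Pick the triager for `day` by cycling through `team` one day at a time."""
    if not team:
        raise ValueError("team must not be empty")
    return team[(day - start).days % len(team)]


def stale_pipelines(last_refreshed, day):
    """Return names of pipelines whose last refresh date is before `day`."""
    return sorted(
        name for name, refreshed in last_refreshed.items() if refreshed < day
    )


# Hypothetical roster and refresh log.
team = ["ana", "ben", "chi"]
start = date(2022, 11, 1)
today = date(2022, 11, 5)
status = {"crm": date(2022, 11, 5), "billing": date(2022, 11, 4)}

print(f"Triager: {on_call(team, today, start)}")
print(f"Needs attention: {stale_pipelines(status, today)}")
```

In practice an observability tool would supply the refresh status and raise the alerts automatically; the sketch only shows the shape of the daily hand-off.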
Increasing transparency around data quality
Monte Carlo has been adopted by the GitLab data engineering team, and is being rolled out to analytics engineers who are one step closer to the business. Rob’s goal is to make Monte Carlo available to data analysts as well. In his view, the more eyes on the state of their data, the better.
“Monte Carlo is giving us that capability to be incredibly transparent about the state and the health of our system,” said Rob. “We tackle those problems head-on. We don’t shy away from them. And we feel that if somebody cares about a problem, they’ll dig in, they’ll understand. We embrace transparency, and so we don’t have any concerns about sharing the state and the health of the dashboard with more people.”
What’s next for data at GitLab
Rob doesn’t shy away from speaking about what the future holds for his team.
First up, they’ll be working to build a customer 360 view, calling their version “Customer Journey Analytics.” Rob and his team want to provide visibility and health scores for all customers at each major stage of the GitLab lifecycle, from the first time they experience a product to the time they’re healthy, expanding, paying customers.
Next, they plan to onboard and integrate with a next-gen business intelligence tool. They also want to deepen their investment in machine learning and data science, with a focus on building out DataOps tools to help scale and productionalize ML models.
Finally, Rob says the team has plans to scale data observability. This includes making Monte Carlo available on the desktop, integrating more complicated tests and evaluations within Monte Carlo, and putting more robust processes around incident management into place.
“It’s going to be a very busy year for us,” said Rob.
If you’re interested in digging deeper into some of these concepts, here is a list of some of Rob’s favorite books on analytics and people management:
Analytics Strategy & Impact
- Competing on Analytics: The New Science of Winning – Thomas H. Davenport and Jeanne G. Harris
- Lean Analytics: Use Data to Build a Better Startup Faster – Alistair Croll and Benjamin Yoskovitz
- Creating a Data-Driven Organization: Practical Advice from the Trenches – Carl Anderson
- The Self-Service Data Roadmap – Sandeep Uttamchandani
- Getting in Front on Data: Who Does What – Thomas C. Redman
Team and People Management
- Peopleware: Productive Projects and Teams – Tom DeMarco and Tim Lister
- Managing the Unmanageable – Mickey Mantle and Ron Lichty
Make your own data reliability more transparent
GitLab’s commitment to radical transparency may be unique, but the problems they’re addressing around data reliability are downright universal.
Learn how GitLab and other data-first companies are solving for data trust at scale with Monte Carlo’s data observability platform.