The Quick and Dirty Guide to Building Your Data Platform
One of the most frequent questions we get from customers is “how do I build my data platform?”
For most organizations, building a data platform is no longer a nice-to-have but a need-to-have, with many companies distinguishing themselves from the competition based on their ability to glean actionable insights from their data.
Still, justifying the budget, resources, and timelines required to build a data platform from scratch is easier said than done. Every company is at a different stage in their data journey, making it harder to prioritize what parts of the platform to invest in first. Like any new solution, you need to 1) set expectations around what the product can and can’t deliver and 2) plan for both long-term and short-term ROI.
To make things a little easier, we’ve outlined the 6 must-have layers you need to include in your data platform and the order in which many of the best teams choose to implement them.
Introducing: the 6-layer data platform
Second to “how do I build my data platform?”, the most frequent question I get is “where do I start?”
It goes without saying that building a data platform isn’t a one-size-fits-all experience, and the layers (and tools) we discuss only scratch the surface of what’s available on today’s market. The “right” data stack will look vastly different for a 5,000-person e-commerce company than it will for a 200-person startup in the FinTech space, but there are a few core layers that all data platforms must have in one shape or another.
Keep in mind: just as you can’t build a house without a foundation, frame, and roof, at the end of the day, you can’t build a true data platform without each of these 6 layers. But how you choose to build your platform is entirely up to you.
Below, we share what the “basic” data platform looks like and list some hot tools in each space (you’re likely using several of them):
The first layer? Data ingestion.
Data can’t be processed, stored, transformed, and applied unless it’s been ingested first. Nearly every modern data platform needs to move data from one system to another. As data infrastructures become increasingly complex, data teams are left with the challenging task of ingesting structured and unstructured data from a wide variety of sources. This is often referred to as the extraction and loading stage of Extract Transform Load (ETL) and Extract Load Transform (ELT).
Below, we outline some popular tools in the space:
- Fivetran – A leading enterprise ETL solution that manages data delivery from the data source to the destination.
- Singer – An open source tool for moving data from any source to any destination.
- Stitch – A cloud-based open source platform that allows you to rapidly move data from any source to any destination.
- Airbyte – An open source platform that easily allows you to sync data from applications.
- Apache Kafka – An open source event streaming platform to handle streaming analytics and data ingestion.
Even with the prevalence of ingestion tools available on today’s market, some data teams choose to build custom code to ingest data from internal and external sources, and many organizations even build their own custom frameworks to handle this task.
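For teams that go the custom-code route, the extract and load steps can be sketched in a few lines of Python. This is a minimal illustration, not a production framework: the `RAW_CSV` export, the `raw_orders` table, and the in-memory SQLite database (standing in for a real warehouse) are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from an API or file drop.
RAW_CSV = """order_id,customer,amount
1,acme,19.99
2,globex,5.00
3,acme,42.50
"""


def extract(raw_text):
    """Parse the raw CSV export into a list of dict rows (the 'extract' step)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def load(rows, conn):
    """Land the rows untransformed in a raw table (the 'load' step)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()
    # Return the row count so the caller can confirm what landed.
    return conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
loaded = load(extract(RAW_CSV), conn)
print(f"loaded {loaded} rows into raw_orders")
```

Note that nothing is cleaned or cast here. In an ELT setup, the raw data lands first and the business logic is applied later, in the transformation layer.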
Orchestration and workflow automation, featuring such tools as Apache Airflow, Prefect, and Dagster, often folds into the ingestion layer, too. Orchestration takes ingestion a step further by taking siloed data, combining it with other sources, and making it available for analysis.
I would argue, though, that orchestration can be (and should be) woven into the platform after you handle the storage, processing, and business intelligence layers. You can’t orchestrate without an orchestra of functioning data, after all!
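At its core, orchestration means running tasks in an order that respects their dependencies. The sketch below illustrates only that idea, using Python's standard-library `graphlib` module (Python 3.9+). It is not the API of Airflow, Prefect, or Dagster, which add scheduling, retries, and monitoring on top, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "transform_sales": {"ingest_orders", "ingest_customers"},
    "refresh_dashboard": {"transform_sales"},
}


def run_order(dag):
    """Return one valid execution order in which every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)
```

The two ingestion tasks have no dependency on each other, so either may come first. What the ordering guarantees is that `transform_sales` runs after both, and `refresh_dashboard` runs last.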
Data Storage and Processing
After you build your ingestion layer, you need a place to store and process your data. With companies moving their data landscapes to the cloud, cloud-native data warehouses, data lakes, and even data lakehouses have taken over the market, offering more accessible and affordable options for storing data relative to many on-prem solutions.
Whether you choose to go with a data warehouse, data lake, or some combination of both depends entirely on the needs of your business. Recently, there’s been a lot of discussion around whether to go with open source or closed source solutions (the dialogue between Snowflake and Databricks’ marketing teams really brings this to light) when it comes to building your data stack.
Regardless of what side you take, you quite literally cannot build a modern data platform without investing in cloud storage and compute.
Below, we highlight some leading options in today’s cloud warehouse, lake, or [insert your own variation here] landscape:
- Snowflake – The original cloud data warehouse, Snowflake provides a flexible payment structure for data teams, as users pay separate fees for computing and storing data.
- Google BigQuery – Google’s cloud warehouse, BigQuery, provides a serverless architecture that allows for quick querying due to parallel processing, as well as separate storage and compute for scalable processing and memory.
- Amazon Redshift – Amazon Redshift, one of the most widely used options, sits on top of Amazon Web Services (AWS) and easily integrates with other data tools in the space.
- Firebolt – A SQL-based cloud data warehouse that claims its performance is up to 182 times faster than other options, as the warehouse handles data in a lighter way thanks to new techniques for compression and data parsing.
- Microsoft Azure – Microsoft’s cloud computing entrant in this list, common among teams that rely on heavy Windows integrations.
- Amazon S3 – An object storage service for structured and unstructured data, S3 gives you the storage foundation to build a data lake from scratch.
- Databricks – Databricks, the Apache Spark-as-a-service platform, has pioneered the data lakehouse, giving users the option to leverage both structured and unstructured data while offering the low-cost storage features of a data lake.
- Dremio – Dremio’s data lake engine provides analysts, data scientists, and data engineers with an integrated, self-service interface for data lakes.
Data Transformation and Modeling
Data transformation and modeling are often used interchangeably, but they are two very different processes. When you transform your data, you are taking raw data and cleaning it up with business logic to get the data ready for analysis and reporting. When you model data, you are creating a visual representation of data for storage in a data warehouse.
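To make the transformation step concrete, here is a minimal sketch that applies business logic to raw rows using SQL. The `raw_orders` and `customer_revenue` tables, the sample rows, and the cleaning rules (trim and lowercase customer names, cast amounts to numbers, skip orders with no amount) are assumptions for illustration, and an in-memory SQLite database stands in for your warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection

# Hypothetical raw data as it might land after ingestion: text columns,
# inconsistent casing and whitespace, and a missing amount.
conn.executescript("""
CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT);
INSERT INTO raw_orders VALUES
  ('1', ' Acme ', '19.99'),
  ('2', 'globex', '5.00'),
  ('3', 'ACME',   '42.50'),
  ('4', 'globex', NULL);
""")

# Transform: normalize customer names, cast amounts to numbers,
# drop orders without an amount, and aggregate revenue per customer.
conn.executescript("""
CREATE TABLE customer_revenue AS
SELECT LOWER(TRIM(customer))               AS customer,
       ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
FROM raw_orders
WHERE amount IS NOT NULL
GROUP BY LOWER(TRIM(customer));
""")

revenue = dict(conn.execute("SELECT customer, revenue FROM customer_revenue"))
print(revenue)
```

The raw table is left untouched, and the cleaned, analysis-ready table is built alongside it. That makes it possible to rerun the transformation whenever the business logic changes.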
Below, we share a list of common tools that allow data engineers to transform and model their data:
- dbt – Short for data build tool, dbt is the open source leader for transforming data once it’s loaded into your warehouse.
- Dataform – Now part of the Google Cloud, Dataform allows you to transform raw data from your warehouse into something usable by BI and analytics tools.
- SQL Server Integration Services (SSIS) – Built by Microsoft, SSIS allows your business to extract data from a wide variety of sources, transform it, and then load it into your destination of choice.
- Custom Python code and Apache Airflow – Before the rise of tools like dbt and Dataform, data engineers commonly wrote their transformations in pure Python. While it might be tempting to continue using custom code to transform your data, it does increase the chances of errors being made as the code is not easily replicable and must be rewritten every time a process takes place.
The data transformation and modeling layer turns data into something a little more useful, readying it for the next stage in its journey: analytics.
Business Intelligence (BI) and Analytics
The data you have collected, transformed, and stored is no good to your business if your employees can’t use it.
If the data platform was a book, the BI and analytics layer would be the cover, replete with an engaging title, visuals, and summary of what the data is actually trying to tell you. In fact, this layer is often what end-users think of when they picture a data platform, and for good reason: it makes data actionable and intelligent, and without it, your data lacks meaning.
Tableau is a leading business intelligence tool that gives data analysts and scientists the capability to build dashboards and other visualizations that power decision making. Image courtesy of Tableau
Below, we outline some popular BI solutions among top data teams:
- Looker – A BI platform that is optimized for big data and allows members of your team to easily collaborate on building reports and dashboards.
- Tableau – Often referred to as a leader in the BI industry, it has an easy-to-use interface.
- Mode – A collaborative data science platform that incorporates SQL, R, Python, and visual analytics in one single UI.
- Power BI – A Microsoft-based tool that easily integrates with Excel and provides self-service analytics for everyone on your team.
This list is by no means exhaustive, but it will get you started on your search for the right BI layer for your stack.
Data Observability
With data pipelines becoming increasingly complex and organizations relying on data to drive decision-making, the need for trustworthy, reliable data at every stage (ingestion, storage, processing, transformation, and analysis) has never been higher. Simply put, organizations can no longer afford for data to be down (i.e., partial, inaccurate, missing, or erroneous).
By applying the same principles of application observability and infrastructure design to our data platforms, data teams can ensure data is usable and actionable. In our opinion, it’s often worse to make decisions based on bad data than to have no data at all.
Your data observability layer must be able to monitor and alert for the following pillars of observability:
- Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
- Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
- Volume: has all the data arrived?
- Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
- Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision-making?
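The first four pillars can be expressed as simple checks. The sketch below is a hand-rolled illustration, not how any particular observability product works: the function names, thresholds, and sample values are all assumptions, and a real solution would learn thresholds from historical metadata rather than hard-code them.

```python
from datetime import datetime, timedelta, timezone


def check_freshness(last_updated, now, max_age):
    """Freshness: True if the table was updated within the allowed window."""
    return now - last_updated <= max_age


def check_volume(row_count, expected, tolerance=0.5):
    """Volume: True if the row count is within +/- tolerance of the expected count."""
    return abs(row_count - expected) <= tolerance * expected


def check_schema(current_columns, expected_columns):
    """Schema: report which columns were added or removed versus expectations."""
    current, expected = set(current_columns), set(expected_columns)
    return {"added": sorted(current - expected), "removed": sorted(expected - current)}


def check_distribution(values, low, high, max_null_rate=0.0):
    """Distribution: True if non-null values are in range and nulls are rare enough."""
    nulls = sum(v is None for v in values)
    null_rate = nulls / len(values) if values else 0.0
    in_range = all(low <= v <= high for v in values if v is not None)
    return in_range and null_rate <= max_null_rate


# Hypothetical metadata for one table.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
results = {
    "fresh": check_freshness(
        datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc), now, timedelta(hours=24)
    ),
    "volume_ok": check_volume(row_count=400, expected=1000),
    "schema_diff": check_schema(
        ["id", "amount", "currency"], ["id", "amount", "customer"]
    ),
    "distribution_ok": check_distribution(
        [10.0, 25.5, None, -3.0], low=0, high=1000, max_null_rate=0.5
    ),
}
print(results)
```

In this example every check flags a problem: the table is 30 hours old, fewer than half the expected rows arrived, the `customer` column was replaced by `currency`, and one value is negative.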
An effective, proactive data observability solution will connect to your existing stack quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies. Additionally, it will automatically monitor your data-at-rest without requiring the extraction of data from your data store. This approach ensures that you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.
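Tracking downstream dependencies comes down to walking a lineage graph. Here is a minimal sketch, assuming lineage is already available as a mapping from each asset to the assets that read from it. The asset names are hypothetical, and a real tool would derive these edges automatically (for example, from query logs) rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["customer_revenue"],
    "stg_customers": ["customer_revenue"],
    "customer_revenue": ["revenue_dashboard", "churn_model"],
}


def downstream(asset, lineage):
    """Return every asset reachable downstream of `asset` (breadth-first walk)."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


impacted = downstream("raw_orders", LINEAGE)
print(impacted)
```

If `raw_orders` breaks, this tells you which staging tables, models, and dashboards to warn people about before they make decisions on bad data.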
Data Discovery
When building a data platform, most leaders task themselves with choosing (or building) a data catalog, and in our opinion, this approach is no longer sufficient.
Don’t get me wrong: data catalogs are important, and modern data teams need a reliable, scalable way to document and understand critical data assets. But as data becomes increasingly complex and real-time, the processes and technologies underlying this layer of the platform need to evolve, too.
Where many traditional data catalogs fall short (i.e., often manual, poor scalability, lack of support for unstructured data, etc.), data discovery picks up the slack. If data catalogs are a map, data discovery is your smartphone’s navigation system, constantly being updated and refined with the latest insights and information.
At a bare minimum, data discovery should address the following needs:
- Self-service discovery and automation: Data teams should be able to easily leverage their data catalog without a dedicated support team. Self-service, automation, and workflow orchestration for your data tooling remove silos between stages of the data pipeline and, in the process, make it easier to understand and access data. Greater accessibility naturally leads to increased data adoption, reducing the load for your data engineering team.
- Scalability as data evolves: As companies ingest more and more data and unstructured data becomes the norm, the ability to scale to meet these demands will be critical for the success of your data initiatives. Data discovery leverages machine learning to gain a bird’s eye view of your data assets as they scale, ensuring that your understanding adapts as your data evolves. This way, data consumers are set up to make more intelligent and informed decisions instead of relying on outdated documentation or worse – gut-based decision making.
- Real-time visibility into data health: Unlike a traditional data catalog, data discovery provides real-time visibility into the data’s current state, as opposed to its “cataloged” or ideal state. Since discovery encompasses how your data is being ingested, stored, aggregated, and used by consumers, you can glean insights such as which data sets are outdated and can be deprecated, whether a given data set is production-quality, or when a given table was last updated.
- Support for governance and warehouse/lake optimization: From a governance perspective, querying and processing data in the lake often occurs using a variety of tools and technologies (Spark on Databricks for this, Presto on EMR for that, etc.), and as a result, there often isn’t a single, reliable source of truth for reads and writes (like a warehouse provides). A proper data discovery tool can serve as that central source of truth.
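As one example of the real-time visibility described above, the sketch below flags tables that look safe to deprecate because nobody has touched them recently. The metadata fields, table names, and the 180-day idle threshold are assumptions for illustration. A discovery tool would collect this metadata automatically instead of relying on a hard-coded list.

```python
from datetime import date

# Hypothetical metadata a discovery tool might collect for each table.
TABLES = [
    {"name": "customer_revenue", "last_updated": date(2024, 3, 1),
     "last_queried": date(2024, 3, 2)},
    {"name": "legacy_sales_2019", "last_updated": date(2021, 6, 30),
     "last_queried": date(2022, 1, 15)},
    {"name": "tmp_export", "last_updated": date(2024, 2, 27),
     "last_queried": None},
]


def deprecation_candidates(tables, today, max_idle_days=180):
    """Return names of tables with no query (or, if never queried, no update) recently."""
    candidates = []
    for table in tables:
        # Fall back to the last update time for tables that were never queried.
        last_used = table["last_queried"] or table["last_updated"]
        if (today - last_used).days > max_idle_days:
            candidates.append(table["name"])
    return candidates


stale = deprecation_candidates(TABLES, today=date(2024, 3, 10))
print(stale)
```

Because the check runs against current usage rather than static documentation, the answer changes as soon as the way people use the data changes.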
Data discovery empowers data teams to trust that their assumptions about data match reality, enabling dynamic discovery and a high degree of reliability across your data infrastructure, regardless of domain.
Build or buy your 6-layer data platform? It depends.
Building a data platform is not an easy task, and there is a lot to consider along the way. One of the biggest challenges our customers face is whether they should just build certain layers in-house, invest in SaaS solutions, or explore the wide world of open source.
Our answer? Unless you’re Airbnb, Netflix, or Uber, you generally need to include all three.
There are pros and cons to each of these solutions, but your decision will depend on many factors, including but not limited to:
- The size of your data team. Data engineers and analysts already have enough on their plates, and requiring them to build an in-house tool might cost more time and money than you think. Simply put, lean data teams do not have the time to get new team members up to speed with in-house tools, let alone build them. Investing in easily configurable, automated, or popular solutions (i.e., open source or low-code/no-code SaaS) is becoming increasingly common among non-Uber/Airbnb/Netflix data teams. Diogo Ribeiro, Vice President of Analytics at ThousandEyes, a leading cloud intelligence platform, summarized this tradeoff better than we ever could: according to him, in-house tools are worth it when your data engineering team has the bandwidth to build applications on top of your data. However, if data engineers are spending most of their time building and maintaining data pipelines, it probably makes more sense to buy solutions that decrease their load and free them up for more interesting work.
- The amount of data your organization stores and processes. When choosing a solution, it is important to select one that will scale with your business. Chances are, it doesn’t make sense for a lone wolf data analyst at a 20-person company to go with a $10K per year transformation solution if all they need is a few lines of code to do the job.
- Your data team’s budget. If your team is working with a limited budget but many hands, then open source options might be a good fit for you. However, keep in mind you are typically on your own when it comes to setting up and implementing open source tools across your data stack, frequently relying on other members of the community or the project creators themselves to build out and maintain features. When you take into account that only about 2 percent of projects see growth after their first few years, you have to be careful with what you fork.
Regardless of which path you choose, building out these layers — in the right order — will give you the foundation to grow and scale, and most importantly, deliver insights and products your company can trust.
After all, sometimes the simplest way is the best way.
And if you’re interested in learning more about Data Observability, reach out to the rest of the Monte Carlo team.