What is a Data Platform? And How to Build One
For most organizations, building a data platform is no longer a nice-to-have, but a necessity. Companies distinguish themselves based on their ability to glean actionable insights from their data to improve the customer experience, increase revenue, or even define their brand.
However, there isn’t yet a blueprint that defines what an organization’s data platform should contain and when to invest in each layer.
While every organization’s data platform approach will vary based on the industry and the size of their company, this quick and dirty guide will help lay out that blueprint for how you can build a modern data platform.
In this post, I’ll cover:
- What is a data platform?
- The six must-have layers of a modern data platform
  - Data Storage and Processing
  - Data Ingestion
  - Data Transformation and Modeling
  - Business Intelligence (BI) and Analytics
  - Data Observability
  - Data Discovery
- Data platform vs. customer data platform
- Build or buy your 6-layer data platform? It depends.
What is a data platform?
In the past, data platforms were viewed as a means to an end versus the final product built by your data team. Now, companies are taking a page from software engineering teams and beginning to treat data platforms like production-grade software, dedicating valuable team resources to maintaining and optimizing them.
A data platform is a central repository and processing house for all of an organization’s data. A data platform handles the collection, cleansing, transformation, and application of data to generate business insights. Data-first companies have embraced data platforms as an effective way to aggregate, operationalize, and democratize data at scale across the organization.
To make things a little easier, I’ve outlined the six must-have layers you need to include in your data platform and the order in which many of the best teams choose to implement them.
The six must-have layers of a modern data platform
Second to “how do I build my data platform?”, the most frequent data platform question I get from customers is “where do I start?”
The “right” data stack will look vastly different for a 5,000-person e-commerce company than it will for a 200-person startup in the FinTech space, but there are a few core layers that all data platforms must have in one shape or another.
Keep in mind: just as you can’t build a house without a foundation, frame, and roof, at the end of the day, you can’t build a true data platform without each of these 6 layers. Below, we share what the “basic” data platform looks like and list some hot tools in each space (you’re likely using several of them):
Data Storage and Processing
The first layer? Data storage and processing, since you need a place to store your data and process it before it is transformed and sent off for analysis. This layer becomes especially important when you start to deal with large amounts of data, hold that data for long periods of time, and need it to be readily available for analysis.
With companies moving their data platforms to the cloud, cloud-native data warehouses, data lakes, and even data lakehouses have taken over the market, offering more accessible and affordable options for storing data relative to many on-premises solutions.
Whether you choose to go with a data warehouse, data lake or some combination of both is entirely up to the needs of your business.
Recently, there’s been a lot of discussion around whether to go with open source or closed source solutions (the dialogue between Snowflake and Databricks’ marketing teams really brings this to light) when it comes to building your data platform.
Regardless of which side you take, you quite literally cannot build a modern data platform without investing in cloud storage and compute.
Below, we highlight some leading options in today’s cloud warehouse, lake, or [insert your own variation here] landscape:
- Snowflake – The original cloud data warehouse, Snowflake provides a flexible payment structure for data teams, as users pay separate fees for computing and storing data.
- Google BigQuery – Google’s cloud warehouse, BigQuery, provides a serverless architecture that allows for quick querying due to parallel processing, as well as separate storage and compute for scalable processing and memory.
- Amazon Redshift – Amazon Redshift, one of the most widely used options, sits on top of Amazon Web Services (AWS) and easily integrates with other data tools in the space.
- Firebolt – A SQL-based cloud data warehouse that claims its performance is up to 182 times faster than other options, as the warehouse handles data in a lighter way thanks to new techniques for compression and data parsing.
- Microsoft Azure – Microsoft’s cloud computing entrant on this list, common among teams that rely on heavy Windows integrations.
- Amazon S3 – An object storage service for structured and unstructured data, S3 gives you the storage foundation to build a data lake from scratch.
- Databricks – Databricks, the Apache Spark-as-a-service platform, has pioneered the data lakehouse, giving users the option to leverage both structured and unstructured data while offering the low-cost storage features of a data lake.
- Dremio – Dremio’s data lake engine provides analysts, data scientists, and data engineers with an integrated, self-service interface for data lakes.
Data Ingestion
As is the case for nearly any modern data platform, you will need to ingest data from one system into another.
As data infrastructures become increasingly complex, data teams are left with the challenging task of ingesting structured and unstructured data from a wide variety of sources. This is often referred to as the extraction and loading stage of Extract Transform Load (ETL) and Extract Load Transform (ELT).
Below, we outline some popular tools in the space:
- Fivetran – A leading enterprise ETL solution that manages data delivery from the data source to the destination.
- Singer – An open source tool for moving data from a source to a destination.
- Stitch – A cloud-based open source platform that allows you to rapidly move data from a source to a destination.
- Airbyte – An open source platform that easily allows you to sync data from applications.
- Apache Kafka – An open source event streaming platform to handle streaming analytics and data ingestion.
Even with the prevalence of ingestion tools available on today’s market, some data teams choose to build custom code to ingest data from internal and external sources, and many organizations even build their own custom frameworks to handle this task.
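To make the custom-code route a little more concrete, here is a minimal sketch of the extract and load steps in plain Python. Everything in it is an assumption for illustration: the `SOURCE_CSV` export, the `raw_orders` table, and an in-memory SQLite database standing in for your warehouse. A real ingestion job would also need retries, incremental loads, and schema handling.

```python
import csv
import io
import sqlite3

# Hypothetical source export; in practice this would come from an API or file drop.
SOURCE_CSV = """order_id,customer,amount
1,acme,19.99
2,globex,5.00
3,acme,42.50
"""

def extract(raw_text):
    """Parse the source export into a list of dicts (the 'extract' step)."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def load(conn, rows):
    """Land rows untouched in a raw table (the 'load' step); returns the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO raw_orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]

conn = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse
loaded = load(conn, extract(SOURCE_CSV))
print(loaded)  # 3
```

Note that the rows land as raw text on purpose: in an ELT setup, type casting and business logic wait for the transformation layer.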
Orchestration and workflow automation, featuring such tools as Apache Airflow, Prefect, and Dagster, often folds into the ingestion layer, too. Orchestration takes ingestion a step further by taking siloed data, combining it with other sources, and making it available for analysis.
I would argue, though, that orchestration can be (and should be) woven into the data platform after you handle the storage, processing, and business intelligence layers. You can’t orchestrate without an orchestra of queryable data, after all!
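At its core, orchestration means running tasks in an order that respects their dependencies. The sketch below shows only that idea, using Python’s standard-library `graphlib` (Python 3.9+). The task names are hypothetical, and this is not how Airflow, Prefect, or Dagster are implemented; those tools add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
PIPELINE = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "join_orders_customers": {"ingest_orders", "ingest_customers"},
    "refresh_dashboard": {"join_orders_customers"},
}

def run_order(dag):
    """Return one valid execution order where every task follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())

order = run_order(PIPELINE)
print(order)
```

Both ingestion tasks always come before the join, and the dashboard refresh always comes last. The relative order of the two ingestion tasks is not guaranteed, since neither depends on the other.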
Data Transformation and Modeling
Data transformation and modeling are often used interchangeably, but they are two very different processes.
When you transform your data, you are taking raw data and cleaning it up with business logic to get the data ready for analysis and reporting. When you model data, you are creating a visual representation of data for storage in a data warehouse.
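As a rough illustration of the transformation half, here is a small sketch that takes raw records and applies business logic to get them ready for reporting. The records, field names, and the “only completed orders count as revenue” rule are all assumptions made up for this example.

```python
from decimal import Decimal

# Hypothetical raw records as they might land from ingestion.
RAW_ORDERS = [
    {"order_id": "1", "customer": " Acme ", "amount": "19.99", "status": "complete"},
    {"order_id": "2", "customer": "globex", "amount": "5.00", "status": "cancelled"},
    {"order_id": "3", "customer": "ACME", "amount": "42.50", "status": "complete"},
]

def transform(rows):
    """Clean types and names, then keep only completed orders (an assumed business rule)."""
    cleaned = []
    for row in rows:
        if row["status"] != "complete":
            continue
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": Decimal(row["amount"]),
        })
    return cleaned

def revenue_by_customer(rows):
    """Aggregate cleaned orders into a simple analysis-ready summary."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], Decimal("0")) + row["amount"]
    return totals

summary = revenue_by_customer(transform(RAW_ORDERS))
print(summary)  # {'acme': Decimal('62.49')}
```

Using `Decimal` rather than `float` keeps currency math exact, which is one of those small decisions that transformation tools usually push into your warehouse’s numeric types instead.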
Below, we share a list of common data transformation and modeling tools that data engineers rely on:
- dbt – Short for data build tool, dbt is the open source leader for transforming data once it’s loaded into your warehouse.
- Dataform – Now part of the Google Cloud, Dataform allows you to transform raw data from your warehouse into something usable by BI and analytics tools.
- SQL Server Integration Services (SSIS) – Built by Microsoft, SSIS allows your business to extract and transform data from a wide variety of sources, which you can then load into your destination of choice.
- Custom Python code and Apache Airflow – Before the rise of tools like dbt and Dataform, data engineers commonly wrote their transformations in pure Python. While it might be tempting to continue using custom code to transform your data, it does increase the chances of errors being made as the code is not easily replicable and must be rewritten every time a process takes place.
The data transformation and modeling layer turns data into something a little more useful, readying it for the next stage in its journey: analytics.
Business Intelligence (BI) and Analytics
The data you have collected, transformed, and stored serves your business no good if your employees can’t use it.
If the data platform was a book, the BI and analytics layer would be the cover, replete with an engaging title, visuals, and summary of what the data is actually trying to tell you. In fact, this layer is often what end-users think of when they picture a data platform, and for good reason: it makes data actionable and intelligent, and without it, your data lacks meaning.
Tableau is a leading business intelligence tool that gives data analysts and scientists the capability to build dashboards and other visualizations that power decision making. Image courtesy of Tableau
Below, we outline some popular BI solutions among top data teams:
- Looker – A BI platform that is optimized for big data and allows members of your team to easily collaborate on building reports and dashboards.
- Tableau – Often referred to as a leader in the BI industry, it has an easy-to-use interface.
- Mode – A collaborative data science platform that incorporates SQL, R, Python, and visual analytics in one single UI.
- Power BI – A Microsoft-based tool that easily integrates with Excel and provides self-service analytics for everyone on your team.
This list is by no means exhaustive, but it will get you started on your search for the right BI layer for your stack.
Data Observability
With data pipelines becoming increasingly complex and organizations relying on data to drive decision-making, the need for data that is ingested, stored, processed, analyzed, and transformed to be trustworthy and reliable has never been higher. Simply put, organizations can no longer afford data downtime, i.e., partial, inaccurate, missing, or erroneous data.
Data observability is an organization’s ability to fully understand the health of the data in their data ecosystem. It eliminates data downtime by applying best practices learned from DevOps to data pipelines, ensuring that the data is usable and actionable.
Your data observability layer must be able to monitor and alert for the following pillars of observability:
- Freshness: is the data recent? When was the last time it was generated? What upstream data is included/omitted?
- Distribution: is the data within accepted ranges? Is it properly formatted? Is it complete?
- Volume: has all the data arrived?
- Schema: what is the schema, and how has it changed? Who has made these changes and for what reasons?
- Lineage: for a given data asset, what are the upstream sources and downstream assets which are impacted by it? Who are the people generating this data, and who is relying on it for decision-making?
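To give a feel for what the first four pillars look like as checks, here is a minimal sketch. The function names, the fixed thresholds, and the sample inputs are all hypothetical; a production observability tool would typically derive thresholds from historical behavior rather than hard-code them.

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated, now, max_age):
    """Freshness: the table was updated within the allowed window."""
    return now - last_updated <= max_age

def check_volume(row_count, expected, tolerance=0.5):
    """Volume: row count is within +/- tolerance of the expected count."""
    return abs(row_count - expected) <= tolerance * expected

def check_schema(current, previous):
    """Schema: return columns added, removed, or retyped since the last snapshot."""
    shared = set(current) & set(previous)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "retyped": sorted(c for c in shared if current[c] != previous[c]),
    }

def check_distribution(values, low, high, max_null_rate=0.1):
    """Distribution: nulls stay under a limit and non-null values fall in range."""
    non_null = [v for v in values if v is not None]
    null_rate = 1 - len(non_null) / len(values) if values else 0.0
    return null_rate <= max_null_rate and all(low <= v <= high for v in non_null)

now = datetime(2022, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = check_freshness(datetime(2022, 1, 2, 6, 0, tzinfo=timezone.utc), now, timedelta(hours=24))
volume_ok = check_volume(row_count=40, expected=100)
schema_diff = check_schema(
    {"id": "int", "amount": "text"},
    {"id": "int", "amount": "float", "note": "text"},
)
dist_ok = check_distribution([10, 12, None, 11], low=0, high=100)
```

In this toy run the table is fresh, but the other three checks would raise alerts: only 40 of an expected 100 rows arrived, the `note` column disappeared while `amount` changed type, and a quarter of the values are null.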
An effective, proactive data observability solution will connect to your existing data platform quickly and seamlessly, providing end-to-end lineage that allows you to track downstream dependencies.
Additionally, it will automatically monitor your data-at-rest without requiring the extraction of data from your data store. This approach ensures you meet the highest levels of security and compliance requirements and scale to the most demanding data volumes.
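Tracking downstream dependencies boils down to walking a graph. The sketch below assumes lineage has already been captured as a simple mapping from each asset to the assets that read from it; the asset names are invented, and real tools usually derive these edges from query logs or pipeline metadata.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets that read from it.
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "raw_customers": ["stg_customers"],
    "stg_orders": ["fct_revenue"],
    "stg_customers": ["fct_revenue"],
    "fct_revenue": ["revenue_dashboard", "finance_report"],
}

def downstream(graph, asset):
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen = set()
    queue = deque([asset])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

impacted = downstream(LINEAGE, "raw_orders")
print(impacted)  # ['fct_revenue', 'finance_report', 'revenue_dashboard', 'stg_orders']
```

If `raw_orders` breaks, this tells you which tables, dashboards, and reports to warn people about before they make a decision on bad data.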
Data Discovery
When building a data platform, most leaders task themselves with choosing (or building) a data catalog, and in our opinion, this approach is no longer sufficient.
Don’t get me wrong: data catalogs are important, and modern data teams need a reliable, scalable way to document and understand critical data assets. But as data becomes increasingly complex and the need for real-time access to reliable data becomes a priority, the processes and technologies underlying this layer of the data platform need to evolve, too.
Where many traditional data catalogs fall short (i.e., often manual, poor scalability, lack of support for unstructured data, etc.), data discovery picks up the slack. If data catalogs are a map, data discovery is your smartphone’s navigation system, constantly being updated and refined with the latest insights and information.
At a bare minimum, data discovery should address the following needs:
- Self-service discovery and automation: Data teams should be able to easily leverage their data catalog without a dedicated support team. Self-service, automation, and workflow orchestration for your data tooling removes silos between stages of the data pipeline and, in the process, makes it easier to understand and access data. Greater accessibility naturally leads to increased data adoption, reducing the load for your data engineering team.
- Scalability as data evolves: As companies ingest more and more data and unstructured data becomes the norm, the ability to scale to meet these demands will be critical for the success of your data initiatives. Data discovery leverages machine learning to gain a bird’s eye view of your data assets as they scale, ensuring that your understanding adapts as your data evolves. This way, data consumers are set up to make more intelligent and informed decisions instead of relying on outdated documentation or worse – gut-based decision making.
- Real-time visibility into data health: Unlike a traditional data catalog, data discovery provides real-time visibility into the data’s current state, as opposed to its “cataloged” or ideal state. Since discovery encompasses how your data is being ingested, stored, aggregated, and used by consumers, you can glean insights such as which data sets are outdated and can be deprecated, whether a given data set is production-quality, or when a given table was last updated.
- Support for governance and warehouse/lake optimization: From a governance perspective, querying and processing data in the lake often occurs using a variety of tools and technologies (Spark on Databricks for this, Presto on EMR for that, etc.), and as a result, there often isn’t a single, reliable source of truth for reads and writes (like a warehouse provides). A proper data discovery tool can serve as that central source of truth.
Data discovery empowers data teams to trust that their assumptions about data match reality, enabling dynamic discovery and a high degree of reliability across your data infrastructure, regardless of domain.
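As one small example of the kind of insight described above, the sketch below flags tables that look safe to deprecate. The metadata fields (`last_updated`, `queries_last_90d`) and the “stale and unused” rule are assumptions for illustration, not a standard that discovery tools follow.

```python
from datetime import date

# Hypothetical metadata a discovery tool might collect per table.
TABLES = [
    {"name": "fct_revenue", "last_updated": date(2022, 3, 1), "queries_last_90d": 412},
    {"name": "tmp_orders_backup", "last_updated": date(2021, 6, 15), "queries_last_90d": 0},
    {"name": "dim_customers", "last_updated": date(2022, 2, 27), "queries_last_90d": 87},
]

def deprecation_candidates(tables, today, stale_after_days=180):
    """Flag tables that are both stale and unused (an assumed rule, not a standard)."""
    return [
        t["name"]
        for t in tables
        if (today - t["last_updated"]).days > stale_after_days
        and t["queries_last_90d"] == 0
    ]

candidates = deprecation_candidates(TABLES, today=date(2022, 3, 2))
print(candidates)  # ['tmp_orders_backup']
```

The point is that the answer comes from how the data is actually being updated and used, not from documentation someone wrote a year ago.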
Data platform vs. customer data platform
A data platform is sometimes confused with a customer data platform. It’s important to note that a customer data platform deals solely with customer-related data.
The Customer Data Platform Institute defines a customer data platform as a “packaged software that creates a persistent, unified customer database accessible to other systems.” Customer data platforms consist of data from various first, second, or third-party sources including web forms, web page activity, social media activity, and behavioral data.
Customer data platforms and data platforms shouldn’t be used interchangeably. They are two entirely different tools, with very different purposes. Customer data platforms exist to create a single source of truth for a customer profile, helping businesses piece together disparate behaviors and information about a given customer to improve experiences or send more targeted communications and advertisements.
Data platforms, on the other hand, aggregate all of a company’s analytical data – both customer data and operational data – to help the business drive better decision making and power digital services.
Build or buy your 6-layer data platform? It depends.
Building a data platform is not an easy task, and there is a lot to take into consideration. One of the biggest challenges our customers face when standing up their data platform is whether they should just build certain layers in-house, invest in SaaS solutions, or explore the wide world of open source.
Our answer? Unless you’re Airbnb, Netflix, or Uber, you generally need to include all three.
There are pros and cons to each of these solutions, but your decision will depend on many factors, including but not limited to:
- The size of your data team. Data engineers and analysts already have enough on their plates, and requiring them to build an in-house tool might cost more time and money than you think. Simply put, lean data teams do not have the time to get new team members up to speed with in-house tools, let alone build them. Investing in easily configurable, automated, or popular solutions (i.e., open-source or low-code/no-code SaaS) is becoming increasingly common among non-Uber/Airbnb/Netflix data teams.
- The amount of data your organization stores and processes. When choosing a solution, it is important to select one that will scale with your business. Chances are, it doesn’t make sense for a lone wolf data analyst at a 20-person company to go with a $10K per year transformation solution if all they need is a few lines of code to do the job.
- Your data team’s budget. If your team is working with a limited budget but many hands, then open source options might be a good fit for you. However, keep in mind you are typically on your own when it comes to setting up and implementing open-source tools across your data stack, frequently relying on other members of the community or the project creators themselves to build out and maintain features. When you take into account that only about 2 percent of projects see growth after their first few years, you have to be careful with what you fork.
- Who’s going to use the tool? If the tool is meant for data engineers, it might make sense to build the tool. If it’s a blend of stakeholders from across the organization, you might be better off buying a user-friendly and collaborative tool.
- What data problems is the tool solving? If the use case is highly specific to your business, it likely makes sense to build the solution in-house. If the tool is solving a common industry problem, you might benefit from the expertise and experience of a third-party vendor.
- What are your data governance requirements? With data governance being top of mind for most organizations in 2022, it is crucial that the solution you choose can meet your business’s needs and comply with regulations such as CCPA and GDPR. Some companies that deal with highly sensitive data are more comfortable building their own solutions to ensure compliance across multiple jurisdictions.
Regardless of which path you choose, building out these layers will give you the foundation to grow and scale and, most importantly, deliver insights and products your company can trust.
And if you’re interested in learning more about Data Observability, reach out to the rest of the Monte Carlo team.