The Future of Data Management: 8 Fast-Growing Trends
There is a rapidly growing but well-defined data engineering echo chamber shaped by a few industry giants and outspoken thought leaders.
Simultaneously, it can be hard to get your nose above the grindstone with the pace and demands placed on your data team. So, how do you go beyond the buzz and above the daily grind to look at the bigger picture of data management?
We define data management as the collection, storage, security, access, usage, and deprecation of data no matter where it is in the pipeline.
Data engineers are most frequently tasked with overseeing the early phases of data management, namely the collection and preparation of data via extraction (E), transformation (T), and loading (L). But the biggest data management trends are taking place across the entire workflow, requiring data leaders to be increasingly proactive and collaborative in order to realize the full value of their data products.
In many ways, keeping up with data management is like readying yourself for the first frost of fall: the beginning of the season may still seem far off, but it makes little sense to delay preparing for these changes when you can see that change is coming.
We believe the future of data management will be red-hot in these eight areas:
- Increasing regulatory requirements
- Data governance
- Data mesh
- Data access governance
- Data transformation or ELT
- Data streaming/transportation
- AI models
- Data quality and data observability
Increasing regulatory requirements
In the past, privacy infringements and lax data protection usually cost companies little more than a slap on the wrist; when fines were incurred, their value was generally fairly low. Those days are long gone, as attitudes toward data protection have changed significantly – and that’s a good thing.
The onus is now very much on corporations to safeguard customer data, payment details, passwords, and so on. Companies that fail to take adequate steps to fulfill compliance, whether intentionally or not, now face stiff penalties.
In 2016, the EU introduced the General Data Protection Regulation (GDPR) with the aim of regulating data protection and privacy throughout Europe. It also states that companies must report data breaches within a narrow window of 72 hours or face strict disciplinary action.
In 2019, Google was fined €50 million for breaching GDPR regulations. In 2021, Amazon Europe was fined €746 million for failing to comply with data processing principles. That’s an increase of more than an order of magnitude in just two years.
It’s worth pointing out that the maximum fine for a GDPR breach can reach 4% of annual global turnover, so it theoretically has no fixed upper limit. What’s even more concerning is that notable violations are being committed by big companies like H&M, WhatsApp, and British Airways, many of which have entire teams dedicated to compliance.
It’s not just the severity of data regulations that is increasing; the regulatory regime is becoming increasingly localized down to the state and even municipality level, too.
In 2018, California passed the California Consumer Privacy Act (CCPA) and it became effective in early 2020. Although the CCPA didn’t go quite as far as GDPR, it’s still a sign of things to come in the US; Virginia, for example, passed its Consumer Data Protection Act (CDPA) in 2021.
It also seems like every country is introducing a new data sovereignty law, requiring that data collected within its borders stay within its borders, be subject to its regulations, or both. Germany, Brazil, and other countries are getting in on the act, encouraging major cloud providers to create sovereign clouds or separate cloud instances specifically for and in their country.
Data governance
As a result of this increasingly severe and fragmented regulatory regime, data governance has become more important than ever, requiring data leaders to be more proactive. Effective data governance now means figuring out which data will be gathered, how and where it will be stored, who can own and access it, and so on.
Data governance can no longer be the purview of the well-meaning but overwhelmed data steward. It needs to be infused across the data team with domain-first principles and standards. Think data mesh. Speaking of…
Data mesh
Many consider the data mesh to be the future of data management. With a data mesh, the monolithic data lake is broken down and its components are decentralized.
The result? Distributed data products, owned by independent cross-functional teams, oriented around data domains.
One of the aims of the data mesh is to empower individual teams, creating a self-serve data platform, so they own information relating to their area of the business.
By embedding the principle of “data as a product”, data is no longer seen as the responsibility of “the data team” but something that the entire business is actively involved in.
Data access governance
Implementing a data mesh can also help with data access governance, which is, in a nutshell, restricting data access to only those who need it, applying the right security measures, and preventing breaches.
Gone are the days of company-wide access to databases via a password stuck up on a Post-It somewhere…
“The most common challenge we hear,” according to Matthew Carroll, CEO of Immuta, a data access platform, “is that organizations are trying to innovate, trying to move their business forward, but there’s a disconnect between IT and the business and they are forced into this decision between being compliant or providing fast access to the data.”
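To make the idea concrete, here is a minimal sketch of tag-based access control for datasets. The roles, domain tags, and policy table are invented for illustration; they are not the API of Immuta or any other real product.

```python
# Illustrative policy: which data domains each role may read.
# Roles and tags here are assumptions for the sake of the example.
POLICY = {
    "analyst": {"sales", "marketing"},          # analysts: business domains only
    "engineer": {"sales", "marketing", "pii"},  # engineers: may also touch PII
}

def can_read(role: str, dataset_domains: set[str]) -> bool:
    """Grant access only if the role covers every domain tag on the dataset."""
    allowed = POLICY.get(role, set())
    return dataset_domains <= allowed

# An analyst may read a sales table, but not one tagged with PII.
print(can_read("analyst", {"sales"}))          # True
print(can_read("analyst", {"sales", "pii"}))   # False
```

The key design choice is that access is evaluated against dataset metadata (domain tags) rather than per-table grants, which is what lets governance scale with a growing catalog.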
Data transformation or ELT
OK, enough about data regulations, access, and governance. Let’s talk about a topic a bit closer to home for most data engineers: data transformation.
In simple terms, data transformation refers to the process of converting data from one format to another. Traditionally, this was done in a very structured manner via hard-coded data pipelines. Now, with automated data pipeline tools like Fivetran and transformation tools like dbt, extraction and transformation no longer require writing a single line of code.
This has allowed data transformation to take place at different stages of the pipeline and less technical data professionals to become more heavily involved. With this increased flexibility, however, come questions about data usability, quality, and modeling.
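The ELT pattern described above can be sketched in a few lines: raw data is loaded first, untouched, and only then transformed inside the warehouse. The records and field names below are invented toy data; tools like Fivetran and dbt do this at scale against real warehouses.

```python
# Toy ELT: land raw records first, then transform them in the "warehouse"
# (represented here by a plain list). All data is illustrative.
raw_events = [
    {"user": " Alice ", "amount": "19.99"},
    {"user": "bob",     "amount": "5.00"},
]

warehouse = list(raw_events)  # Load: raw data lands exactly as extracted

def transform(rows):
    """Transform after loading: normalize names, cast amounts to numbers."""
    return [
        {"user": r["user"].strip().lower(), "amount": float(r["amount"])}
        for r in rows
    ]

clean = transform(warehouse)
print(clean[0])  # {'user': 'alice', 'amount': 19.99}
```

Because the raw copy survives in the warehouse, transformations can be re-run or revised later, which is exactly why ELT gives less technical users room to participate.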
Data streaming/transportation
As the name suggests, data streaming is the transmission of a continuous stream of data from one place to another. In an ideal world, this process happens in real time. Otherwise, you risk colleagues making decisions based on out-of-date or otherwise inaccurate data.
In 2022, we’re seeing applications that are capable of processing data streams in near-real time, producing advanced reports, and even connecting machine learning models. Streaming is becoming a foundational part of how modern applications are built, and this has big implications for the future of data management.
Of course, stream processing requires low latency and the right infrastructure in place to support large quantities of data being transmitted. Apache Kafka is the most well-known player in the streaming space, but others to watch out for include Redpanda, Quix, and Datastax.
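The core idea of stream processing can be shown without any infrastructure at all. The sketch below aggregates a simulated stream with a tumbling (non-overlapping) window; in production, Kafka or Redpanda would provide the durable transport, but the windowing logic is the same. The stream values and window size are assumptions for illustration.

```python
# Pure-Python sketch of tumbling-window aggregation over a stream.
# In a real system the iterable would be a consumer reading from a topic.
def tumbling_window_sums(events, window_size=3):
    """Yield the sum of each consecutive, non-overlapping window of events."""
    window = []
    for value in events:
        window.append(value)
        if len(window) == window_size:
            yield sum(window)
            window = []  # start the next window

stream = iter([2, 4, 6, 1, 1, 1, 9])       # stand-in for a live topic
print(list(tumbling_window_sums(stream)))  # [12, 3] — trailing partial window dropped
```

Because results are emitted as each window closes, consumers see aggregates with bounded delay rather than waiting for a batch job, which is the property that makes streaming attractive for near-real-time reporting.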
AI models
Historically, many facets of data management have been handled manually – an individual, or team of individuals, working away on dense spreadsheets and storing that data locally. Today, artificial intelligence is capable of identifying and eliminating incomplete, duplicate, or irrelevant data that’s stored across multiple clouds.
You don’t need to be a data scientist to harness the power of AI to get the most from your data. Tools like Monkeylearn, Trifacta, and Polymer make data wrangling dead simple and will only become more widespread.
The more interesting trend is that companies are increasingly using AI models to generate synthetic data that is statistically equivalent to their actual data.
Why would they do this? Synthetic, production-like data gives developers realistic data to build against while keeping an organization’s governance and compliance teams happy. It’s a trend to watch, with companies like Tonic.ai and Gretel.ai growing fast.
Data quality and data observability
The value of making decisions based on data is negligible if those decisions aren’t based on good data. But what exactly constitutes good data? Completeness, accuracy, consistency, and validity are all important, as are the five pillars of data observability: freshness, schema, volume, distribution, and lineage.
Data observability doesn’t (just) refer to being able to see your data, but also to understand the health of that data. The emphasis for data observability platforms is on empowering data teams to identify and resolve issues relating to data quality and reducing data downtime.
Doing this effectively requires:
- Automated machine learning monitors to detect data incidents at scale
- Fine-tuned alerting to intelligently notify data owners and impacted downstream consumers when incidents occur
- Field-level lineage to enable data engineering teams to triage and resolve incidents
A complete data observability platform should also help data teams prevent bad data altogether by identifying at-risk data sets and queries. In other words, it has a lot in common with software reliability engineering. The only difference is that it’s data being analyzed rather than code.
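Two of the pillars above, freshness and volume, can be monitored with very little code. The sketch below is a toy version under assumed thresholds (a six-hour freshness limit, a 50% volume-drop alarm); real observability platforms learn these thresholds automatically from historical behavior.

```python
# Toy data-observability monitors for freshness and volume.
# Thresholds and table metadata are illustrative assumptions.
from datetime import datetime, timedelta, timezone

def check_freshness(last_updated, max_age=timedelta(hours=6)):
    """True if the table was updated within the allowed window."""
    return datetime.now(timezone.utc) - last_updated <= max_age

def check_volume(row_counts, drop_threshold=0.5):
    """True unless the latest row count fell below half the recent average."""
    *history, latest = row_counts
    baseline = sum(history) / len(history)
    return latest >= baseline * drop_threshold

fresh = check_freshness(datetime.now(timezone.utc) - timedelta(hours=1))
healthy_volume = check_volume([1000, 990, 1010, 480])  # sudden drop
print(fresh, healthy_volume)  # True False
```

In practice, a failed check would page the table’s owner and, via lineage, flag every downstream dashboard built on it, rather than just printing a boolean.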
Be ready for the future of data management
Although we talk about the future of data management in this post, that’s not quite accurate; the “future” of data management is already here. The good news is that there are more ways than ever before to streamline and automate these processes. Data leaders who take the time to understand and embrace these cutting-edge data management practices are already well on their way.
If you’re worried about being left behind, book a demo with us to see how you can level up your data management with end-to-end data observability.