
Data Vault Architecture: Everything You Need to Know Before You Build

By Michael Segner

Over the past several years, data warehouses have evolved dramatically, but that doesn’t mean the fundamentals underpinning sound data architecture need to be thrown out the window.

In fact, with increasingly strict data regulations like GDPR and a renewed emphasis on optimizing technology costs, we’re now seeing a revitalization of “Data Vault 2.0” data modeling. 

While data vault has many benefits, it is a sophisticated and complex methodology that can present challenges to data quality. In this blog post, we’ll dive into data vault architecture, the challenges and best practices for maintaining data quality, and how data observability can help.

What is data vault architecture?

Data vault architecture is a data warehouse design methodology that prioritizes adaptability and historical accuracy over query performance. Created by Dan Linstedt in the 1990s, this approach emerged from the need to handle rapidly changing business environments and increasing data complexity that traditional warehousing methods struggled to accommodate.

The core principle of data vault centers on building a flexible, auditable foundation that can absorb new data sources and business rule changes without disrupting existing structures. Rather than optimizing for specific reports or departments, data vault creates an enterprise-wide integration layer that preserves all data relationships and history. This methodology enables organizations to respond quickly to new requirements while maintaining a complete audit trail of how their data has changed over time.

Three key components of data vault architecture

Data vault’s power comes from its elegant three-part architecture that separates business entities, relationships, and descriptive attributes into distinct entity types. Understanding how Hubs, Links, and Satellites work together is essential for implementing an effective data vault solution.

A data vault model example. Image courtesy of the author.

Hubs

Hubs form the foundation of any data vault model by storing unique business keys that identify core business entities. Each Hub contains only the business key, a hash key for performance, a load timestamp, and a record source; nothing more. This minimal structure ensures that Hubs remain stable even as business rules and descriptive attributes change over time.

The business key focus is what makes Hubs so powerful and enduring. A customer Hub might store customer numbers, a product Hub contains product codes, and a location Hub holds store identifiers. These keys rarely change throughout the lifetime of the business entity they represent.

By stripping away all descriptive information and storing only identifiers, Hubs achieve remarkable stability. New source systems can be integrated without modifying Hub structures, and business transformations don’t require Hub redesigns. This immutability makes Hubs the perfect foundation for building an enterprise-wide integration layer.
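
To make this concrete, here is a minimal sketch of what a customer Hub might look like as Snowflake-style DDL issued from Python. The table and column names (hub_customer, customer_bk, and so on) are illustrative conventions, not taken from any specific implementation:

```python
# Illustrative sketch of a Hub: business key, hash key, and load
# metadata only, with no descriptive attributes. Names are hypothetical.
HUB_CUSTOMER_DDL = """
CREATE TABLE IF NOT EXISTS hub_customer (
    customer_hk    CHAR(32)      NOT NULL,  -- hash of the business key
    customer_bk    VARCHAR(50)   NOT NULL,  -- business key (customer number)
    load_dts       TIMESTAMP_NTZ NOT NULL,  -- load timestamp
    record_source  VARCHAR(100)  NOT NULL,  -- originating source system
    PRIMARY KEY (customer_hk)
);
"""
```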

Links

Links capture the relationships between business entities by connecting two or more Hubs together. A Link table contains the hash keys from its related Hubs, its own hash key, plus the standard metadata fields. Like Hubs, Links contain no descriptive attributes; they purely represent that a relationship exists.

This design provides extraordinary modeling flexibility compared to traditional foreign key relationships. Links can connect multiple Hubs (not just two), representing complex business relationships like “customer purchased product at location using promotion.” They can also capture the same relationship from multiple source systems without conflict.

The relationship-only design means Links can be added without modifying existing structures. New business processes that create previously untracked relationships simply require new Link tables. This flexibility allows the data model to grow organically with the business while maintaining referential integrity across all connections.
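
Continuing the sketch, a hypothetical Link for the “customer purchased product at location” relationship would reference three parent Hubs at once; again, every name here is illustrative:

```python
# Illustrative sketch of a Link: hash keys of the related Hubs plus
# standard metadata, and no descriptive attributes. Names are hypothetical.
LINK_PURCHASE_DDL = """
CREATE TABLE IF NOT EXISTS link_purchase (
    purchase_hk    CHAR(32)      NOT NULL,  -- hash of all parent business keys
    customer_hk    CHAR(32)      NOT NULL,  -- points to hub_customer
    product_hk     CHAR(32)      NOT NULL,  -- points to hub_product
    store_hk       CHAR(32)      NOT NULL,  -- points to hub_store
    load_dts       TIMESTAMP_NTZ NOT NULL,
    record_source  VARCHAR(100)  NOT NULL,
    PRIMARY KEY (purchase_hk)
);
"""
```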

Satellites

Satellites store all the descriptive attributes about Hubs and Links, providing context and detail about business entities and their relationships. Each Satellite hangs off a single Hub or Link and contains descriptive attributes from one source system. Multiple Satellites can attach to the same Hub or Link, segregating data by source or subject area.

The true power of Satellites lies in their historization capability. Every Satellite record includes effective dates and end dates, creating a complete timeline of how attributes changed. When a customer’s address changes, the old record remains with an end date, and a new record begins with the updated information.

This design enables several critical capabilities for modern data management. Source system changes can be isolated to specific Satellites without impacting others. Performance optimization can target frequently-accessed Satellites while archiving historical ones. Most importantly, the separation of volatile attributes from stable keys means that the vast majority of data warehouse changes only affect Satellites, not the core Hub-Link structure.
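
Completing the sketch, a hypothetical Satellite hangs off the customer Hub and carries the volatile attributes plus the effective-date columns that drive historization. The hash_diff column, a common Data Vault 2.0 convention for change detection, is included here as an assumption about the implementation:

```python
# Illustrative sketch of a Satellite: descriptive attributes with
# effective dating. Each change inserts a new row and the prior row
# is end-dated. Names are hypothetical.
SAT_CUSTOMER_DETAILS_DDL = """
CREATE TABLE IF NOT EXISTS sat_customer_details (
    customer_hk    CHAR(32)      NOT NULL,  -- parent Hub's hash key
    load_dts       TIMESTAMP_NTZ NOT NULL,  -- version effective date
    load_end_dts   TIMESTAMP_NTZ,           -- NULL while version is current
    hash_diff      CHAR(32)      NOT NULL,  -- hash of attributes, for change detection
    name           VARCHAR(200),
    address        VARCHAR(500),
    record_source  VARCHAR(100)  NOT NULL,
    PRIMARY KEY (customer_hk, load_dts)
);
"""
```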

The top six key benefits of a data vault architecture

Data vault architecture has emerged as a powerful methodology for building enterprise data warehouses that can adapt to changing business needs while maintaining data integrity and historical accuracy. The following six benefits demonstrate why companies have chosen data vault as their foundation for scalable, auditable, and business-aligned data management.

Scalability and flexibility

Data vault’s hub-and-spoke architecture provides exceptional scalability by allowing organizations to incrementally expand their data warehouse without disrupting existing structures. When new data sources need integration, teams can simply add new Satellite tables to existing Hubs, or create new Hub and Link structures for entirely new subject areas. This plug-and-play capability dramatically reduces the time and effort required to onboard new data sources. Organizations can respond quickly to changing analytical needs without compromising the integrity of existing data structures.

Pie Insurance, a leading small business insurtech, leverages a Data Vault 2.0 architecture (with some minor deviations) to achieve their data integration objectives around scalability and use of metadata.

“A data vault data model is intended to be a physical implementation of the organization’s business model so it becomes a standard place to plug-in data from multiple sources. As new data is added into our data warehouse, we are able to plug in the data to our model by adding Satellite tables. Or as new subject areas become in-scope, we can add new Hub and Link tables based on our business model,” said Ken Wood, staff data engineer in data architecture at Pie.

“The other advantage is because we follow a standard design, we are able to generate a lot of our code using code templates and metadata. The metadata contains our data mappings and the code templates contain the expected structure of our ETL code scripts/files,” he said.
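
The article doesn’t publish Pie’s templates, but a toy version of metadata-driven generation might look like the following: a mapping record feeds a code template, so changing the metadata in one place regenerates the ETL code everywhere it is used. All names here are hypothetical:

```python
# Toy sketch of metadata-driven ETL generation. A real generator would
# cover Links and Satellites too; this renders a single Hub load script.
hub_metadata = {
    "hub_table": "hub_customer",
    "hash_column": "customer_hk",
    "bk_column": "customer_bk",
    "source_table": "stg_crm_customers",
    "source_bk": "customer_number",
    "record_source": "CRM",
}

HUB_LOAD_TEMPLATE = """
INSERT INTO {hub_table} ({hash_column}, {bk_column}, load_dts, record_source)
SELECT MD5(UPPER(TRIM(src.{source_bk}))),
       src.{source_bk},
       CURRENT_TIMESTAMP(),
       '{record_source}'
FROM {source_table} src
LEFT JOIN {hub_table} h ON h.{bk_column} = src.{source_bk}
WHERE h.{bk_column} IS NULL;  -- insert-only: add business keys not seen before
"""

print(HUB_LOAD_TEMPLATE.format(**hub_metadata))  # the generated load script
```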

Complete historical tracking

This flexible architecture also enables data vault to maintain complete historical records through its Satellite table design. Every change to every attribute is captured over time, creating an immutable audit trail. Each record includes load timestamps and source system information for precise tracking.

Unlike traditional approaches that may overwrite data or maintain limited history, data vault preserves everything. Organizations can analyze patterns over time and reconstruct exact data states from any point in the past. This capability proves invaluable for trend analysis, compliance reporting, and troubleshooting data issues.

The temporal depth goes further than simple record keeping. Business relationships captured in Link tables also maintain full history, showing how connections between entities have changed. This complete picture enables sophisticated analytics that reveal not just what changed, but how business relationships shifted over time.
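
Using the hypothetical tables sketched earlier, reconstructing a past state is a matter of filtering the Satellite on its effective-date range. A minimal as-of query might look like this:

```python
# Illustrative "as-of" query: each customer's attributes exactly as
# they stood on 2023-06-30, using the Satellite's effective dating.
AS_OF_QUERY = """
SELECT h.customer_bk, s.name, s.address
FROM hub_customer h
JOIN sat_customer_details s
  ON s.customer_hk = h.customer_hk
WHERE s.load_dts <= '2023-06-30'
  AND (s.load_end_dts IS NULL OR s.load_end_dts > '2023-06-30');
"""
```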

Business-aligned data organization

The historical tracking capabilities work seamlessly with data vault’s business-centric modeling approach, where Hubs represent core business entities, Links capture relationships, and Satellites store descriptive attributes using familiar business terminology. This alignment makes the data warehouse a true reflection of how the organization operates, allowing both technical staff and business users to navigate data intuitively. The shared vocabulary reduces miscommunication, accelerates project delivery, and ensures that the data model remains relevant as the business grows.

Rapid development through automation

This consistent business-aligned structure creates opportunities for significant automation in the development process. The predictable patterns of Hubs, Links, and Satellites enable organizations to create reusable templates and code generators. Many organizations report reducing development time by 50-80% through this standardization.

Teams can focus on implementing business logic rather than writing repetitive ETL code. The standardized patterns also improve code quality and reduce bugs since the same proven templates are used repeatedly. This consistency makes maintenance easier and helps new developers quickly become productive.

Built-in audit and compliance readiness

The automation and standardization naturally extend to data vault’s audit capabilities, where every record automatically captures its load timestamp, source system identifier, and business key information without additional development effort. Data vault preserves all historical states through insert-only operations, allowing organizations to demonstrate compliance with data retention requirements and reconstruct any historical report for regulatory review. In regulated industries, this built-in ability to prove data provenance and accuracy transforms compliance from a burden into a standard feature.

Foundation for agile analytics

All these benefits culminate in a data vault serving as a stable foundation for analytical flexibility. The raw vault layer remains unchanged while multiple analytical structures can be built on top. This separation of concerns is fundamental to data vault’s value proposition.

Teams can create traditional dimensional models, denormalized reporting tables, or direct query access as needed. Different departments can have their own analytical views without interfering with each other. The same vault data can power executive dashboards, operational reports, and data science initiatives simultaneously.

This approach supports both structured business intelligence and exploratory self-service analytics without compromise. As business questions change, new analytical structures can be quickly created without rebuilding the underlying foundation. Existing reports continue to function while new capabilities are added, eliminating the disruption that plagues traditional data warehouse projects.

How do you implement a data vault architecture?

While deployments will vary based on organizational needs and technical environments, Pie Insurance’s data vault implementation provides a clear example of how to structure a complete data pipeline ecosystem. Their architecture consists of four conceptual layers that progressively refine data from raw source extracts to business-ready information.

Ingestion layer

The ingestion layer serves as the entry point for all source system data, maintaining it in its most raw form possible. Landing zones use AWS S3 buckets to receive source files exactly as they arrive from operational systems. These files then move to staging tables in Snowflake, where raw data is stored in VARIANT columns that preserve the original structure without transformation.

This approach ensures complete auditability and allows for reprocessing if business rules change. The VARIANT column format in Snowflake provides flexibility to handle semi-structured data formats like JSON or XML without predefined schemas. By keeping data truly raw at this stage, organizations maintain the ability to adapt to changing interpretations of source data.
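
A hedged sketch of this ingestion pattern, assuming an external stage over the S3 landing bucket and hypothetical object names, might look like this:

```python
# Illustrative ingestion sketch: a staging table with a single VARIANT
# column, then a COPY from an S3-backed external stage. The stage and
# table names are hypothetical.
STAGING_DDL = """
CREATE TABLE IF NOT EXISTS stg_crm_customers_raw (
    raw_record     VARIANT,       -- source payload, preserved untouched
    load_dts       TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    record_source  VARCHAR(100)
);
"""

COPY_FROM_S3 = """
COPY INTO stg_crm_customers_raw (raw_record, record_source)
FROM (SELECT $1, 'CRM' FROM @crm_landing_stage)   -- external stage over S3
FILE_FORMAT = (TYPE = 'JSON');
"""
```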

Curation layer

The curation layer organizes raw data into the formal data vault structure while maintaining its business context. The raw data vault within Snowflake applies minimal transformations to map source data into Hub, Satellite, and Link tables following Data Vault 2.0 methodology. These transformations include only technical necessities like hash key generation and metadata addition.

The key differentiator in Pie’s approach is designing their data vault as a physical implementation of their business data model rather than mirroring source system structures. This creates a single unified model that all sources conform to, regardless of their original format or structure. New source systems map to this existing business model rather than creating new parallel structures.
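
One of those technical necessities, hash key generation, is simple to illustrate. Data Vault 2.0 implementations commonly hash normalized business keys; the normalization rules below (trim, upper-case, a “||” delimiter) are common conventions rather than requirements:

```python
import hashlib

def hash_key(*business_keys: str) -> str:
    """Deterministic hash key from one or more business key parts."""
    # Normalize before hashing so 'abc ' and 'ABC' yield the same key;
    # the delimiter guards against ambiguous concatenations.
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

customer_hk = hash_key("C-10042")                        # Hub hash key
purchase_hk = hash_key("C-10042", "P-7781", "STORE-03")  # Link hash key
```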

Transformation layer

The transformation layer applies business logic to create analytically useful datasets while maintaining data vault principles. The business vault contains pre-transformed data that follows established business rules but still maintains the Hub-Link-Satellite structure. This allows complex business logic to be consistently applied while preserving auditability.

The information warehouse represents a departure from pure data vault, implementing dimensional star schemas optimized for reporting. This hybrid approach acknowledges that while data vault excels at integration and history tracking, dimensional models remain superior for analytical queries. The transformation layer bridges these two paradigms, providing the best of both approaches.

Presentation layer

The presentation layer focuses on making data accessible to business users through familiar tools and interfaces. Pie uses Looker as their primary BI tool, with its semantic layer mapping directly to the information warehouse’s dimensional structures. This abstraction shields users from the complexity of the underlying data vault while leveraging its benefits.

The architecture anticipates future growth by designing for tool independence at this layer. New reporting or querying tools can plug into the already-transformed information warehouse without major development efforts. Dynamic rules handle calculations that need to adjust based on different aggregation levels or user perspectives.

This separation of concerns means that raw data flows from left to right through increasingly refined stages. As Ken Wood from Pie Insurance explains, “We think of our architecture from left to right. The far left is data in its most raw form and the far right is information that has been fully transformed, cleansed, and is ready to be consumed by the business.” This progression ensures that each layer serves its specific purpose without compromising the integrity or flexibility of the overall system.

Data vault vs star schema

While star schemas have long been the standard for data warehouse design, data vault offers a fundamentally different approach that addresses many traditional limitations. Each has distinct strengths, and understanding their differences helps organizations choose the right approach for their specific needs.

Structural differences

Star schemas organize data into fact and dimension tables, creating a denormalized structure optimized for query performance. Facts contain measurable events while dimensions provide descriptive context. This simplicity makes star schemas intuitive for business users and efficient for reporting tools.

Data vault separates concerns into Hubs, Links, and Satellites, maintaining a normalized structure that prioritizes flexibility and auditability. Rather than optimizing for specific queries, data vault optimizes for change and integration. This fundamental difference in design philosophy leads to very different implementation patterns.

Handling change

Star schemas struggle with structural changes because modifications often require rebuilding entire fact tables or dimensions. Adding new attributes might mean reprocessing years of historical data. Changes in business rules can cascade through multiple tables, creating maintenance nightmares.

Data vault excels at accommodating change through its modular design. New attributes simply require new Satellite tables without touching existing structures. Business rule changes affect only the specific Satellites where those rules apply, leaving the rest of the model untouched.

Historical tracking

Traditional star schemas typically implement slowly changing dimensions (SCDs) to track history, but these approaches have limitations. Type 2 SCDs can cause fact table complications, while Type 1 overwrites history entirely. Managing different SCD types across dimensions adds complexity.

Data vault automatically maintains complete history in every Satellite table. All changes are preserved with full auditability and no special handling required. This built-in historization eliminates the need to choose between different SCD types or implement complex temporal logic.

Load performance

Star schemas often require complex ETL processes with multiple passes to handle lookups, surrogate key generation, and dimension updates. Dependencies between dimension and fact loads can create bottlenecks. Late-arriving dimensions or facts require special handling routines.

Data vault uses insert-only patterns that enable highly parallel loading. Hubs, Links, and Satellites can load independently without complex dependencies. This parallelization significantly reduces load windows and simplifies error recovery, as failed loads can simply be rerun without affecting other tables.
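
A hedged sketch of that insert-only pattern, reusing the hypothetical tables from earlier (stg_customer_deltas is an assumed staging feed): incoming records are compared to the current Satellite row by hash_diff, and only new or changed records are inserted.

```python
# Illustrative insert-only Satellite load: no updates or deletes, just
# appended versions, which is what makes parallel loads and reruns safe.
SAT_DELTA_LOAD = """
INSERT INTO sat_customer_details
    (customer_hk, load_dts, hash_diff, name, address, record_source)
SELECT src.customer_hk, CURRENT_TIMESTAMP(), src.hash_diff,
       src.name, src.address, src.record_source
FROM stg_customer_deltas src
LEFT JOIN sat_customer_details cur
  ON cur.customer_hk = src.customer_hk
 AND cur.load_end_dts IS NULL               -- current version only
WHERE cur.customer_hk IS NULL               -- brand-new entity
   OR cur.hash_diff <> src.hash_diff;       -- attributes changed
"""
```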

Query complexity

Star schemas shine in query simplicity, requiring minimal joins for most analytical queries. Business users can easily write queries against star schemas without deep technical knowledge. Reporting tools naturally understand star schema patterns, making integration straightforward.

Data vault requires more complex queries due to its normalized structure and historized Satellites. Accessing current data often involves multiple joins and filtering on effective dates. However, this complexity is typically hidden from end users through views or semantic layers that present data in familiar formats.
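
For example, a current-state view over the hypothetical customer tables can hide the effective-date logic entirely, so consumers query it like an ordinary table:

```python
# Illustrative view exposing only the current version of each customer,
# hiding the historization logic from end users.
CURRENT_CUSTOMER_VIEW = """
CREATE OR REPLACE VIEW v_customer_current AS
SELECT h.customer_bk, s.name, s.address
FROM hub_customer h
JOIN sat_customer_details s
  ON s.customer_hk = h.customer_hk
WHERE s.load_end_dts IS NULL;   -- only the current Satellite row
"""
```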

When to use each approach

Star schemas work best for stable, well-understood reporting requirements where query performance is critical. Departments with fixed KPIs and established business rules benefit from star schema simplicity. Small to medium data warehouses with predictable growth patterns are ideal candidates.

Data vault suits organizations facing frequent change, multiple source systems, or strict compliance requirements. Companies undergoing mergers, acquisitions, or digital transformations need data vault’s flexibility. The approach also excels when building enterprise-wide integration platforms that must serve diverse analytical needs.

Many organizations combine both approaches, using data vault as the integration layer and building star schemas downstream for reporting. This hybrid approach leverages data vault’s flexibility for data integration while providing star schema simplicity for end users. The choice isn’t always either-or; it’s about using the right tool for each layer of your architecture.

Data quality faults with your data vault

A gymnast faulting on a vault. Get it? Image via Shutterstock.

There are many benefits to data vault architecture, but it does create more tables with more complex transformations and relationships between upstream and downstream assets than other methodologies. This can create data quality challenges if not addressed properly.

Some challenges can include:

Code maintenance

The ETL code for Hub, Satellite, and Link tables must follow the same rules for common column value definitions (like business and hash key definitions) to enable them to load independently. As a result, any changes to code may have to be done in multiple places to ensure consistency. 

One tip? “Our code generation is metadata driven so when we change the metadata in one place it regenerates the ETL code wherever that particular metadata is used,” said Ken.

Multiple, complex transformations between layers

Transformations are a necessary step in any data engineering pipeline using any methodology, but they can create data quality incidents. This can happen either when transformation code is modified (perhaps incorrectly) or when the input data isn’t aligned with the underlying expectations of the transformation model (perhaps there was an unexpected schema change, or the data didn’t arrive on time).

Long blocks of transformation code at multiple layers within a data vault architecture can compound these errors and make root cause analysis more difficult. A best practice here is to keep transformations as simple as possible. 

“We are working to evolve our design to apply complex transformations in only one place, the Information Warehouse, within the data pipeline,” said Ken. “As the raw data vault scales the transformation logic becomes more complex, so we are designing ways to reduce complexity.”

Any load failures (or other errors) in the data vault Hub, Link, and Satellite tables will mar downstream queries with partial or missing data.

“The key is to have automated validation checks or data observability in place to detect these anomalies when they happen,” said Ken.

In this data vault diagram, red dots indicate where a transformation may introduce data quality issues. Image courtesy of the author.

Understanding dependencies

Mapping all the dependencies within a data vault architecture can be challenging, especially if you don’t understand the source data well enough to map it to the target Hub, Satellite, and Link tables, or if the source systems contain additional, unexpected business keys that aren’t in the target model.

“We deal with this by using multi-active Satellite tables. These add a key to the Satellite table to match the grain,” said Ken. “Or, we add Hub and Link tables as new keys are introduced and it aligns with our business model.”

Scaling testing

Developing and maintaining a series of data unit tests or data quality checks across your data warehouse is already a headache, but with data vault it’s especially difficult. 

That’s because no human can possibly anticipate and write tests for all the ways data can break, and even if they could, it would be virtually impossible to scale those tests across all of the tables and pipelines in your environment.

That is especially true of data vault architectures, since there is 3x the surface area to cover, and the multiple layers and transformations add even more unknown unknown data quality issues.

How data observability improves data vault reliability

Data observability tools can address data vault data quality challenges in several key areas:

  • One of the hallmarks of data vault architecture is that it “collects 100% of the data 100% of the time,” which can make backfilling bad data in the raw vault a pain. Data observability reduces time to detection, enabling data teams to close the spigot of broken pipelines and stop bad data from flowing into the raw vault, thereby reducing the backfilling burden. Even better, using strategies such as circuit breakers and health insights, data teams can prevent issues from occurring in the first place.
  • From raw data landing zones down to reporting tables, data observability solutions can make sure that your range of numbers and types of values are as expected.
  • Transformation queries that move data across layers can be monitored to make sure they run at the expected times with the expected load volumes, defined in either rows or bytes; a minimal hand-rolled sketch of such a volume check follows this list. Monte Carlo features like automatic data lineage and query change detection enable Pie (and other organizations utilizing data vault) to greatly accelerate their root cause analysis. No longer is it necessary to manually trace tables upstream or determine which changes to a large SQL query introduced data anomalies.
  • With more table and column references created by the data vault architecture, the need to monitor for schema changes, such as table or column name changes, also increases.
  • Finally, data observability tools should be easy to implement across your entire stack, and should continue to monitor beyond the initial implementation, so that Satellites and Hubs added in the future can be certified as safeguarded without more stakeholder review meetings and time spent implementing additional tests.
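
Commercial data observability platforms automate checks like these out of the box, but as a minimal hand-rolled stand-in, a load-volume anomaly check over one of the hypothetical Satellites sketched earlier might look like the following (the three-sigma threshold is an arbitrary illustrative choice):

```python
# Minimal illustrative volume check: alert when today's Satellite load
# volume deviates more than three standard deviations from history.
ROW_COUNT_ANOMALY_CHECK = """
WITH daily AS (
    SELECT CAST(load_dts AS DATE) AS load_day, COUNT(*) AS rows_loaded
    FROM sat_customer_details
    GROUP BY CAST(load_dts AS DATE)
),
baseline AS (
    SELECT AVG(rows_loaded) AS avg_rows, STDDEV(rows_loaded) AS sd_rows
    FROM daily
    WHERE load_day < CURRENT_DATE()
)
SELECT d.load_day, d.rows_loaded
FROM daily d CROSS JOIN baseline b
WHERE d.load_day = CURRENT_DATE()
  AND ABS(d.rows_loaded - b.avg_rows) > 3 * b.sd_rows;
"""
```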

Avoid faults to get the data quality score you deserve

For some reason this judge is not impressed that there is a human being twirling sideways in the air. Image via Shutterstock.

Gymnasts perform in front of some tough judges (like this judge from the 2016 Rio Olympics!), but Dan from Finance is an even harsher critic when his quarterly reports are wrong. 

Following the advice of leaders in the data vault space, like Pie, is a good next step. For Ken, success comes down to constant alignment with the business and the business model.

“Avoid the temptation to just model Data Vault tables to easily fit the data coming from a source system – unless that source system’s data model was built based on your business’s data model,” he advised.

Your team has invested significant time and expertise into developing and maintaining a data vault architecture, and ensuring data trust at every step of the way will let you reap the fruits of your hard-earned labor.


Interested in how data observability can help the data quality challenges posed by your data vault architecture? Schedule a time to talk to us below.

Our promise: we will show you the product.

Frequently Asked Questions

What are the benefits of a data vault?

Benefits of data vault include suitability for auditing, the ability to quickly redefine relationships, easy addition of new datasets, better organization of data, fast speed-to-insights, and the ability to search and query historical data changes.

What is the data vault information layer?

The data vault information layer refers to the layer where pre-transformed data, following business transformation rules, is stored. This layer typically follows the dimensional (Kimball) star schema model and serves as the foundation for reporting and business intelligence tools.

What is the difference between data warehouse and data vault?

A data warehouse is a broader term referring to a system used for reporting and data analysis, while a data vault is a specific modeling methodology used within a data warehouse to organize raw data into a structure that can feed dimensional models.

What are the layers of data vault architecture?

The layers of data vault architecture include: Ingestion Layer (Landing and Staging raw data), Curation Layer (Organizes raw data into the raw data vault and business data model), Transformation Layer (Transforms and cleans data using business logic), and Presentation Layer (Reporting layer for the majority of users).