Data Warehouse Migration Best Practices
So, you’re planning a cloud data warehouse migration. A worthy quest to be sure. But be warned, a warehouse migration isn’t for the faint of heart.
As you probably already know if you’re reading this, a data warehouse migration is the process of moving data from one warehouse to another. In the old days, data warehouses were bulky, on-prem solutions that were difficult to build and equally difficult to maintain. But the advent of cloud data warehouses like Snowflake has changed the game. Now data teams everywhere are migrating from on-prem solutions to more manageable—and sophisticated—cloud solutions to do everything from reducing costs to increasing productivity.
Now, on its face, a data warehouse migration might seem like a relatively straightforward operation. You take the data from one place, you put it in another place. But in reality, a data warehouse migration to cloud solutions like Snowflake and Redshift requires a tremendous amount of preparation to be successful—from schema changes and data validation to a carefully executed QA process. And that doesn’t even consider the organizational headwinds.
In fact, a data warehouse migration is often one of the most complex and impactful projects in a data engineer’s career. And how you plan for it is the first step to success.
Whether you’re migrating from an on-prem solution to a cloud-based warehouse or from one cloud provider to another, a successful migration starts with building your data warehouse migration strategy. In this post, we’ll examine some best practices for migrating your data to a cloud solution, show you how to develop your own migration strategy, and take a closer look at some popular cloud warehouse solutions that you might consider for your platform.
Let’s jump in!
- How to plan your data warehouse migration strategy
- How to migrate your data warehouse to the cloud
- What to know about each cloud data warehouse
Plan Your Data Warehouse Migration Strategy
From discovering duplicate data to latency and broken pipelines, a migration is stressful under the simplest of circumstances. And when you finally do stand up your new data warehouse there’s still plenty of QA waiting on the other side. So the first step to a successful data warehouse migration is developing your data warehouse migration strategy.
While every migration will look a little different depending on your source and target warehouse, every successful migration strategy will include the following four elements:
- Preparation and discovery
- Data warehouse design
- Data warehouse security
- Data validation
So, let’s take a look at those in a bit more detail.
Data migration preparation and discovery
Part one of a strong data warehouse migration strategy is preparation and discovery. During this phase, you’ll set the goals of your migration, identify stakeholders, define technical requirements, and set your budget.
Define your goals
Data migrations take a tremendous amount of time and resources to do successfully. So before you dive head-first into the migration process, it’s important to define the why of your migration. How will your organization benefit from this investment?
Not only will this help guide the decisions you’ll make as you move through the migration process—but it will also help determine if the BI juice is really worth the squeeze.
Here are some things to consider:
- What are you trying to achieve with a migration? Better governance? Facilitating self-service data? Integrating new tooling?
- What are the competitive advantages you’re trying to achieve?
- What does success look like?
- Which functions of your organization stand to benefit from this migration? How will they benefit and what are the KPIs?
- Is cost-savings a goal? If so, how much? Are you trying to optimize headcount or scale performance?
- What are the goals as they relate to governance and security?
- How should your business intelligence improve as a result of this migration?
- Are there any additional reporting features or functions like ML or AI that a migration is intended to enable?
Set your budget
While you’ll likely save on a data warehouse migration to cloud tooling, the migration process will create costs of its own. So, it’s important to know what you need to spend—and what you can spend.
This is also an opportunity to make or break stakeholder buy-in, so be sure to consider the budgeting culture of your company carefully. Does your finance team prefer capital expenses or operational? You’ll need stakeholder support if your migration is going to success, so structure your budget in a way that considers their needs and motivations.
Get the right people involved
The budget is just one example of how teams and partners can make or break a migration. Migrations require support from everyone from data engineers and stakeholders to cross-functional partners in order to be successful, so it’s critically important to get the right people around the table early.
Who needs to approve this project? Who will be interrupted by it? What teams will be using your new data warehouse? What will they need access to and when? And who’s responsible for managing the platform post-launch?
This is also the right time to consider your team’s skill-set. A data warehouse migration requires specialized skills that not every engineer is likely to have. Evaluate what training and skills will be required pre-, mid-, and post-migration and consider training or staffing up before your project begins.
Data Warehouse Design Best Practices
Once you understand the scope and budget of your new cloud warehouse, you’re ready to move into the design phase. Choosing the right cloud solution out of the gate is critical, so let’s look at some data warehouse design best practices.
First, start by defining how you’ll utilize your cloud solution. A database that’s optimized for the cloud will allow you to get the most out of your new data warehouse. Snowflake is an excellent example of a database that’s been optimized to leverage the benefits of a cloud-architecture. Amazon Redshift and Google BigQuery are other solid examples.
Consider the goals you defined during the preparation and discovery phase, and select several cloud data warehouse options for a deeper dive based on what you want to achieve. Then have a consultative discovery call with your top candidates to determine which solution is best suited for your needs. as defined in the preparation phase of your data migration.
A data warehouse migration also brings with it the opportunity to reimagine your formats and databases to better fit your BI and data product needs. Again, consider the goals you defined during your preparation and discovery phase and let those learnings guide the evolution of your architecture.
A few other factors to consider when designing a new data warehouse include:
- The location of your users. Your database may be in the cloud, but the server that hosts it has a physical location. Choosing a location that’s close to your users will help to optimize the speed of your platform.
- Hosted, managed, or SaaS. Cloud storage will provide the most opportunity, but your goals and budget constraints will help to determine what’s right for your business needs.
- Public, private, hybrid, or multi-cloud. Depending on a number of factors like your security needs, you may choose to host some of your data on-prem or on a private cloud server.
Data Warehouse Naming Conventions Best Practices
How you choose to structure your naming conventions is one nexus of the migration process that can set you up for success or headaches down the road.
The right naming conventions can mean the difference between success and headaches.
Mental cues, version numbers, and other clever breadcrumbs might all seem like a great idea at the outset. The problem is that these cues won’t mean anything to your business users. Instead of making data easier to find, mental cues and esoteric naming conventions will slow productivity for users, limiting return from a new warehouse and increasing the burden on data engineers to locate datasets or troubleshoot dashboards.
Instead of naming datasets from the engineering team’s perspective, choose naming conventions that consider business users and analysts first. Here are some helpful considerations:
- Relevance: are your naming conventions relevant to any data object?
- Consistency: are you being consistent with your naming conventions? Can analysts easily locate any and all data objects based on the naming conventions you’ve selected?
- Approachability: is the language you’re using familiar to everyone in your organization? Have you incorporated any acronyms that wouldn’t be immediately understood by a new business user? Always opt for greater clarity over fewer keystrokes.
- Underscoring: while some cloud data warehouses support spacing or hyphens in object names with the use of additional keystrokes for referencing, others don’t support them at all. Underscoring, however, is supported by all solutions without requiring additional bracketing or identification, making it an ideal solution for naming conventions.
Data Warehouse Security Best Practices
Data governance is critical to the protection of both internal and customer data. Before you begin any data warehouse migration, take the time to thoroughly review your data security protocols. Who has access to your new data warehouse? What stakeholders or analysts have access to what data and when?
This is also an opportunity to consider your backup and recovery practices and how you can mitigate the impact of data loss during the migration process.
Data Warehouse Data Validation Best Practices
Data quality issues can wreak havoc on the success of a data warehouse migration. What’s more, issues in the source data could even be amplified by a new, sophisticated system.
That’s why data validation in the form of a pre-migration data audit is the first step to preparing your data for migration. Is your data accurate? Is it fresh? Are there redundancies or changes in volume?
In order to successfully validate the quality and necessity of your data, you’ll first need to understand what data you have and exactly how it’s accessed by business users. Is your data structured? Where does it come from?
This is also an opportunity to identify if and how you’d like to phase your migration to minimize business disruption.
The audit phase stage of your data warehouse migration will naturally be one of the most time and resource intensive stages of the migration process. Fortunately, a managed data observability solution like Monte Carlo can dramatically reduce the time and resources required for the audit process by automatically identifying quality issues and managing data cleanup within a centralized system. Data observability can also automate the post-migration QA process, enabling data teams to build trust faster in a new warehouse environment.
Data Warehouse Modeling Best Practices
There’s a right way to do data modeling and there’s a wrong way to do data modeling. When it comes to data warehouse modeling best practices, there are 4 things you need to consider: grain, materialization, naming, and governance.
If you’ve been around the data modeling world for any length of time, you’ll no doubt recognize this term. The grain of a table or view defines what any given row represents—or the granularity of the table. If more detail is included, it’s considered less granular. If less detail is included, it’s considered more granular.
Let’s consider users for example. If we have a table for users of a given SaaS product, the grain might be an individual user. If the table was more complicated and included users and the features they have enabled, each user might have multiple rows based on each feature they have enabled.
When designing a new relation, you should:
- Determine the level of detail.
- Clearly title each relationship
- Double-check that each column applies to the appropriate table
When it comes to how you name your relations, the sky is really the limit. There’s no one right way to name. But whatever naming convention you choose, stick to it and apply that same structure consistently.
For relation naming, it’s helpful to use schemas to namespace relations that align to something like the source or business unit.
But whether we’re talking about relation naming or column naming, the goal should always be to encourage human-readability. Names are only as helpful as they are easily interpretable. So make those names as easy and obvious as possible.
Materialization at its most basic level is whether or not your relations will be created as a table or a view. While a table requires data engineering to pre-compute any calculations, allowing for faster query-response times, a view will enable your users to see more real-time data. And what cloud data warehouse you’re migrating to can have a big impact on what trade-offs you’re willing to make.
In general, more materialization is better for a data model. This means you’ve done as much work on the front-end as possible to empower your data consumers to leverage your warehouse effectively.
How you materialize also has a major impact on your warehouse costs. So if you see costs rising from a common table expression or you spot an expensive join, those are signals to materialize.
What data is being housed and the requirements of your business and industry will determine the level and types of governance policies you’ll need to implement. For example, healthcare providers are generally subject to HIPAA regulations pertaining to data management and usage. Working closely with your security team will help to ensure you’re compliant with all currently known regulations.
Data Warehouse Migrations To the Cloud
Whether you’re planning a data warehouse migration to cloud platforms like Snowflake from an on-prem solution or migrating from one cloud warehouse to another, it’s important to scope your project carefully and understand the ins and outs of the migration process.
Now that we’ve discussed the planning of your migration, let’s look at the migration process in a bit more detail:
Step 1. Make a copy of your data
The first step to any cloud warehouse migration is making a copy of your existing data warehouse. To do this successfully, you’ll first need to choose the right data infrastructure. An easy way to do this is by choosing a smaller data set and migrating it to multiple potential warehouses to see how they compare.
Once you’ve chosen the right infrastructure, you can begin copying your data in its entirety. Now, this can obviously become quite challenging since you may be copying hundreds of terabytes of data at any one time. Amazon and Google both offer physical hard drive transfers to simplify this process. Contact your chosen vendor to see what services they can provide.
Finally, verify the schema and format of your data prior to loading, and mark the moment in time of the exported snapshot.
2. Set up an ongoing replication process
To continue your migration, you’ll next want to set up an ongoing synchronization process to manage the replication of schemes and data. This can either be built manually or through data pipeline services like Fivetran.
3. Migrate the remainder of your infrastructure
Once you have your replication process in place, you can begin migrating the remainder of your infrastructure. Start by migrating a low-risk component of your stack like your organization’s business intelligence tooling (Looker, Tableau, etc.) This will give you a chance to get some quick wins before moving on to more technically complex and potentially world-breaking components like legacy data applications.
4. Migrate your transformations
The final step of the migration process is to recreate your data models’ transforms in your new cloud environment. Consider opting for an ELT process to get the most out of your new cloud warehouse. Tools like dbt can enable you to add a transform layer on top of your warehouse, and can even enable you to create a self-serve data platform like a data mesh in the future.
Best practices for a data warehouse migration to the cloud
Catalog your data stack and map dependencies
When you migrate your warehouse to the cloud, it isn’t just your warehouse that’s affected. Since applications naturally become entangled over time, mapping dependencies is critical to avoiding downtime during the migration process.
So, before you migrate, take stock of your entire platform. Make a list of what components are on the cloud, what components you plan to transition, and anything that will continue to be stored on-prem after the migration. Once that’s all done, create a comprehensive list of dependencies to help you re-connect components in the cloud during your migration.
Migrate in phases
Like we said previously, migrations are disruptive. But a few small disruptions are better than one really big one. To avoid the impact of protracted disruptions, plan to migrate your warehouse gradually. Identify logical phases for your cloud data warehouse migration, then move components in a way that impacts as few teams as possible for as little time as possible.
Backup your data
Some people say that anything that can go wrong will. I hope that’s not the case for your data migration, but it’s good to be prepared anyway. So, remember to backup all your data—and test those backups to make sure they’re working—before beginning your migration.
Stick to the plan
A well executed plan is better than an immediate win. Whether things are going better than expected or worse, resist the temptation to strategize on the fly. Migrations aren’t easy. So, be prepared for that going in, and build a strategy that takes those risks into account.
Even the simplest warehouse migration can be a mine-field of potential fail-points. So, test as much as you can as often as you can throughout the migration process. Test your backups. Test potential warehouse vendors. Test your new platform. Test to make sure that what you planned for is what you’ll get once the migration is complete.
What to know about each cloud data warehouse
Data Warehouse Migration to AWS
Amazon Redshift is one of the most common solutions for cloud warehouse users. If you’re planning a data warehouse migration to AWS, Amazon provides AWS SCT, a migration tool which automates the process of converting your data warehouse schema.
Because sources and target databases can have equivalent schemas, AWS SCT will attempt to create an equivalent schema in your target database if possible. If AWS SCT can’t create an equivalent database, it will create a report to help you convert your schema manually, including action items and estimates of the effort involved in the conversion.
AWS SCT, can currently convert the following data warehouse schemas to Amazon Redshift:
- Amazon Redshift
- Azure Synapse Analytics (version 10)
- Greenplum Database (version 4.3 and later)
- Microsoft SQL Server (version 2008 and later)
- Netezza (version 7.0.3 and later)
- Oracle (version 10.2 and later)
- Snowflake (version 3)
- Teradata (version 13 and later)
- Vertica (version 7.2 and later)
If you’re currently using an on-premises data warehouse, Amazon also offers an AWS SCT data extraction tool to extract and migrate your data.
Data Warehouse Migration to Snowflake
Snowflake is one of the most popular and versatile cloud warehouse solutions on the market today. Unlike the other solutions mentioned here, Snowflake is cloud-vendor agnostic, meaning it supports the use of platform tooling from a variety of ecosystems, and offers greater flexibility and interoperability for platform engineering teams.
Snowflake offers a professional services team to manage migrations, but you’ll need to complete a code assessment and create a migration plan before migrating.
One major consideration when planning a data warehouse migration to Snowflake is partitions. Unlike other data warehouses, Snowflake doesn’t support partitions or indexes. Instead, Snowflake automatically divides large tables into micro-partitions to calculate statistics about the value ranges of each column, which in turn, determines what’s required to run your query.
For most practitioners, the shift from indexes to micro-partitions shouldn’t be a big deal. In fact, people often choose to migrate to Snowflake specifically because this approach reduces query latency. But just to be safe, here are a few tips:
- Document your current data schema and lineage. This will be important when you have to cross-reference your old data ecosystem with your new one.
- Analyze your current schema and lineage, and determine if this structure and its corresponding upstream sources and downstream consumers make sense for how you’ll be using the data once it’s migrated to Snowflake.
- Select appropriate cluster keys. This will ensure the best query performance for your team’s access patterns.
Migrate Data Warehouse to Azure
Microsoft Azure is another cloud warehouse platform that offers robust migration assistance for prospective data teams. Arguably one of the most robust systems for migration, The Azure Migration Program is a collection of tools and resources to simplify the migration process for data teams migrating to Microsoft Azure, including:
can assist with specific aspects like security, Kubernetes on Azure, SAP on Azure, data warehouse or databases on Azure, or provide an end-to-end solution that covers all of your migration needs
- FastTrack for Azure: a support and technical data enablement program to help data teams design and deploy cloud solutions faster—including practice tools and architectural guidance from Azure engineers.
- Azure Migration Partners Center: a collection of certified partners who can support specific aspects of a migration, like security, Kubernetes, and SAP, or end-to-end migration management.
- Azure Migrate: a centralized hub to launch, run, monitor, and manage the migration process—including assessment and automation.
Google Cloud Data Warehouse Migration
There’s no question that Google BigQuery is one of the top contenders in the cloud data warehouse space. But unlike some of the other cloud data warehouses mentioned here, when it comes to migration, Google BigQuery doesn’t offer managed migration services out of the box. While Google’s migration documentation is fairly robust, the execution of your migration strategy will rely on you to select a third party migration from Google’s approved global partners and consulting service programs.
If you do end up selecting Google BigQuery as your new cloud data warehouse, we recommend selecting and scheduling a call with your migration partner as early as possible to make sure you’re fully informed as you prepare your data migration strategy.
By taking your time to develop a comprehensive data warehouse migration strategy—and leveraging a few best practices—your data team can run a warehouse migration that meets goals, builds trusts, and creates real value for your business users.
Remember that a well executed plan is better than a quick win. Limiting business disruption and maximizing your operational efficiency are your two biggest objectives during the migration process.
And don’t be afraid to leverage tools that can simplify your migration process. Automation tools like Fivertran, data observability tools like Monte Carlo, and even native migration resources can all serve to streamline your migration process and deliver a successful migration faster.
Like anything worth doing, a data warehouse migration is never easy. But with the right planning—and a few best practices—you’ll be on your way to leveraging a shiny dew cloud data warehouse in no time (ish).
Interested in learning about how data observability can simplify your data warehouse migration? Send us a note or check out our demo below!