The Weekly ETL: How Do You “Thin Slice” a Data Pipeline?

In Monte Carlo’s Weekly ETL (Explanations Through Lior) series, Lior Gavish, Monte Carlo’s co-founder, and CTO answers a trending question on Reddit about some of data engineering’s hottest topics. 

Reddit thread can be found here

Reddit user /treacherous_tim asks how do you “thin slice” a data pipeline and if anyone has faced this challenge before? 

First, I think it’s great that data engineers are now following best practices from DevOps and software engineering, in this case, starting with an MVP of your solution before you invest in building out a larger system or service. Having been on and led software engineering teams for almost two decades, I understand how important it is to get things working as quickly as possible to show the value of your product straight out of the gate. 

In the case of building a data pipeline, you need to “slice” your data, not your solution. I would recommend building a data pipeline that runs on a subset of data, perhaps from only one source, extract it, transform it, and transport it in the requested format to where you can begin analyzing and visualizing it. This is the best way to start because you can produce something useful to analysts and data scientists as quickly as possible. The key is to manage scope and ask yourself – how can we start answering the business problem, at least partially, with a simpler analysis and simpler data? You can then layer on additional sources and additional granularity and precision in order to answer the questions more fully over time. To distill the most important requirements from the nice-to-haves, you can often benefit from collaborating with your analysts and data scientists.