The Weekly ETL: How Do You Document Your Data Assets?

In Monte Carlo’s Weekly ETL (Explanations Through Lior) series, Lior Gavish, Monte Carlo’s co-founder and CTO, answers a trending Reddit question about one of the data industry’s hottest topics.

Reddit user _Niwubo asks how data teams can set up a solution for documenting their data assets. As someone who has built cataloging initiatives from scratch, I can assure you that it’s never seamless and that it takes buy-in from your whole organization (which can be hard to get if your company isn’t data-driven).

The first thing I recommend is to evaluate whether it makes sense to build an in-house data catalog or to invest in a third-party vendor that provides one for you. Each option has pros and cons. I have seen B2C companies such as Airbnb, Netflix, and Uber build their own data catalogs to ensure that their particular needs and stack are supported. However, remember that these organizations handle insanely large amounts of data and have the engineering resources to invest in building and maintaining the solution. Also keep in mind that custom-built solutions are usually designed around a specific set of use cases, so they often can’t support every team’s needs, which can limit visibility and collaboration across the organization.

Third-party vendors such as Alation, Collibra, and Informatica offer data cataloging solutions with extensive capabilities. These tools are great for collaboration if you have strong buy-in to implement the project. One challenge, though, applies to both homegrown and vendor solutions: the amount of investment required to actually document your data. You will spend a good amount of time herding the organization to produce the documentation that powers these solutions and makes them valuable.

For those interested, my co-founder Barr recently wrote an article that goes into further detail, discussing why many data catalogs aren’t meeting the needs of the modern data stack and how a new approach, data discovery, is needed to better facilitate metadata management and data reliability.

How does your organization go about documenting data assets? Reach out to Lior Gavish with any comments or suggestions.