A data catalog is a constantly updated inventory of an organization's data assets. It uses metadata to create a picture of the data, the relationships between data assets from diverse sources, and the processing that takes place as data moves through systems.
Data catalogs are important today because they allow users of varying types to access useful data quickly and effectively and can help team members collaborate and maintain consistent organization-wide data definitions.
Additionally, data catalogs can centrally enforce rules of access, security, and governance, all in a way that is both automatic and visible.
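To make the idea of a metadata-driven inventory concrete, here is a minimal Python sketch of what a catalog might hold for each asset. The `CatalogEntry` and `Catalog` classes, their fields, and the sample assets are all hypothetical illustrations, not the data model of any product surveyed here.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data asset."""
    name: str
    source: str                                       # system the asset lives in
    owner: str
    description: str = ""
    tags: set[str] = field(default_factory=set)
    upstream: set[str] = field(default_factory=set)   # assets this one is derived from


class Catalog:
    """Hypothetical in-memory inventory keyed by asset name."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Return names of assets whose name, description, or tags mention the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)
        )


catalog = Catalog()
catalog.register(CatalogEntry("raw.orders", "postgres", "data-eng", "Raw order events"))
catalog.register(
    CatalogEntry(
        "analytics.revenue",
        "snowflake",
        "finance",
        "Daily revenue",
        tags={"certified"},
        upstream={"raw.orders"},
    )
)
revenue_hits = catalog.search("revenue")
certified_hits = catalog.search("certified")
```

Real catalogs harvest these records automatically from connected systems rather than registering them by hand; the sketch only shows the kind of metadata that makes search and relationship tracking possible.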
There are many options to choose from when considering a data catalog. Here is an overview of some popular products compatible with the modern data stack.
The data catalogs surveyed below are:
- Alation Data Catalog
- Alteryx Connect
- Ataccama
- Atlan
- Castor
- Coginiti Premium
- Collibra Data Catalog
- data.world
- erwin Data Catalog
- Informatica Enterprise Data Catalog (EDC)
- Metaphor
- Secoda
- Select Star
- Stemma
- Talend Data Catalog
- Zeenea
Alation Data Catalog
Alation Data Catalog relies on its Behavioral Analysis Engine, which applies artificial intelligence and machine learning. Popularity-driven relevancy brings the most useful information forward, and in-workflow governance helps maintain data policies.
The containerized architecture speeds up data onboarding and shortens time-to-analysis.
Alation also supports multiple deployment styles, giving organizations the option of managing data themselves or having it remotely managed on the cloud, or other options in between. Alation’s Open Data Quality Initiative allows smooth data sharing between sources.
Alteryx Connect
With Alteryx, you can create workflows without needing to code by using the provided automation building blocks. Alteryx allows you to integrate with more than 80 data sources, including spreadsheets, cloud sources, and many more, and data can be extracted from semi- and unstructured sources like PDFs.
In turn, Alteryx can output to multiple different tools and the Alteryx SDK can be leveraged to embed functionality into a variety of interfaces.
Ataccama
Ataccama has created a product that helps data engineers, data stewards, and business users collaborate by sharing information, questions, and discussions.
Governance can be handled at a granular level, and access control becomes part of the custom workflow. Large volumes of data from various sources can be connected and processed, and AI-driven algorithms detect business rules and assign data quality rules automatically.
With Ataccama, AI detects related and duplicate datasets. Data systems are monitored continuously and automatically without the need to set rules manually, and data can be tracked as it moves and is transformed through the system. As data structures change in connected systems, the changes are automatically captured and imported into the data catalog.
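As a rough illustration of what capturing structural changes involves, the sketch below compares two snapshots of a table's columns and reports the differences. This is a generic technique under assumed inputs, not a description of how Ataccama implements it; the `diff_schema` function and the sample schemas are hypothetical.

```python
def diff_schema(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {column: type} snapshots and report what changed."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Columns present in both snapshots whose declared type differs.
    retyped = sorted(c for c in old.keys() & new.keys() if old[c] != new[c])
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical snapshots taken on two consecutive harvests of the same table.
previous = {"id": "int", "email": "varchar", "signup": "date"}
current = {"id": "bigint", "email": "varchar", "country": "varchar"}
changes = diff_schema(previous, current)
```

A catalog that runs a comparison like this on every harvest can update its entries and notify owners without anyone defining rules by hand.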
Ataccama allows business users to create and visualize data stories and to create custom metrics and embed them. Ataccama can be deployed anywhere and can be customized without coding via a metadata-based configuration.
Atlan
Atlan compares itself to Netflix for data, supporting multiple experiences for different kinds of users’ needs through its use of Personas. Each user has a customized homepage, custom metadata, and access to data curated to their workflows.
Atlan’s Purposes allow you to create policies and grant access to data assets by business domains and project context.
Atlan’s Compliance controls access to sensitive assets, which can also be auto-identified.
Atlan supports natural language search and lets you use business metrics to find linked assets across the entire data asset universe. Atlan is built on open source and all actions are API-driven. Atlan’s custom metadata builder has a no-code interface and makes it easy to share with other users. You can also collaborate and communicate using common communication and workflow tools and plug-ins without leaving Atlan.
Castor
Castor boosts productivity with powerful search across data assets and by letting team members reuse each other's queries. Documentation can be largely automated, and it propagates with lineage. AI maps sensitive data so that controls can be placed around access to it.
Column-level, cross-system automated data lineage allows users to track data flows between systems and allows impact analysis in times of change. Data can also be cataloged with business context to handle questions that may arise.
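Impact analysis over column-level lineage amounts to walking a dependency graph downstream from the column that is about to change. The sketch below shows that idea with a breadth-first traversal; the `LINEAGE` mapping, the column names, and the `downstream` function are hypothetical and do not reflect Castor's internals.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns in its list.
LINEAGE = {
    "crm.customers.email": ["warehouse.dim_customer.email"],
    "warehouse.dim_customer.email": [
        "bi.churn_report.contact",
        "ml.features.email_domain",
    ],
    "bi.churn_report.contact": [],
    "ml.features.email_domain": [],
}


def downstream(column: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every column affected by a change to `column`."""
    seen = {column}          # guards against cycles and repeat visits
    queue = deque([column])
    impacted: list[str] = []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


impacted = downstream("crm.customers.email", LINEAGE)
```

Here a change to the CRM email column would touch the warehouse dimension first, then the report and the feature table that depend on it.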
Metadata transparency allows users on different teams access to data knowledge so it doesn’t get siloed in data teams. Creating a single repository of data definitions, both automatically and collaboratively, allows all users in the organization to work from the same foundations.
Coginiti Premium
Coginiti Premium is collaboration software by the company formerly known as Aginity. It is a SQL-forward tool that allows various teams and individuals to create, find, share, and re-use code.
Data Engineers are able to create reusable components that work against any data platform. Business users have an all-in-one tool to explore and analyze data. SQL code is reusable and shareable, creating opportunities for collaboration instead of rework.
As changes are made, dependencies are visible and updated automatically. Security and sharing controls are managed centrally through flexible controls.
Collibra Data Catalog
Collibra Data Catalog can monitor data quality and pipeline reliability against more than 40 databases and file systems, allowing data teams to react immediately to detected issues.
A core of data auto-validation and automatic discovery combined with the protection of sensitive data allows immediate responsiveness to issues. Machine learning helps create automatic workflows that support user collaboration.
Collibra allows you to assign roles and responsibilities to users and create and enforce data policies across your organization with a no-code policy builder. Collibra’s native lineage harvesters extract and maintain lineage automatically and can be visible and accessible to all.
data.world
data.world uses a knowledge graph for easy, visual access to data discovery, governance, and analysis. data.world features a friendly UI with easy search that returns the most relevant concepts from the knowledge graph.
data.world enables federated queries, which allow users to explore and join data regardless of where the data is hosted.
Queries can return virtual data, with live connections, across data sets, giving users crucial feedback when building and modifying queries. data.world’s Eureka Automations allow you to automate imports and auto-generate a business glossary and relationships.
erwin Data Catalog
erwin Data Catalog creates a central metadata repository with full versioning and change management that represents all the connected data that has been automatically harvested and cataloged. erwin produces data lineage and enables IT teams to enforce data governance.
Data quality assessment can be automated and data quality tools allow issues to be noticed earlier for faster remediation. Data governance processes are integrated.
Informatica Enterprise Data Catalog (EDC)
Informatica’s Enterprise Data Catalog uses AI-powered automation to catalog your data. Lineage is automated to show how data moves through systems.
Data quality rules, metrics, and scorecards allow you to understand your data quality and relationships. Data intelligence is created and shared collaboratively to ensure quality and trust.
Metaphor
Metaphor takes a modern approach to metadata by creating a social environment for data consumption: social hashtags in the data, social posts for sharing information, and an automated live wiki for accessing documentation.
Users can navigate through levels of data schema, run Google-like searches, or use specially crafted search operators to find what they are looking for. Popularity indicators help business users find the most in-demand information, and governance tags can be used to structure data and define its access.
Metaphor is a SOC2-compliant SaaS solution that can be integrated into your data stack and is scalable to your environment.
Secoda
The name Secoda comes from “searchable company data,” and Secoda strives to make data exploration useful and intuitive for everyone. Secoda offers a no-code path to centralize your data knowledge and can easily scale with you as you add new sources.
Secoda provides automatic documentation and data knowledge. Data requests can be handled directly inside Secoda and can be referenced by others later.
Secoda’s taggable knowledge documents combine executable queries, charts, and additional team knowledge. Secoda offers automated lineage that can be further enhanced manually or with the Secoda Lineage API.
Select Star
Select Star is a modern data discovery platform built for the cloud. Its highly automated platform and intuitive UI deliver insights into your data model so data engineers and non-technical stakeholders can easily understand the context behind their data. With native integrations to popular data warehouses, ETL, and BI tools, you can set up your catalog in under an hour.
Select Star automates lineage, ERDs, and documentation / tag propagation, limiting the manual effort required for curating your data. It also provides a universal search which uses popularity to help surface the most relevant results across all of your data sources.
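One simple way to picture popularity-driven search is to filter assets by the query text and then order the matches by how often each asset is used. The sketch below assumes a hypothetical `query_count` usage metric and made-up asset names; it is not Select Star's actual ranking logic.

```python
def rank_results(query: str, assets: list[dict]) -> list[str]:
    """Return names of assets matching the query, most-queried first."""
    q = query.lower()
    matches = [a for a in assets if q in a["name"].lower()]
    # Sort by usage count (descending), then by name so ties are deterministic.
    matches.sort(key=lambda a: (-a["query_count"], a["name"]))
    return [a["name"] for a in matches]


# Hypothetical assets with usage counts harvested from query logs.
ASSETS = [
    {"name": "staging.orders_tmp", "query_count": 3},
    {"name": "analytics.orders", "query_count": 412},
    {"name": "analytics.customers", "query_count": 250},
    {"name": "archive.orders_2019", "query_count": 12},
]
top = rank_results("orders", ASSETS)
```

With this ordering, the heavily used analytics table appears ahead of the archive and staging tables that merely share the same keyword.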
Select Star’s open API makes it easy to programmatically manage your data or integrate with other tools, and permission-based access control gives data teams full governance of their metadata.
Stemma
Stemma facilitates a self-serve data culture, with automated documentation supporting trustworthiness and ease of use. Change management is handled with visibility and communication to downstream users.
Stemma integrates with tools and workflows to become an integral part of user business processes.
Talend Data Catalog
Talend Data Catalog automatically discovers and classifies data and makes it easy for users to search and access what they need.
Talend Data Catalog can be deployed in the cloud, on-premises, or hybrid and can integrate data from any source. It features end-to-end data lineage and custom object and role-based security.
Zeenea
Zeenea is a SaaS solution that is easily scalable and can be connected to any data source. Its physical and logical metamodel allows you to visualize and document your data and its relations.
Zeenea allows different ways for users to find the data they are looking for: a simple keyword search with a smart filtering system or direct catalog browsing. Data lineage can be examined through a user-friendly lineage graph which allows for increased trust throughout the organization in the available data.
Zeenea also provides traceability for compliance reports, and its business glossary keeps terminology consistent throughout the organization.
A data catalog is only as good as the data it catalogs
Some of the best data teams invest in data observability before starting a data catalog initiative. Data comes in fast and messy, and it only takes one instance of conflicting or missing data to lose trust in a data catalog, kicking off a time-to-value death spiral.
Setting up a data observability solution like Monte Carlo first acts as an adoption accelerant by providing insights into the health, usage, and lineage of your data.
Each dataset that is cataloged and discovered can then be labeled and certified, conveying the appropriate level of support and, therefore, the trust it should engender.