Data Catalog

Last updated Nov 4, 2024

In our data-driven world, the volume of data is expanding at an unprecedented rate. By one widely cited estimate, 90% of the world’s data has been generated in just the past two years. Managing and organizing this rapidly growing data can be daunting. This is where a Data Catalog becomes essential.

A Data Catalog serves as a centralized repository, making metadata about your data searchable. In an era dominated by Data Lakes and various data storage solutions, the ability to efficiently locate your data is crucial. Think of it as a Google Search for your internal Metadata.
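To make the "search for your metadata" idea concrete, here is a minimal sketch of an in-memory catalog. Everything in it is an assumption for illustration (the `CatalogEntry` fields, the `DataCatalog` class, the example bucket paths); real catalogs persist their metadata and index far more of it.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata about one dataset (all field names are illustrative)."""
    name: str
    location: str
    owner: str
    description: str = ""
    tags: list[str] = field(default_factory=list)


class DataCatalog:
    """A toy in-memory catalog: register metadata, then search it."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Entries are keyed by dataset name; re-registering overwrites.
        self._entries[entry.name] = entry

    def search(self, query: str) -> list[CatalogEntry]:
        """Return entries whose name, description, or tags contain the query (case-insensitive)."""
        q = query.lower()
        hits = []
        for entry in self._entries.values():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if q in haystack:
                hits.append(entry)
        return sorted(hits, key=lambda e: e.name)


# Hypothetical datasets, made up for this example.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    "orders", "s3://example-bucket/orders/", "sales-team",
    "Customer orders, one row per order", ["sales", "pii"],
))
catalog.register(CatalogEntry(
    "web_events", "s3://example-bucket/events/", "platform-team",
    "Raw clickstream events", ["web"],
))

for hit in catalog.search("sales"):
    print(hit.name, hit.location)
```

The point is only that the catalog stores descriptions of data, not the data itself, so a search returns where a dataset lives and who owns it.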

For those interested in the evolution of Data Catalogs, a fascinating starting point is the 2017 paper on Data Context Service, which provides valuable insights into their origins.

For a comprehensive overview of available tools, check out the Awesome Data Discovery and Observability compilation on GitHub. Another notable resource is Sarah Krasnik's Choosing a Data Catalog, which offers guidance on selecting an appropriate one.

# Tools

# Tools and Features


Image from GitHub - opendatadiscovery/awesome-data-catalogs

# Minimal Data Catalog with Orchestration

Sometimes, a data catalog is just a list of your tables (in S3 or a database). An orchestrator can be a great place to keep such lists, along with a ton of metadata. See Dagster, for example: it integrates with most data tools and supports column lineage (source: bsky).
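As a rough sketch of the "just a list of your tables" idea, the snippet below builds such a list from a database using only Python's standard `sqlite3` module. It does not use Dagster or any orchestrator API. The `list_tables` helper and the example tables are hypothetical, and SQLite stands in for whatever store you actually use.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[dict]:
    """Build a minimal catalog: one record per table with its columns and row count."""
    catalog = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        catalog.append({"table": table, "columns": columns, "row_count": row_count})
    return catalog


# Hypothetical example database, created in memory for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")
conn.execute("INSERT INTO customers (email) VALUES ('a@example.com')")

tables = list_tables(conn)
for entry in tables:
    print(entry["table"], entry["row_count"], [c["name"] for c in entry["columns"]])
```

An orchestrator can produce a comparable listing as a side effect of running the pipelines that create these tables, which is why it is a natural home for this kind of minimal catalog.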


References: Unity Catalog
Created 2022-02-19