Modern Data Stack
The Modern Data Stack (MDS) comprises a suite of open-source tools designed for end-to-end analytics. This includes data ingestion, transformation, machine learning, and integration into a columnar data warehouse or lake solution, all complemented by an analytics BI dashboard backend. The stack’s versatility allows extensions for data quality, data cataloging, and more.
MDS aims to enable data insights using the best-suited tools for each process. It’s worth noting that “Modern Data Stack” is a relatively new term, with its definition still evolving.
A burgeoning term, ngods (new generation open-source data stack), has emerged. Previously, I’ve referred to this concept as the Open Data Stack Project. Additionally, Dagster introduced the term DataStack 2.0 in a recent blog post. Open Data Stack is my own definition of it.
Closed Source vs Open Source
Closed Source examples: dbt, Looker, Snowflake, Fivetran, Hightouch, Census
Open Source alternatives: airbyte, dbt, dagster, Superset, Reverse-ETL?
Modern Data Stack on a Laptop
# Why the Modern Data Stack?
A perspective from Reddit highlights the shift in data warehousing and analytics. It underscores the reduced need for extensive teams and infrastructure, thanks to new tools that streamline data management and reporting. Particularly for small and mid-sized companies, MDS offers a competitive edge in data handling, allowing even a single data engineer to manage vast datasets efficiently.
A notable article discussing Lakehouse, Metrics Layer, and Clickhouse:
The Next Cloud Data Platform | Greylock
# Integrating with Dagster
The downside of MDS is the unbundling of Bundling vs Unbundling- Monolith Data vs Microservices, but Dagster helps integrate the full data stack together:
Dagster elevates the Modern Data Stack:
Explore more about its power with Dagster and Data Assets.
# A comment I made on Social
From there, I would integrate additional open-source tools based on specific needs: Spark for processing, Delta Lake for data lake formatting and ACID Transactions, Amundsen for data cataloging, and Great Expectation for data quality, among others. For smaller projects, DuckDB is suitable for local OLAP scenarios, while Kubernetes and DevOps provide scalability.
Feel free to reach out for further discussion or clarifications.
# Further Links
# Modern Data Stack Alternatives
I’m with you at least 2/3 of the way. My preferred stack is PRQL + DuckDB + Dagster. I evaluated the space for work at my current company (was originally only DE, handling ingests from ~300 sources across various systems, on order of ~1k downstream tables in dbt + hundreds of dashboards + a handful of business-critical data+app features; now leading a small team).
I came away ranking dagster first, prefect second, everything else not close. IMO dagster wins fundamentally for data engineers bc it picks the right core abstraction (software defined assets) and builds everything else around that. Prefect for me is best for general non-data-specfic orchestration as a nearly transparent layer around existing scripts. PRQL as a DuckDB Extension | Hacker News
# Other Data Stacks
Ask Simon | astorik
References: Cloud Data Warehouses Data Orchestrators Dagster Amundsen Data Catalog Python dbt Superset Metrics Layer Kubernetes Use closed-source if you don’t have the developers or the time Ascend.io Palantir Foundry Wiki