🧠 Second Brain

Dagster

Last updated Mar 5, 2024

Dagster is one of many Data Orchestrators. It is declarative and data-aware: you describe the data assets you want to exist, and the orchestrator works out what to run to produce them.

# What is Dagster?

Dagster is an orchestrator that’s designed for developing and maintaining data assets, such as tables, data sets, machine learning models, and reports.

You declare functions that you want to run and the data assets that those functions produce or update. Dagster then helps you run your functions at the right time and keep your assets up-to-date.
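
As a rough illustration of this idea, here is a library-free Python sketch. It does not use Dagster's API: `declare_asset`, `ASSETS`, and `materialize_all` are hypothetical names invented for this example, and the order data is made up. Each function is declared as the producer of one asset, upstream assets are inferred from parameter names, and a small runner computes everything in dependency order.

```python
import inspect
from graphlib import TopologicalSorter

# Hypothetical registry (not Dagster's): asset name -> function that computes it.
ASSETS = {}


def declare_asset(fn):
    """Register `fn` as the producer of the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


@declare_asset
def raw_orders():
    # Made-up sample data standing in for a source table.
    return [
        {"customer": "a", "amount": 10.0},
        {"customer": "b", "amount": 5.0},
        {"customer": "a", "amount": 2.5},
    ]


@declare_asset
def revenue_by_customer(raw_orders):
    # The parameter name declares the dependency on the `raw_orders` asset.
    totals = {}
    for order in raw_orders:
        totals[order["customer"]] = totals.get(order["customer"], 0.0) + order["amount"]
    return totals


@declare_asset
def top_customer(revenue_by_customer):
    return max(revenue_by_customer, key=revenue_by_customer.get)


def materialize_all(assets=ASSETS):
    """Compute every declared asset, upstream assets first."""
    # Map each asset to the set of assets named in its function signature.
    graph = {name: set(inspect.signature(fn).parameters) for name, fn in assets.items()}
    values = {}
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: values[dep] for dep in graph[name]}
        values[name] = assets[name](**inputs)
    return values


if __name__ == "__main__":
    for name, value in materialize_all().items():
        print(name, "=", value)
```

Calling `materialize_all()` returns a dict with one value per asset name. The point of the sketch is only the shape of the declaration: the functions never call each other, and the run order falls out of the declared dependencies.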

Dagster is built to be used at every stage of the Data Engineering Lifecycle - local development, unit tests, integration tests, staging environments, all the way up to production.

# Why Choose Dagster?

Discover the ease of migration from Apache Airflow to Dagster with this insightful YouTube video.

- For a concise introduction: Dagster Data Orchestration 10 min walkthrough - Jan 2023 - YouTube
- Understanding partitions:
  - Pedram Navid’s demonstration of partitions and backfills.
  - Learn more about dagster partition.
- Comparison with Apache Airflow:
  - Apache Airflow vs. Dagster - YouTube
  - In-depth comparison: Dagster vs Airflow - why data teams are switching
- Delve into Software-Defined Assets:
  - Rethinking Orchestration as Reconciliation: Software Defined Assets in Dagster | Elementl - YouTube
  - Further explained in Data Orchestration Trends: The Shift From Data Pipelines to Data Products

# Escaping the MDS Trap

Key insights from the launch week 2023-10-09 are available here.

# Dagster and Functional Data Engineering

In my workflow, Dagster is integral to all Python-related tasks. Its framework encourages functional programming practices and helps you write code that is declarative, abstracted, idempotent, and type-checked, which catches errors early. Dagster also simplifies unit testing and provides tools for building robust, testable, and maintainable pipelines. For more insights, see the section “The Shift From a Data Pipeline to a Data Product” in my article Data Orchestration Trends: The Shift From Data Pipelines to Data Products.
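
To make "idempotent" and "type-checked" concrete, here is a minimal sketch in plain Python, assuming a hypothetical in-memory store keyed by day. `Event`, `daily_totals`, and `write_partition` are illustrative names, not part of Dagster. The transformation is a pure function, and the write overwrites a whole partition rather than appending to it, so re-running the same day leaves the result unchanged.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    day: date
    user: str
    amount: float


def daily_totals(events: list[Event], day: date) -> dict[str, float]:
    """Pure function: the same events and day always give the same totals."""
    totals: dict[str, float] = {}
    for event in events:
        if event.day == day:
            totals[event.user] = totals.get(event.user, 0.0) + event.amount
    return totals


def write_partition(
    store: dict[date, dict[str, float]],
    day: date,
    totals: dict[str, float],
) -> dict[date, dict[str, float]]:
    """Return a new store in which the partition for `day` is overwritten.

    Overwriting (instead of appending) is what makes a re-run idempotent.
    """
    return {**store, day: dict(totals)}


if __name__ == "__main__":
    events = [
        Event(date(2024, 3, 1), "a", 1.0),
        Event(date(2024, 3, 1), "a", 2.0),
        Event(date(2024, 3, 2), "b", 5.0),
    ]
    day = date(2024, 3, 1)
    once = write_partition({}, day, daily_totals(events, day))
    twice = write_partition(once, day, daily_totals(events, day))
    print(once == twice)  # a second run for the same day changes nothing
```

Because both functions take and return plain values, they can be unit-tested without any orchestrator or database running, and a static type checker can verify the annotations.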

Learning functional programming languages has reshaped my thinking process. For those interested in integrating functional programming within Python, explore Python and Functional Programming. Origin: Simon Späti on LinkedIn: #dataengineering #idempotent #declarative

# From Imperative to Declarative

We transition from imperative to declarative programming (refer to Declarative vs Imperative). This shift is akin to the movement towards declarative entities in Frontend and DevOps. In data, the declarative entity is the Data Asset (e.g., a dashboard, table, report, or ML model).
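
One way to picture the declarative side is as reconciliation: you state which assets should exist and what each one depends on, and a runner works out what is out of date. The sketch below is a deliberate simplification under assumed names (`UPSTREAM`, `stale_assets`, integer timestamps); it is not how Dagster itself decides what to run.

```python
from graphlib import TopologicalSorter

# Declared state (hypothetical): each asset names only its own upstream assets.
UPSTREAM = {
    "raw_events": set(),
    "cleaned_events": {"raw_events"},
    "daily_report": {"cleaned_events"},
    "ml_model": {"cleaned_events"},
}


def stale_assets(upstream, materialized_at):
    """Return the assets that need a run, upstream assets first.

    An asset counts as stale if it was never materialized, if any upstream
    asset is stale, or if an upstream asset was materialized after it.
    `materialized_at` maps asset name -> timestamp of its last run.
    """
    stale = []
    for name in TopologicalSorter(upstream).static_order():
        own = materialized_at.get(name)
        deps = upstream[name]
        if (
            own is None
            or any(dep in stale for dep in deps)
            # Only reached when every dep is fresh, so each has a timestamp.
            or any(materialized_at[dep] > own for dep in deps)
        ):
            stale.append(name)
    return stale


if __name__ == "__main__":
    # raw_events was refreshed at t=3, after everything downstream (t=2).
    observed = {"raw_events": 3, "cleaned_events": 2, "daily_report": 2, "ml_model": 2}
    print(stale_assets(UPSTREAM, observed))
```

In an imperative pipeline you would instead write out the sequence of tasks to run; here the sequence is derived from the declared assets and their observed state.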

Before implementing Dagster:

- Challenges included issues like duplicated data and inconsistent intervals.

After adopting Dagster and Data Assets:

- In the asset view, each box represents a physical asset, not merely a task or operation, which differentiates it from Apache Airflow.
- This leads to decentralized dependencies, resulting in a more scalable graph.
- SQL upstream logic is integrated with the actual data assets.

This approach elevates the Modern Data Stack to a new level. For a comprehensive understanding, see Modern Data Stack.

# Conclusion

# Managing Schedules Externally

Explore external schedule management with Process Manager for Dagster.

# Building Better Analytics Pipelines

The event on 2023-05-10 offered valuable insights. Watch the full discussion here.

Pedram’s demonstration uses Steampipe. Find more details in the GitHub repository: dagster/README.md at master · dagster-io/dagster · GitHub

# Dagster Cloud

Learn more about Dagster Cloud.

# History

The project was started in 2018 by Nick Schrock, who conceived it in response to a need he identified while working at Facebook. One of Dagster’s goals has been to provide a tool that removes the barrier between pipeline development and pipeline operation. Along the way, he came to link the world of data processing with business processes.

See more on Bash-Script vs. Stored Procedure vs. Traditional ETL Tools vs. Python-Script - 📖 Data Engineering Design Patterns (DEDP).

Check out more on awesome-dagster, dagster-open-platform and devrel-project-demos.