🧠 Second Brain

Orchestration

Last updated Jul 25, 2024

Delving deeper: after selecting a technology from the myriad available, you’ll inevitably need to manage intermediate levels. This is particularly true when handling unstructured data, which must be transformed into a structured format. Orchestrators and cloud-computing frameworks play a crucial role in this process, ensuring efficient data manipulation across different systems and formats. In the following chapters, I’ll explain their role in completing the full architectural picture.

At their core, orchestrators:

Orchestrators excel in:

While traditional orchestrators are task-centric, newer ones like Dagster emphasize Data Assets and Software-Defined Assets. This approach improves scheduling and orchestration, as discussed in Dagster. These advancements align with the Modern Data Stack concepts.
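To make the task-centric vs. asset-centric distinction concrete, here is a minimal plain-Python sketch. It is not Dagster’s API: the `asset` decorator, the `ASSETS` registry, and the `materialize` function are hypothetical names invented for this illustration. Each function declares the data asset it produces, and upstream dependencies are inferred from parameter names, so the execution order follows from the data instead of from hand-wired tasks.

```python
import inspect
from typing import Any, Callable, Dict, Optional

# Hypothetical registry (not a real library): asset name -> function computing it.
ASSETS: Dict[str, Callable[..., Any]] = {}


def asset(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as the definition of the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


def materialize(name: str, cache: Optional[Dict[str, Any]] = None) -> Any:
    """Compute an asset after materializing the upstream assets it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        # Parameter names double as the names of upstream assets.
        upstream = {
            dep: materialize(dep, cache)
            for dep in inspect.signature(fn).parameters
        }
        cache[name] = fn(**upstream)
    return cache[name]


@asset
def raw_orders():
    # Illustrative sample data only.
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": 5.0}]


@asset
def large_orders(raw_orders):
    return [order for order in raw_orders if order["amount"] >= 10]


@asset
def large_order_count(large_orders):
    return len(large_orders)
```

Calling `materialize("large_order_count")` resolves `large_orders` and `raw_orders` first. You ask for the asset you want, and the dependency graph decides which steps run.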

# What is an Orchestrator

For a detailed understanding, refer to What is an Orchestrator.

# The Role of Orchestration in Mastering Complexity

Explore the key features in RW Building Better Analytics Pipelines.

# Abstraction: Data Pipeline as Microservice

Abstractions let you use data pipelines as a microservice on steroids. Why? Because microservices are excellent at scaling but not as good at aligning different code services with each other. A modern data orchestrator handles everything around these reusable abstractions. You can see each task or microservice as a single pipeline with a sole purpose, everything defined in a functional data engineering way. With an orchestrator, you do not need to start from zero when you create a new microservice or pipeline.
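As a rough sketch of what “single-purpose and functional” can look like, the snippet below composes small pure functions into a reusable pipeline. The `pipeline` helper and the step names are assumptions made for this example, not part of any orchestrator. A real orchestrator adds scheduling, retries, and observability on top of this idea.

```python
from functools import reduce
from typing import Callable, List

Record = dict
Step = Callable[[List[Record]], List[Record]]


def pipeline(*steps: Step) -> Step:
    """Compose single-purpose steps into one pipeline that is itself a step."""
    def run(records: List[Record]) -> List[Record]:
        # Feed the output of each step into the next one, left to right.
        return reduce(lambda data, step: step(data), steps, records)
    return run


def drop_nulls(records: List[Record]) -> List[Record]:
    """Keep only records that carry an amount; the input is not mutated."""
    return [r for r in records if r.get("amount") is not None]


def to_cents(records: List[Record]) -> List[Record]:
    """Return new records with the amount converted to integer cents."""
    return [{**r, "amount": round(r["amount"] * 100)} for r in records]


# A new pipeline reuses existing steps instead of starting from zero.
clean_orders = pipeline(drop_nulls, to_cents)
```

Because every step is a pure function from records to records, a pipeline can be nested inside another pipeline, and re-running it on the same input gives the same result.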

More on Data Orchestration Trends: The Shift From Data Pipelines to Data Products.

# Tools


Image from dlthub.com.

Selecting a technology should be followed by choosing an Orchestrator. This crucial step often goes overlooked.

For more insights, read Data Orchestration Trends: The Shift From Data Pipelines to Data Products.

# History: Evolution of Tools

Orchestrators have evolved from simple task managers to complex systems integrating with the Modern Data Stack. Let’s trace their journey:

  1. 1987: The inception with (Vixie) cron
  2. 2000: The emergence of graphical ETL tools like Oracle Warehouse Builder (OWB), SSIS, Informatica
  3. 2011: The rise of Hadoop orchestrators like Luigi, Oozie, Azkaban
  4. 2014: The rise of simple orchestrators like Airflow
  5. 2019: The advent of modern Python orchestrators like Prefect, Kedro, Dagster, and Temporal, or even the fully SQL-based framework dbt
  6. Since then: a shift to declarative pipelines fully managed in Ascend.io, Palantir Foundry, and other data lake solutions

For an exhaustive list, visit the Awesome Pipeline List on GitHub. More on the history in Bash-Script vs. Stored Procedure vs. Traditional ETL Tools vs. Python-Script - 📖 Data Engineering Design Patterns (DEDP).


Also check GitHub Star History, even though it doesn’t tell you much.

A nice illustration by dlt in On Orchestrators: You Are All Right, But You Are All Wrong Too | dlt Docs:

# When to Use Which Tools

# Control Plane

As of 2024-07-09:
Data orchestrators are the control plane that keeps the heterogeneous data stack together. I like Dagster; even though it’s harder to start with, it “forces” you, gently :), to use good practices. For example, technical code can be separated into resources for everyone to reuse (typically maintained by data engineers), while business logic can be written by domain experts, nowadays directly next to the data assets. These are declarative and can easily be automated and versioned. Also, everything can run locally with a mocked Spark cluster just as it runs in production with Databricks, without changing a single line of DAG config; the only thing you need is to define run-configs for each environment.
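The split between resources and business logic can be sketched in plain Python. This is not Dagster’s resource or run-config API. The names `Warehouse`, `InMemoryWarehouse`, `ProductionWarehouse`, `RUN_CONFIGS`, and `run` are hypothetical, and the warehouse is only a stand-in for something like a mocked vs. real Spark cluster. What it shows is that the business logic stays identical across environments and only the injected resource changes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class Warehouse(Protocol):
    """Resource interface, typically maintained by data engineers."""
    def write(self, table: str, rows: List[dict]) -> None: ...


@dataclass
class InMemoryWarehouse:
    """Local stand-in that keeps tables in a dict (plays the role of a mock)."""
    tables: Dict[str, List[dict]] = field(default_factory=dict)

    def write(self, table: str, rows: List[dict]) -> None:
        self.tables[table] = list(rows)


class ProductionWarehouse:
    """Placeholder for a real client; intentionally left unimplemented here."""
    def write(self, table: str, rows: List[dict]) -> None:
        raise NotImplementedError("plug in the real warehouse client here")


def daily_revenue(orders: List[dict], warehouse: Warehouse) -> float:
    """Business logic: it does not know which environment it runs in."""
    total = sum(order["amount"] for order in orders)
    warehouse.write("daily_revenue", [{"total": total}])
    return total


# Assumed per-environment configuration: only the resource factory differs.
RUN_CONFIGS: Dict[str, Callable[[], Warehouse]] = {
    "local": InMemoryWarehouse,
    "prod": ProductionWarehouse,
}


def run(env: str, orders: List[dict]) -> float:
    """Pick the resource for the environment, then run the unchanged logic."""
    warehouse = RUN_CONFIGS[env]()
    return daily_revenue(orders, warehouse)
```

Switching from `run("local", orders)` to `run("prod", orders)` changes no line of `daily_revenue`. That is the property the paragraph above describes for run-configs.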

# Different Types of Orchestration

As of 2022-09-21:

Or, as others said in this Tweet, I’d use:

Also, explore insights from the podcast Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster | Data Engineering Podcast with Nick Schrock.

# Comparing Dagster with Vim

Dagster is vim for orchestration. It has a steeper learning curve; you need to learn its concepts. Initially, it’s harder, but with complex/heterogeneous data infrastructure, these concepts can save you time and money.

Take vim motions; they are hard to learn but worth every minute if you write/code all day. The same goes for orchestration: if data and managing complexity are core to your business, it’s worth having a robust, battle-tested architecture in place, and you get it out of the box with Dagster. Tweet.

Fun is another parallel with vim. To me, vim is more fun to use than VS Code (see PDE). Dagster is also more fun for data engineers, as it focuses heavily on data engineers and developer productivity.

# What language does an Orchestrator speak?

See What language does an Orchestrator speak.


References: Python, What is an Orchestrator, Why you need an Orchestrator, Apache Airflow