🧠 Second Brain




Last updated Dec 6, 2023

Going one step further: once you have chosen one of the technologies above, you will most likely need to handle intermediate steps between systems. For example, you have to prepare, wrangle, clean, or copy data from one system or format to another. This is especially true for unstructured data, which ultimately has to be brought into a structured form. To keep the overview and handle all these challenging tasks, you need an Orchestrator and some cloud-computing frameworks, which I explain in the following two chapters to complete the full architecture.

At heart, orchestrators do these things:

Making orchestrators experts on:

Traditional Orchestrators focus on tasks. Newer generations, e.g. Dagster, focus on Data Assets and Software-Defined Assets, which makes scheduling and orchestration much more powerful. For more, see Dagster. This ties in with the Modern Data Stack.
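To make the asset-centric idea concrete, here is a minimal sketch in plain Python. It is not Dagster's API: the `asset` decorator, the `ASSETS` registry, and the `materialize` helper are hypothetical names invented for illustration, and the sample order data is made up. The point it tries to show is that you declare *what* data should exist and what it depends on, and the execution order follows from those declarations instead of being wired up task by task.

```python
import inspect

# Hypothetical registry: asset name -> function that computes it.
ASSETS = {}


def asset(fn):
    """Register fn as a data asset named after the function (illustrative only)."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # Made-up sample data standing in for an extract from a source system.
    return [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}, {"id": 3, "amount": -5}]


@asset
def cleaned_orders(raw_orders):
    # The upstream asset is declared simply by naming it as a parameter.
    return [order for order in raw_orders if order["amount"] > 0]


@asset
def revenue(cleaned_orders):
    return sum(order["amount"] for order in cleaned_orders)


def materialize(name, cache=None):
    """Compute an asset after recursively computing its upstream assets.

    Assumes the dependency graph has no cycles; results are cached per call
    so each asset is computed at most once.
    """
    cache = {} if cache is None else cache
    if name not in cache:
        upstream = inspect.signature(ASSETS[name]).parameters
        inputs = {dep: materialize(dep, cache) for dep in upstream}
        cache[name] = ASSETS[name](**inputs)
    return cache[name]


print(materialize("revenue"))  # 100
```

Asking for `revenue` is enough: the sketch works out that `cleaned_orders` and `raw_orders` must be computed first. A real asset-based orchestrator adds scheduling, persistence, and metadata on top of this kind of declaration.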

# What is an Orchestrator

See What is an Orchestrator.

# The role of Orchestration is Mastering Complexity

See key features in RW Building Better Analytics Pipelines

# Tools

After you have chosen your group, and even the technology you want to go with, you will want an Orchestrator. Orchestration is one of the most critical tasks, and it is forgotten most of the time.

Read more on Data Orchestration Trends: The Shift From Data Pipelines to Data Products | Airbyte.

# When to use which tools

# Evolution of Tools

From Data Orchestration Trends: The Shift From Data Pipelines to Data Products

Traditionally, orchestrators focused mainly on tasks and operations, to reliably schedule workflow computations and run them in the correct sequence. The best example is the first orchestrator out there, cron. Unlike crontabs, modern tools need to integrate with the Modern Data Stack.
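As a rough illustration of the task-centric view, the sketch below runs three tasks in dependency order using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+). The task names (`extract`, `transform`, `load`) and the `TASKS` / `DEPENDS_ON` tables are assumptions made up for this example. Real orchestrators add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

run_log = []  # records the order in which tasks actually ran


def extract():
    run_log.append("extract")


def transform():
    run_log.append("transform")


def load():
    run_log.append("load")


# Hypothetical task table for this sketch.
TASKS = {"extract": extract, "transform": transform, "load": load}

# Each task maps to the set of tasks that must finish before it starts.
DEPENDS_ON = {"transform": {"extract"}, "load": {"transform"}}


def run_all():
    """Run every task once, predecessors first."""
    for name in TopologicalSorter(DEPENDS_ON).static_order():
        TASKS[name]()


run_all()
print(run_log)  # ['extract', 'transform', 'load']
```

Note that the orchestrator here only knows about tasks and their order. It has no notion of what data each task produces, which is the gap the asset-based tools aim to close.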

To understand the complete picture, let’s explore where we came from before Airflow and the other widely discussed orchestrators of today.

  1. In 1987, it started with the mother of all scheduling tools, (Vixie) cron
  2. to more graphical drag-and-drop ETL tools around 2000 such as Oracle OWB, SQL Server Integration Services, Informatica
  3. to simple orchestrators around 2014 with Apache Airflow, Luigi, Oozie
  4. to modern orchestrators around 2019 such as Prefect, Kedro, Dagster, or Temporal

If you are curious and want to see the complete list of tools and frameworks, I suggest you check out the Awesome Pipeline List on GitHub.

# Which tool

As of 2022-09-21:

Or, as others said in this Tweet, I’d use:

Also heard from Nick in the podcast Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster | Data Engineering Podcast.

# What language does an Orchestrator speak?

See What language does an Orchestrator speak.

References: Python, What is an Orchestrator, Why you need an Orchestrator, Apache Airflow