Once you have chosen one of the technologies above, you will most likely need to handle the intermediate steps in between: preparing, wrangling, cleaning, and copying data from one system or format to another. This applies especially to unstructured data, which has to be brought into a structured form at some point. To keep the overview and handle all these challenging tasks, you need an Orchestrator and some cloud-computing frameworks, which I will explain in the following two chapters to complete the full architecture.
At heart, an orchestrator does these things:
- invokes computation at the right time
- models the dependencies between computations
- tracks what computation ran
This makes orchestrators experts on:
- when stuff happens
- when stuff is going wrong
- what it takes to fix the wrong state
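The three core duties listed above can be illustrated with a minimal sketch that uses only the Python standard library. It is not how any particular orchestrator is implemented: the task names (`extract`, `clean`, `report`) and the `run_pipeline` helper are hypothetical, and "the right time" is reduced here to "as soon as all upstream tasks have finished" rather than a real schedule.

```python
from datetime import datetime, timezone
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical computations; real ones would read and write actual data.
def extract():
    return [3, 1, 2]


def clean(rows):
    return sorted(rows)


def report(rows):
    return f"{len(rows)} rows, max={max(rows)}"


TASKS = {"extract": extract, "clean": clean, "report": report}

# Models the dependencies: each task maps to the tasks it needs first.
DEPENDENCIES = {"extract": [], "clean": ["extract"], "report": ["clean"]}


def run_pipeline(tasks, dependencies):
    """Run tasks in dependency order and return (results, run_log)."""
    results, run_log = {}, []
    for name in TopologicalSorter(dependencies).static_order():
        # Invokes the computation only after its upstream tasks have finished.
        inputs = [results[dep] for dep in dependencies[name]]
        entry = {"task": name, "started_at": datetime.now(timezone.utc)}
        try:
            results[name] = tasks[name](*inputs)
            entry["status"] = "success"
        except Exception as exc:
            entry["status"] = f"failed: {exc!r}"
        # Tracks what ran, when, and how it ended.
        run_log.append(entry)
        if entry["status"] != "success":
            break  # downstream tasks would only see a wrong state
    return results, run_log


if __name__ == "__main__":
    results, run_log = run_pipeline(TASKS, DEPENDENCIES)
    for entry in run_log:
        print(entry["task"], entry["status"])
    print(results["report"])
```

The run log is what lets you answer "when did stuff happen" and "what went wrong"; a real orchestrator persists it and adds retries, schedules, and alerting on top.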
Traditional orchestrators focus on tasks. Newer generations, e.g. Dagster, focus on Data Assets and Software-Defined Assets, which makes scheduling and orchestration much more powerful. For more, see Dagster. This ties in with the Modern Data Stack.
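To make the difference between task focus and asset focus concrete, here is a toy sketch in plain Python. It borrows Dagster's vocabulary, but the `asset` decorator and `materialize` function below are small local helpers written for this illustration, not imports from Dagster, and the asset names are made up. The idea it tries to show: you declare the data you want to exist, each asset names its upstream assets as parameters, and the dependency graph follows from that.

```python
import inspect

ASSETS = {}  # asset name -> function that computes it (toy registry)


def asset(fn):
    """Register fn as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    # Hypothetical source data.
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": None}]


@asset
def cleaned_orders(raw_orders):
    return [order for order in raw_orders if order["amount"] is not None]


@asset
def revenue(cleaned_orders):
    return sum(order["amount"] for order in cleaned_orders)


def materialize(name, cache=None):
    """Compute asset `name`, first materializing whatever it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        upstream = inspect.signature(ASSETS[name]).parameters
        inputs = {dep: materialize(dep, cache) for dep in upstream}
        cache[name] = ASSETS[name](**inputs)
    return cache[name]


if __name__ == "__main__":
    print(materialize("revenue"))
```

Asking for `revenue` is enough: the sketch works out that `cleaned_orders` and `raw_orders` must be computed first. A task-focused setup would instead have you wire up "run step A, then step B" by hand.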
# What is an Orchestrator
# The role of Orchestration is Mastering Complexity
See key features in RW Building Better Analytics Pipelines
- Apache Airflow (created at Airbnb)
- Luigi (created at Spotify)
- Azkaban (created at LinkedIn)
- Apache Oozie (for Hadoop systems)
Once you have chosen your group, and even the technology you want to go with, you still need an Orchestrator. This is one of the most critical pieces, and it gets forgotten most of the time.
# When to use which tools
# Evolution of Tools
Traditionally, orchestrators focused mainly on tasks and operations, reliably scheduling and running computations in the correct sequence. The best example is the first orchestrator out there, cron. Unlike crontabs, modern tools need to integrate with the Modern Data Stack.
To understand the complete picture, let’s explore where we came from before Airflow and the other much-discussed orchestrators of today.
- In 1987, it started with the mother of all scheduling tools, (Vixie) cron
- to more graphical drag-and-drop ETL tools around 2000 such as Oracle OWB, SQL Server Integration Services, Informatica
- to simple orchestrators around 2014 with Apache Airflow, Luigi, Oozie
- to modern orchestrators around 2019 such as Prefect, Kedro, Dagster, or Temporal
If you are curious and want to see the complete list of tools and frameworks, I suggest you check out the Awesome Pipeline List on GitHub.
# Which tool
As of 2022-09-21:
- Airflow when you need task scheduling only (no data awareness)
- Dagster when you foresee higher-level data engineering problems. Dagster has more abstractions because it was designed from first principles with a holistic view in mind from the very beginning. It focuses heavily on data integrity, testing, idempotency, data assets, etc.
- Prefect if you need fast and dynamic modern orchestration with a straightforward way to scale out. They recently revamped the Prefect core as Prefect 2.0 with a new second-generation orchestration engine called Orion. It has several abstractions that make it a Swiss army knife for general task management.
- With the new Orion engine built into Prefect 2.0, Prefect is very similar to Temporal and supports fast, low-latency application orchestration
Or, as others said in this Tweet, I’d use:
- airflow for plain task-scheduling
- prefect for fast, low-latency imperative scheduling
- dagster for data-aware pipelines when you want best-in-class, but opinionated support
I also heard this from Nick in the podcast Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster | Data Engineering Podcast.