🧠 Second Brain

Apache Airflow

Last updated Feb 9, 2024

As outlined in Orchestrators, choosing how workflows are scheduled and monitored is a pivotal decision. While there are numerous orchestrators, Airflow remains the most prevalent and widely adopted.

Traditionally, ETL tools such as Microsoft SQL Server Integration Services (SSIS) dominated the scene, serving as hubs for data transformation and cleaning, as well as Normalization processes.

However, contemporary architectures demand more. The value of code and data transformation logic now extends beyond their immediate functional use, proving essential to other data-informed individuals within an organization.

I highly recommend delving into Maxime Beauchemin’s piece on Functional Data Engineering — a modern paradigm for batch data processing for a deeper understanding of modern data pipelines.

# What is Apache Airflow?

Apache Airflow™ is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.

Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is deployable in many ways, varying from a single process on your laptop to a distributed setup to support even the biggest workflows.

Apache Airflow Core includes the webserver, scheduler, CLI, and the other components needed for a minimal Airflow installation.

# Paradigms: Workflows as code

The main characteristic of Airflow workflows is that all workflows are defined in Python code. “Workflows as code” serves several purposes:

  1. Dynamic: pipelines are configured as Python code, allowing for dynamic pipeline generation.
  2. Extensible: the framework contains operators to connect with numerous technologies, and every component is extensible.
  3. Flexible: workflow parameterization is built in, leveraging the Jinja templating engine.
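To make this concrete, here is a minimal sketch of a workflow defined as code with the TaskFlow API (Airflow 2.x); the DAG id, schedule, and task bodies are illustrative placeholders rather than anything from a real pipeline:

```python
from datetime import datetime

from airflow.decorators import dag, task


# Minimal "workflows as code" sketch -- DAG id, schedule, and task bodies
# are illustrative placeholders.
@dag(
    dag_id="example_workflow_as_code",
    schedule="@daily",  # "schedule_interval" before Airflow 2.4
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def example_workflow_as_code():
    @task
    def extract():
        # Stand-in for pulling raw records from a source system.
        return [1, 2, 3]

    @task
    def transform(records):
        # Stand-in for cleaning/aggregating the records.
        return sum(records)

    @task
    def load(total):
        print(f"Loaded total: {total}")

    # Dependencies are expressed directly in Python.
    load(transform(extract()))


example_workflow_as_code()
```

Because the whole pipeline is a Python module, it can be reviewed, versioned, and tested like any other code.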

# Why Airflow?

Airflow is a batch workflow orchestration platform. Unlike alternatives such as Dagster or Kestra, which focus on data-aware orchestration, Airflow is primarily designed to manage tasks and workflows.

The Airflow framework contains operators to connect with many technologies and is easily extensible to connect with a new technology. If your workflows have a clear start and end, and run at regular intervals, they can be programmed as an Airflow DAG.

If you prefer coding over clicking, Airflow is the tool for you. Workflows are defined as Python code, which means:

  1. Workflows can be stored in version control, so you can roll back to previous versions.
  2. Workflows can be developed by multiple people simultaneously.
  3. Tests can be written to validate functionality.
  4. Components are extensible, letting you build on a wide collection of existing components.

Rich scheduling and execution semantics enable you to easily define complex pipelines, running at regular intervals. Backfilling allows you to (re-)run pipelines on historical data after making changes to your logic. And the ability to rerun partial pipelines after resolving an error helps maximize efficiency.
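As a hedged illustration of those semantics: a DAG with `catchup=True` creates a run for every schedule interval between its `start_date` and the present, which is how backfills over historical data are expressed; past intervals can also be re-run explicitly with the `airflow dags backfill` CLI command. The DAG id and task below are illustrative.

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context


@dag(
    dag_id="example_backfillable_dag",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=True,  # create runs for all missed intervals since start_date
)
def example_backfillable_dag():
    @task
    def process():
        # "ds" is the logical date of the interval being (re)processed;
        # keying all reads/writes on it keeps reruns idempotent.
        ds = get_current_context()["ds"]
        print(f"Processing partition {ds}")

    process()


example_backfillable_dag()
```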

Read more in the Airflow Documentation.

# We are all using Airflow wrong

# Airflow Commands

To obtain the currently configured executor (supported: LocalExecutor, CeleryExecutor, KubernetesExecutor, LocalKubernetesExecutor, CeleryKubernetesExecutor):

```bash
airflow config get-value core executor
```
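For reference, the executor that command reports is set in the `[core]` section of `airflow.cfg`, or via the corresponding environment variable. A minimal sketch, with an illustrative choice of executor:

```ini
# airflow.cfg -- minimal sketch; pick one of the supported executors
[core]
executor = LocalExecutor

# Environment-variable equivalent:
#   AIRFLOW__CORE__EXECUTOR=LocalExecutor
```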

# Airflow Operators

# PythonOperator

The Airflow PythonOperator is optimal when the business logic and code are housed within the Airflow DAG directory. The PythonOperator facilitates the import and execution of these components.

```
airflow
    \__ dags
        \__ classification_workflow.py
        \__ tweet_classification
            \__ preprocess.py
            \__ predict.py
            \__ __init__.py
    \__ logs
    \__ airflow.cfg
```
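A hedged sketch of how `classification_workflow.py` could wire these modules together with the PythonOperator; the `run()` entry points in `preprocess` and `predict` are hypothetical placeholders:

```python
# classification_workflow.py -- illustrative sketch; the run() functions
# in the imported modules are hypothetical entry points.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Importable because tweet_classification lives inside the DAG folder,
# which Airflow adds to sys.path.
from tweet_classification import predict, preprocess

with DAG(
    dag_id="tweet_classification",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    preprocess_task = PythonOperator(
        task_id="preprocess",
        python_callable=preprocess.run,  # hypothetical
    )
    predict_task = PythonOperator(
        task_id="predict",
        python_callable=predict.run,  # hypothetical
    )

    preprocess_task >> predict_task
```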

# Pros

  1. Ideal when the code is in the same repository as Airflow.
  2. User-friendly and straightforward.
  3. Efficient for smaller teams.

# Cons

  1. Tightly couples Airflow code with business logic.
  2. Changes in business logic necessitate Airflow code redeployment.
  3. Sharing a single Airflow instance across multiple projects becomes challenging.
  4. Limited to Python code.

# DockerOperator

Caution Advised

The DockerOperator is becoming obsolete; it is recommended to opt for the KubernetesPodOperator instead. As highlighted in this StackOverflow discussion, “The real answer is to use the KubernetesPodOperator. DockerOperator will soon lose its functionality with the phasing out of dockershim.”

The DockerOperator in Airflow packages the business logic and code inside a Docker image. Upon execution, Airflow:

  1. Fetches the designated image.
  2. Starts a container.
  3. Executes the given command.

This requires an active Docker daemon on the worker machine.

```python
from airflow.providers.docker.operators.docker import DockerOperator

DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='gcr.io/project-predict/predict-api:v1',  # must be a pullable Docker image reference
    auto_remove=True,
    docker_url='unix://var/run/docker.sock',
    command='python extract_from_api_or_something.py'
)
```

# Pros

  1. Effective for cross-functional teams.
  2. Compatible with non-Python projects.
  3. Ideal for Docker-centric infrastructures.

# Cons

  1. Requires Docker on the worker machine.
  2. High resource demand on the worker machine when running multiple containers.

# KubernetesPodOperator

The KubernetesPodOperator places the business logic and code within a Docker image. During execution, Airflow launches a worker pod, which pulls the specified image and executes the given command.

```python
# Import path for recent cncf-kubernetes provider versions; older versions
# expose it under airflow.providers.cncf.kubernetes.operators.kubernetes_pod.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id='classify_tweets',
    name='classify_tweets',
    cmds=['python', 'app/classify.py'],
    namespace='airflow',
    image='gcr.io/tweet_classifier/dev:0.0.1',
)
```

# Pros

  1. Facilitates collaboration across different functional teams.
  2. Enables sharing a single Airflow instance across various teams without complications.
  3. Decouples DAGs from business logic.

# Cons

  1. Presents complexity in infrastructure due to its reliance on Docker and Kubernetes.

# Common Mistakes

Here are frequent errors observed in DevOps and Data teams when implementing Airflow:

  1. DAG Folder Location: Often, the DAG folder is part of the main Airflow infrastructure repository. This necessitates a full Airflow restart for any DAG modification, potentially causing job failures and inconvenience. Ideally, your DAG folder should live separately, as configured in your airflow.cfg (see the configuration sketch after this list).
  2. Local Log Folder Configuration: A common oversight I encountered around 2015 or 2016 was configuring the log folder locally. In my case, this resulted in an EC2 instance crash due to log overload after six months. This issue persists in some setups today.
  3. Non-Backfillable DAGs: One of Airflow’s advantages is its ability to rerun failed past DAG runs without disrupting current data. Ensuring your DAGs support easy maintenance and backfilling is crucial.
  4. Lack of Scalability Preparation: With only a few DAGs deployed, you might not notice scalability issues. However, as the number of DAGs grows (20, 30, 100+), you’ll observe longer wait times in the scheduling phase. Adjusting configurations in your airflow.cfg might help, but involving DevOps for scalability solutions may become necessary.
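As referenced in mistake 1, here is a minimal sketch of pointing the DAG folder somewhere outside the Airflow infrastructure repository (the path is purely illustrative):

```ini
# airflow.cfg -- illustrative path; mount or sync your DAG repository here
[core]
dags_folder = /opt/airflow/dags

# Environment-variable equivalent:
#   AIRFLOW__CORE__DAGS_FOLDER=/opt/airflow/dags
```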

For a comprehensive discussion, refer to Mistakes I Have Seen When Data Teams Deploy Airflow.

Read more on GitHub - jghoman/awesome-apache-airflow: Curated list of resources about Apache Airflow.


References: Data Orchestrators Dagster Apache Airflow Wiki Why using Airflow Features of Airflow, OLAP