However, contemporary architectures demand more. The value of code and data-transformation logic now extends beyond its immediate functional use: it has become essential to other data practitioners across the organization.
For a deeper understanding of modern data pipelines, I highly recommend Maxime Beauchemin’s piece “Functional Data Engineering — a modern paradigm for batch data processing”.
# What is Apache Airflow?
Apache Airflow™ is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows.
Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is deployable in many ways, varying from a single process on your laptop to a distributed setup to support even the biggest workflows.
Apache Airflow Core includes the webserver, scheduler, CLI, and the other components needed for a minimal Airflow installation.
# Paradigms: Workflows as code
The main characteristic of Airflow workflows is that all workflows are defined in Python code. “Workflows as code” serves several purposes:
- Dynamic: Airflow pipelines are configured as Python code, allowing for dynamic pipeline generation.
- Extensible: The Airflow™ framework contains operators to connect with numerous technologies. All Airflow components are extensible to easily adjust to your environment.
- Flexible: Workflow parameterization is built-in leveraging the Jinja templating engine.
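The “workflows as code” idea above can be sketched with a minimal DAG. This assumes an Airflow 2.x installation; the DAG id, table names, and load function are illustrative only, and on versions before 2.4 the `schedule` argument is named `schedule_interval`:

```python
# Minimal "workflows as code" sketch (assumes Airflow 2.x is installed;
# DAG id, table names, and the load function are illustrative only).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_table(table: str) -> None:
    print(f"loading {table}")


with DAG(
    dag_id="example_dynamic_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    # Dynamic: tasks are generated from a plain Python data structure,
    # so adding a table to the list adds a task to the pipeline.
    for table in ["users", "orders", "events"]:
        PythonOperator(
            task_id=f"load_{table}",
            python_callable=load_table,
            op_kwargs={"table": table},
        )
```

Because the pipeline is ordinary Python, the task list can come from anywhere — a config file, a database query, or an API call.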
# Why Airflow?
The Airflow framework contains operators to connect with many technologies and is easily extensible to connect with a new technology. If your workflows have a clear start and end, and run at regular intervals, they can be programmed as an Airflow DAG.
If you prefer coding over clicking, Airflow is the tool for you. Workflows are defined as Python code which means:
- Workflows can be stored in version control so that you can roll back to previous versions
- Workflows can be developed by multiple people simultaneously
- Tests can be written to validate functionality
- Components are extensible and you can build on a wide collection of existing components
Rich scheduling and execution semantics enable you to easily define complex pipelines, running at regular intervals. Backfilling allows you to (re-)run pipelines on historical data after making changes to your logic. And the ability to rerun partial pipelines after resolving an error helps maximize efficiency.
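Backfilling and partial reruns are driven from the CLI. A sketch, assuming an Airflow 2.x installation and an illustrative DAG id `example_pipeline`:

```shell
# Re-run a pipeline over a historical window after changing its logic
# (assumes Airflow 2.x; "example_pipeline" is a hypothetical DAG id).
airflow dags backfill \
    --start-date 2024-01-01 \
    --end-date 2024-01-31 \
    example_pipeline

# Clear only the failed task instances so they are rescheduled:
airflow tasks clear --only-failed example_pipeline
```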
Read more in the Airflow documentation.
# We are all using Airflow wrong
- Insightful read: “We’re All Using Airflow Wrong and How to Fix It” by Jessica Laughlin (Bluecore Engineering, on Medium)
# Airflow Commands
To find the currently configured executor, query the Airflow configuration. Supported executors include SequentialExecutor, LocalExecutor, CeleryExecutor, and KubernetesExecutor.
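A sketch of the lookup, assuming the Airflow 2.x CLI is on your PATH:

```shell
# Print the executor currently configured for this installation.
airflow config get-value core executor
```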
# Airflow Operators
The Airflow PythonOperator is optimal when the business logic and code are housed within the Airflow DAG directory. The PythonOperator facilitates the import and execution of these components.
Pros:
- Ideal when the code is in the same repository as Airflow.
- User-friendly and straightforward.
- Efficient for smaller teams.

Cons:
- Tightly couples Airflow code with business logic.
- Changes in business logic necessitate Airflow code redeployment.
- Sharing a single Airflow instance across multiple projects becomes challenging.
- Limited to Python code.
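A minimal PythonOperator sketch, assuming Airflow 2.x; the DAG id and callable are illustrative, standing in for business logic that lives alongside the DAG file:

```python
# PythonOperator sketch (assumes Airflow 2.x; the DAG id and the
# transform() callable are illustrative placeholders for real logic).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def transform() -> None:
    # In practice this would import and call code from the DAG directory.
    print("running transformation")


with DAG(
    dag_id="example_python_operator",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=transform)
```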
The DockerOperator is becoming obsolete; it’s recommended to opt for the KubernetesPodOperator instead. As highlighted in this StackOverflow discussion: “The real answer is to use the KubernetesPodOperator. DockerOperator will soon lose its functionality with the phasing out of dockershim.”
The DockerOperator in Airflow manages business logic and code within a Docker image. Upon execution, Airflow:
- fetches the designated image,
- starts a container, and
- executes the given command.

Pros:
- Effective for cross-functional teams.
- Compatible with non-Python projects.
- Ideal for Docker-centric infrastructures.

Cons:
- Requires an active Docker daemon on the worker machine.
- High resource demand on the worker machine when running multiple containers.
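A DockerOperator sketch, assuming the `apache-airflow-providers-docker` package is installed and a Docker daemon is reachable on the worker; the image name and command are hypothetical:

```python
# DockerOperator sketch (assumes apache-airflow-providers-docker and a
# local Docker daemon; image name and command are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator

with DAG(
    dag_id="example_docker_operator",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    DockerOperator(
        task_id="run_in_container",
        image="my-team/etl:latest",       # hypothetical image
        command="python run_job.py",      # hypothetical entrypoint
        docker_url="unix://var/run/docker.sock",
    )
```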
The KubernetesPodOperator places business logic and code within a Docker image. During execution, Airflow initiates a worker pod, which then retrieves and executes commands from the specified Docker image.

Pros:
- Facilitates collaboration across different functional teams.
- Enables sharing a single Airflow instance across various teams without complications.
- Decouples DAGs from business logic.

Cons:
- Presents complexity in infrastructure due to its reliance on Docker and Kubernetes.
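A KubernetesPodOperator sketch, assuming the `apache-airflow-providers-cncf-kubernetes` package and access to a cluster; the namespace, image, and command are hypothetical, and note that older provider versions import from `operators.kubernetes_pod` rather than `operators.pod`:

```python
# KubernetesPodOperator sketch (assumes apache-airflow-providers-cncf-kubernetes
# and cluster access; namespace, image, and command are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="example_k8s_pod_operator",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    KubernetesPodOperator(
        task_id="run_in_pod",
        name="etl-job",
        namespace="data-jobs",            # hypothetical namespace
        image="my-team/etl:latest",       # hypothetical image
        cmds=["python", "run_job.py"],    # hypothetical entrypoint
        get_logs=True,
    )
```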
# Common Mistakes
Here are frequent errors observed in DevOps and Data teams when implementing Airflow:
- DAG Folder Location: Often, the DAG folder is part of the main Airflow infrastructure repository. This necessitates a full Airflow restart for any DAG modification, potentially causing job failures and inconvenience. Ideally, your DAG folder should be located separately, as configured in your airflow.cfg.
- Local Log Folder Configuration: A common oversight I encountered around 2015 or 2016 was configuring the log folder locally. In my case, this resulted in an EC2 instance crash due to log overload after six months. This issue persists in some setups today.
- Non-Backfillable DAGs: One of Airflow’s advantages is its ability to rerun failed past DAGs without disrupting the current data. Ensuring your DAGs support easy maintenance and backfilling is crucial.
- Lack of Scalability Preparation: Initially, as you deploy a few DAGs, you might not notice scalability issues. However, as the number of DAGs increases (20, 30, 100+), you’ll observe longer wait times in the scheduling phase. While adjusting configurations in your airflow.cfg might help, involving DevOps for scalability solutions might become necessary.
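The DAG-folder and log-folder settings behind the first two mistakes live in airflow.cfg. A minimal sketch, assuming Airflow 2.x; the paths, bucket, and connection id are placeholder assumptions:

```ini
[core]
# Keep DAGs separate from the Airflow installation itself.
dags_folder = /opt/airflow/dags

[logging]
# Ship task logs to durable remote storage instead of the local disk.
# (bucket name and connection id below are hypothetical)
remote_logging = True
remote_base_log_folder = s3://my-bucket/airflow-logs
remote_log_conn_id = aws_default
```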
For a comprehensive discussion, refer to Mistakes I Have Seen When Data Teams Deploy Airflow.