Awesome Apache Airflow
This is a curated list of resources about Apache Airflow. Please feel free to contribute any items that should be included. Items are generally added at the top of each section so that newer items are featured more prominently.
Contents
- Vital links
- Airflow deployment solutions
- Introductions and tutorials
- Airflow Summit 2020 Videos
- Best practices, lessons learned and cool use cases
- Books, blogs, podcasts, and such
- Slide deck presentations and online videos
- Libraries, Hooks, Utilities
- Meetups
- Commercial Airflow-as-a-service providers
- Cloud Composer resources
- Non-English resources
Vital links
- Source code (latest stable release 1.10.12)
- Documentation (also the official website)
- Confluence page
- Slack workspace
Airflow deployment solutions
- Installing Airflow on IBM Cloud - Quick and easy deployment on IBM Cloud with IBM Bitnami Charts
- Three ways to run Airflow on Kubernetes - Tim van de Keer walks through several methods for deploying Airflow on Kubernetes.
- Apache Airflow Multi-Tier Free Deployment on Azure - A free Azure Resource Manager (ARM) template by Bitnami providing a one-click solution for Airflow deployment on Azure for production use-cases.
- KubernetesExecutor Helm Chart - A lean Helm Chart using the KubernetesExecutor for a more k8s native experience and complementary KubernetesExecutor Docker Image.
- Stable Celery Helm Chart - Curated Helm Chart in the official stable chart repository.
- Puckel's Docker Image - @Puckel_'s well-crafted Docker image has become the base for many Airflow installations. It is regularly updated and closely tracks the official Apache releases.
- Kubernetes Custom Operator for Deploying Airflow - Kubernetes Custom controller (also called operator pattern) for deploying Airflow on Kubernetes.
- airflow-pipeline - Airflow Docker container that comes preconfigured for Spark and Hadoop. It can be docker pulled at datagovsg/airflow-pipeline.
- aws-airflow-stack - An AWS based Airflow cluster deployment with CeleryExecutor. Deploys after a few clicks with CloudFormation.
- kube-airflow - This repository contains both an Airflow Docker image (that appears to have been based on Puckel's work) and Kubernetes service definition. mumoshu's repository has not been recently updated, but there are numerous forks that may be based on more recent releases.
- airflow-on-kubernetes - A guide on all relevant resources, scripts and projects that relate to running Airflow on Kubernetes.
- airflow-k8s-executor-on-GKE - A detailed tutorial for deploying a scalable, low-maintenance Airflow KubernetesExecutor environment on Google Kubernetes Engine with Helm.
- airflow-cookbook - Chef cookbook for deploying Airflow.
- Running Airflow on top of Apache Mesos - Blog describing how to configure Mesos to run all of the Airflow components.
- Integrating Apache Airflow with Apache Ambari - Mykola Mykhalov walks through using Apache Ambari to configure and deploy an Airflow instance.
- Astronomer Platform - Apache Airflow as a Service on Kubernetes. For more information visit https://www.astronomer.io.
- Bitnami Airflow Docker image - A secure and up-to-date Docker image for Airflow maintained by Bitnami.
- Bitnami Airflow Scheduler Docker image - A secure and up-to-date Docker image for Airflow Scheduler maintained by Bitnami.
- Bitnami Airflow Worker Docker image - A secure and up-to-date Docker image for Airflow Worker maintained by Bitnami. A CeleryExecutor docker-compose deployment is available here.
- Distribute & deploy Apache Airflow via Python PEX files - Example repo with steps to bundle, distribute, & deploy Apache Airflow as PEX files.
- Introducing KEDA for Airflow - How to use the KEDA scaler system to enable autoscaling of Celery workers based on data stored in the Airflow metadata database.
- Airflow-Component - Lightweight installer of federated Airflow-Airflow (RabbitMQ) reference architecture on Compute node(s).
Introductions and tutorials
- Apache Airflow Monitoring Metrics - A two-part series by maxcotec on how to use existing Airflow statsd metrics to monitor your Airflow deployment on a Grafana dashboard via Prometheus. It also covers how to create custom metrics.
- Introduction to Airflow - A web tutorial series by maxcotec for beginners and intermediate users of Apache Airflow.
- ETL with Apache Airflow for Data Analysis on Transaction Data - Kimaru Thagana covers a practical ETL process with Apache Airflow, using a dummy ecommerce store's transactional, user and product data. The data is served via a Flask API.
- Start Building Better Data Pipelines With Apache Airflow 2020-Oct - Naman Gupta covers the basics of Airflow and its concepts.
- Airflow Repository Template - A boilerplate repository for developing locally with Airflow, with linting & tests for valid DAGs and plugins. Just clone and run make start-airflow to get started! Add some CI jobs to deploy your code and you're done.
- How Apache Airflow Distributes Jobs on Celery workers - A short description of the steps taken by a task instance, from scheduling to success, in a distributed architecture.
- Remote spark-submit to YARN running on EMR - Azhaguselvan walks through submitting Spark jobs to existing EMR clusters with Airflow.
- Running Airflow on top of Apache Mesos and its follow-up, Mesos, Airflow & Docker, by Agraj Mangal - A quick overview of running Airflow atop Apache Mesos.
- Dustin Stansbury of Quizlet has written a four-part series that covers what workflow managers do in general, how Quizlet picked Airflow, a tour of Airflow's key concepts, and how Quizlet is now using Airflow in practice:
- Integrating Apache Airflow with Databricks - While this tutorial is focused specifically on Databricks' Spark solutions, it does have a reasonable overview of Airflow basics and demonstrates how a third party solution can quickly integrate into Airflow.
- Apache Airflow 2.0 Tutorial - This article covers the basic concepts behind Airflow and the problems it solves.
- Testing and debugging Apache Airflow - Article explaining how to apply unit testing, mocking and debugging to Airflow code.
- Get started developing workflows with Apache Airflow - This brief introductory tutorial covers how to create a data pipeline and processing workflow using DAGs, operators, and sensors, and how to use XComs to communicate between operators.
- Get started with Airflow + Google Cloud Platform + Docker - Step-by-step introduction by Jayce Jiang.
- How to develop data pipeline in Airflow through TDD (test-driven development) - Learn how to build a sales data pipeline step-by-step using TDD and, finally, how to configure a simple CI workflow using GitHub Actions.
Airflow Summit 2020 videos
The first Airflow Summit was held in July 2020. It was a truly global, fully online event co-hosted by 9 Airflow Meetups from all over the world (Melbourne, Tokyo, Bangalore, Warsaw, Amsterdam, London, NYC, Bay Area).
It featured 40+ talks and three workshops. You can check out the talk recordings as a YouTube Airflow Summit 2020 Playlist or see the individual talks here:
- Keynote: Airflow then and now
- Scheduler as a service - Apache Airflow at EA Digital Platform
- Keynote: How large companies use Airflow for ML and ETL pipelines
- Data DAGs with lineage for fun and for profit
- Airflow on Kubernetes: Containerizing your workflows
- Data flow with Airflow @ PayPal
- Democratised data workflows at scale
- Migrating Airflow-based Spark jobs to Kubernetes - the native way
- Keynote: Future of Airflow
- Run Airflow DAGs in a secure way
- Keynote: Making Airflow a sustainable project through D&I
- Airflow CI/CD: Github to Cloud Composer (safely)
- Advanced Apache Superset for Data Engineers
- Demo: Reducing the lines, a visual DAG editor