This lab demonstrates how Airflow can be used to programmatically author, automate, schedule, and monitor workflows.
- Airflow is a platform to programmatically automate, schedule and monitor workflows.
- In Airflow, a DAG or a Directed Acyclic Graph is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.
- The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.
- Rich command line utilities make performing complex surgeries on DAGs a snap.
- The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.
Experiment Setup
- Create a new project and install the required dependencies.
pip install apache-airflow
- Set AIRFLOW_HOME to the present working directory
export AIRFLOW_HOME=$(pwd)
- Initialize the Airflow metadata database
airflow db init
- Create an admin user
airflow users create \
    --username admin \
    --firstname YourName \
    --lastname YourLastName \
    --role Admin \
    --email [email protected]
- Start the web server daemon in the background.
airflow webserver -D
It usually runs on port 8080
- To check whether the Airflow web server daemon is running:
List the services running on port 8080
lsof -i tcp:8080
- Start the scheduler
airflow scheduler
Check the web server on 127.0.0.1:8080
- Create a dags folder inside AIRFLOW_HOME
Place the DemoDag Python file under the dags folder (a minimal sketch of such a file follows).
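As a rough sketch, a minimal DemoDag.py might look like the following; the DAG id demo_dag, the task name, and the schedule are illustrative placeholders, not the exact file used in the lab:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Hypothetical minimal DAG; id, schedule, and command are placeholders.
    with DAG(
        dag_id="demo_dag",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        task_1 = BashOperator(
            task_id="task_1",
            bash_command="echo 'hello from task_1'",
        )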
- Kill and restart the scheduler and the web server so the new DAG shows up on the web server
lsof -i tcp:8080
Kill the PID of the process listening on port 8080
- Start the web server again with airflow webserver -D
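For example, one way to perform this restart, assuming the lsof and kill utilities are available and the PID listed on port 8080 belongs to the Airflow web server:

    kill $(lsof -t -i tcp:8080)    # stop the process listening on port 8080
    airflow webserver -D           # start the web server daemon again
    airflow scheduler              # restart the scheduler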
- The DemoDag DAG can now be seen in the DAGs list on the web UI.
- DAGs can be scheduled to run at a chosen interval, e.g. every minute, hourly, or daily
- You can also pause/unpause a DAG depending on the requirement
- Trigger your DAG manually when needed (example commands are shown after this list)
- Check the task logs for additional information
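These actions can also be performed from the command line, roughly as follows (demo_dag is the hypothetical DAG id from the sketch above):

    airflow dags unpause demo_dag    # enable scheduling for the DAG
    airflow dags pause demo_dag      # pause it again
    airflow dags trigger demo_dag    # trigger a manual run
    airflow dags list                # list the DAGs Airflow has picked up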
- Adding tasks to a DAG
Adding task_2 by making changes in the DAG code and clicking the Update button (see the sketch below)
Checking logs for additional information
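A rough sketch of this change, continuing the hypothetical DemoDag.py above (the task id and command are illustrative):

    # Added inside the same `with DAG(...)` block as task_1.
    task_2 = BashOperator(
        task_id="task_2",
        bash_command="echo 'hello from task_2'",
    )

    # task_2 runs only after task_1 succeeds.
    task_1 >> task_2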
- Restructuring the code
Airflow TFX
- Install the dependencies listed in requirements.txt
pip install -r requirements.txt
- You can now see the taxi_pipeline DAG in the dags folder.
- The Tree view shows the taxi_pipeline tasks and the status of their runs.
- Successful execution of the taxi pipeline.
- ExampleGen ingests and splits the input dataset.
- StatisticsGen calculates statistics for the dataset.
- SchemaGen examines the statistics and creates a data schema.
- ExampleValidator looks for anomalies and missing values in the dataset.
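A rough sketch of how these components are chained in a TFX pipeline definition; keyword argument names vary slightly between TFX versions, so treat this as illustrative rather than the exact taxi_pipeline code:

    from tfx.components import (CsvExampleGen, StatisticsGen, SchemaGen,
                                ExampleValidator)

    data_root = "path/to/csv/data"  # placeholder location of the taxi CSV data

    # Ingest and split the input dataset.
    example_gen = CsvExampleGen(input_base=data_root)

    # Compute statistics over the ingested examples.
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

    # Infer a data schema from the statistics.
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

    # Check the data for anomalies and missing values against the schema.
    example_validator = ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'],
    )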
Lessons learned
- This lab helps us understand how Airflow allows users to create workflows with high granularity and track their progress as they execute.
- Airflow enables us to have a platform that can run and automate all the jobs on a schedule.
- You can also add/transform jobs as and when required.