This lab demonstrates how to build a training pipeline that identifies the type of Acne-Rosacea condition, with a confidence score, by training a model on images scraped from dermnet.com. The front-end application uses Streamlit to serve predictions from the trained model.
Orchestration with Apache Airflow
Airflow is a platform to programmatically author, schedule and monitor workflows.
In Airflow, a DAG (Directed Acyclic Graph) is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. Workflows are authored as DAGs of tasks, and the Airflow scheduler executes those tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap.
The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
Dataset
The dataset used for this lab is from Dermnet. Dermnet is the largest independent photo dermatology source dedicated to online medical education through articles, photos and video. Dermnet provides information on a wide variety of skin conditions through innovative media. A photo dictionary is available for a range of skin problems; this lab uses the Acne and Rosacea photos.
Experiment Setup
We completed the following prerequisite setup for this lab:
- Install the dependencies listed in requirements.txt by running
pip install -r requirements.txt
- Install Airflow in the virtual environment
pip install apache-airflow
- Change the bucket name in s3_uploader/upload_models.py
- Configure Airflow using the following commands:
Use the current directory as $AIRFLOW_HOME
export AIRFLOW_HOME=/home/bigdata/Documents/PyCharmProjects/airflow_cnn_pipeline
Initialize the database
airflow db init
Create credentials to access the Airflow server
airflow users create \
    --username admin \
    --firstname YourName \
    --lastname YourLastName \
    --role Admin \
    --email [email protected]
Start the webserver to access the Airflow server
airflow webserver -D
Before running the scheduler, make sure your DAG code is inside the dags folder. In our case, train_model.py contains the DAG code, so it must be placed there. If an earlier Airflow process is still running, find its pid with the following command, which lists the process listening on port 8080 (the webserver's default port), and then stop it with kill -9 <pid>
lsof -i tcp:8080
Once this configuration is done, start the Airflow scheduler
airflow scheduler
Test Cases
- After we log in to the Airflow webserver at http://localhost:8080/login/, the CNN-Training-Pipeline appears in the list of DAGs. The graph view gives a detailed view of the workflow tasks. We then trigger the workflow to start running the tasks in sequence:
Airflow chains all the individual processes (tasks). The pipeline is scheduled to run at a predefined cadence; each run retrains the model on scraped data and uploads the trained graph and labels to S3.
Task 1: UploadModels
This task uses boto3 to upload the retrained graph (retrained_graph_v2.pb) and labels (retrained_labels.txt) from the local system to the /model folder of the AWS S3 bucket named in bucket_name.
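The upload step can be sketched roughly as follows. This is a minimal illustration, not the contents of s3_uploader/upload_models.py: the function name upload_models, its parameters, and the "model" key prefix are assumptions made for the example. It expects a boto3 S3 client (for instance the object returned by boto3.client("s3")) to be passed in, and calls that client's upload_file method once per file.

```python
from pathlib import Path

# File names taken from the task description above.
MODEL_FILES = ("retrained_graph_v2.pb", "retrained_labels.txt")


def upload_models(s3_client, bucket_name, local_dir=".", prefix="model"):
    """Upload the graph and label files to s3://<bucket_name>/<prefix>/.

    s3_client is assumed to be a boto3 S3 client; prefix is an assumed
    key prefix standing in for the /model folder.
    """
    uploaded_keys = []
    for name in MODEL_FILES:
        local_path = str(Path(local_dir) / name)
        key = f"{prefix}/{name}"
        # boto3's S3 client signature: upload_file(Filename, Bucket, Key)
        s3_client.upload_file(local_path, bucket_name, key)
        uploaded_keys.append(key)
    return uploaded_keys


# Hypothetical usage, with boto3 installed and AWS credentials configured:
#   import boto3
#   upload_models(boto3.client("s3"), "your-bucket-name")
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object before pointing it at a real bucket.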
Task 2: ScrapeData
This task scrapes dermnet.com using BeautifulSoup in the get_data() function and downloads the scraped images into the ScrapedData-Acne-and-Rosacea-Photos directory.
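The core of a scraping step like this is collecting image URLs from a page's HTML. The sketch below is not the lab's get_data() function. To stay dependency-free it uses Python's standard html.parser module instead of BeautifulSoup (with BeautifulSoup, the equivalent step is typically soup.find_all("img")). The class name, function name, and sample URL are all hypothetical, and fetching the page and saving the images are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect absolute URLs from the src attribute of every <img> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Return the image URLs found in an HTML string, in document order."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


# Hypothetical page fragment, used only to show the call.
sample_html = '<div><img src="/images/a.jpg"><img alt="no source"></div>'
print(extract_image_urls(sample_html, "https://example.com/gallery/"))
```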
Task 3: Cleanup
This task removes all the empty directories in the ScrapedData-Acne-and-Rosacea-Photos folder, that is, those that do not contain any images.
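A cleanup step of this kind can be sketched with the standard library alone. The function name remove_empty_dirs is hypothetical, and the sketch assumes that "empty" means a directory with no entries at all. A directory holding only non-image files would be left in place.

```python
import os


def remove_empty_dirs(root):
    """Delete every empty directory below root and return the removed paths.

    The root directory itself is never removed.
    """
    root = os.fspath(root)
    removed = []
    # Walk bottom-up so a parent is examined after its children are handled.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        if dirpath == root:
            continue
        # Re-list the directory: children may have been removed already.
        if not os.listdir(dirpath):
            os.rmdir(dirpath)
            removed.append(dirpath)
    return removed


# Hypothetical usage:
#   remove_empty_dirs("ScrapedData-Acne-and-Rosacea-Photos")
```

Walking bottom-up means a directory that contains only empty subdirectories is removed as well, since it is empty by the time it is checked.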
Task 4: TrainModel
This task further trains the retrained model stored in S3 on the newly scraped images from dermnet.com.
Task 5: UploadModelsPostTraining
This task uploads the model newly retrained on the scraped data back to S3.
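To make the ordering of the five tasks concrete, the sketch below resolves their dependencies with Python's standard graphlib module. It is not Airflow code and not the lab's train_model.py. It assumes the tasks form a simple linear chain in the numbered order above, and it only illustrates how a DAG of dependencies determines an execution order. In an Airflow DAG file the same dependencies are usually declared between operators with the >> operator.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
# The linear chain is an assumption based on the task numbering.
dependencies = {
    "UploadModels": set(),
    "ScrapeData": {"UploadModels"},
    "Cleanup": {"ScrapeData"},
    "TrainModel": {"Cleanup"},
    "UploadModelsPostTraining": {"TrainModel"},
}


def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(" -> ".join(order))
```

If a cycle were introduced, for example by making UploadModels depend on UploadModelsPostTraining, static_order() would raise graphlib.CycleError. This is why the graph must be acyclic.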
Results
Once the Airflow DAG completes successfully, the individual tasks are highlighted in dark green in the following views:
Graph View:
Tree View:
We can validate the retrained model by running the Streamlit app (http://localhost:8501), which calculates a confidence score for the acne condition in each newly uploaded image, based on the model retrained with the scraped images.
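The way confidence scores relate to labels can be sketched as below. This is not the Streamlit app's code: the function name top_predictions, the label names, and the score values are hypothetical. It assumes the model returns one score per line of retrained_labels.txt, in the same order.

```python
def top_predictions(labels, scores, k=3):
    """Pair each label with its score and return the k highest-scoring pairs."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical labels and scores, for illustration only.
labels = ["acne", "rosacea", "perioral dermatitis"]
scores = [0.12, 0.81, 0.07]

for label, score in top_predictions(labels, scores, k=2):
    print(f"{label}: {score:.1%}")
```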
Lessons Learned
- Orchestrating tasks in a pipeline using Apache Airflow
- Crawling data from the web using BeautifulSoup
- Using a Streamlit app for inference and validating the retrained model's confidence scores on new images