This lab demonstrates how GCP services such as Datalab, Dataflow, and BigQuery can be used for data analysis and preprocessing for machine learning.
Google Cloud Platform (GCP), offered by Google, is a suite of cloud computing services that runs on the same infrastructure that Google uses internally for its end-user products, such as Google Search, Gmail, file storage, and YouTube. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics and machine learning.
Google Cloud Platform provides infrastructure as a service, platform as a service, and serverless computing environments.
Cloud Datalab is a powerful interactive tool created to explore, analyze, transform and visualize data and build machine learning models on Google Cloud Platform. It runs on Google Compute Engine and connects to multiple cloud services easily so you can focus on your data science tasks.
Google Cloud Dataflow is a cloud-based data processing service for both batch and real-time data streaming applications. It enables developers to set up processing pipelines for integrating, preparing and analyzing large data sets, such as those found in Web analytics or big data analytics applications.
Storing and querying massive datasets can be time-consuming and expensive without the right hardware and infrastructure. BigQuery is an enterprise data warehouse that solves this problem by enabling fast SQL queries using the processing power of Google's infrastructure.
Dataset
We used the public Natality dataset to create an ML model that predicts a baby's weight given a number of factors about the pregnancy and the baby's mother.
We cloned the https://github.com/GoogleCloudPlatform/training-data-analyst GitHub repository and used the training-data-analyst/blogs/babyweight/babyweight.ipynb notebook for our data processing and model creation.
Experiment Setup
Prerequisites
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.
- Enable the BigQuery, AI Platform, Cloud Source Repositories, Dataflow, and Datalab APIs.
Launching Datalab
We followed these steps to launch Datalab:
- Opened the Cloud Shell Editor.
- Retrieved the Google Cloud project ID from the Cloud Shell command line.
- Created a Datalab instance.
- Established the connection; the Datalab instance became available on port 8081.
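The launch steps above can be sketched as the two Cloud Shell commands they correspond to. This is only an illustration: the exact commands used in the lab are not recorded here, so the instance name `my-datalab` and the zone `us-central1-a` are hypothetical placeholders, and `gcloud config get-value project` is one common way to read the active project ID. The helper functions only build and print the command lines; they do not run them.

```python
import shlex


def project_id_command():
    """Command that prints the project ID of the active gcloud configuration."""
    return ["gcloud", "config", "get-value", "project"]


def datalab_create_command(instance_name, zone):
    """Command that creates a Datalab instance and connects to it.

    instance_name and zone are placeholders chosen by the user.
    """
    return ["datalab", "create", instance_name, "--zone", zone]


# Print the commands so they can be pasted into Cloud Shell.
# "my-datalab" and "us-central1-a" are hypothetical example values.
print(shlex.join(project_id_command()))
print(shlex.join(datalab_create_command("my-datalab", "us-central1-a")))
```

Once the instance is up, the Datalab UI is reached through the Cloud Shell web preview on port 8081, as noted in the last step.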
Cloning Datalab Notebook
- Opened a new notebook and ran the following command:
!git clone https://github.com/GoogleCloudPlatform/training-data-analyst
Use Cases
Data Exploration, Preprocessing and Visualization in Datalab
Project ID and Bucket setup in notebook
- In the first cell, set the variable PROJECT to your project ID.
- In the same cell, set the variable BUCKET to your bucket name. For the bucket name, use your project ID as a prefix followed by -my-bucket, for example project-ID-my-bucket.
- Leave REGION as us-central1.
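As a minimal sketch, the first cell could look like the following. The project ID `my-project-id` is a hypothetical placeholder to replace with your own, and exporting the values as environment variables is an assumption made so that shell cells in the notebook can read them too.

```python
import os

# Hypothetical placeholder: replace with your own project ID.
PROJECT = "my-project-id"
# Bucket name follows the project-ID-my-bucket convention described above.
BUCKET = f"{PROJECT}-my-bucket"
# Region is left at the default used in this lab.
REGION = "us-central1"

# Assumption: export the values so that shell cells can read them as well.
for name, value in (("PROJECT", PROJECT), ("BUCKET", BUCKET), ("REGION", REGION)):
    os.environ[name] = value
```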
Fetching data into a dataframe using BigQuery
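The fetch step can be sketched roughly as below. This is not the notebook's exact code: it swaps in the `google-cloud-bigquery` client library, and the column selection, the `year > 2000` filter, and the function names are assumptions made for illustration. The table `bigquery-public-data.samples.natality` is the public Natality sample.

```python
def build_natality_query(min_year=2000, limit=None):
    """Return a SQL string selecting a few natality columns (assumed selection)."""
    query = (
        "SELECT weight_pounds, is_male, mother_age, plurality, gestation_weeks\n"
        "FROM `bigquery-public-data.samples.natality`\n"
        f"WHERE year > {int(min_year)}"
    )
    if limit is not None:
        query += f"\nLIMIT {int(limit)}"
    return query


def fetch_dataframe(project, query):
    """Run the query in the given project and return a pandas DataFrame."""
    # Imported here so the query builder works without the client library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    return client.query(query).to_dataframe()
```

Calling `fetch_dataframe(PROJECT, build_natality_query(limit=100))` from the notebook would then return a small sample to inspect before running larger queries.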
Visualizing the count and average weight of babies
Visualizing the correlation between mother's age and the number of babies and their average weight
Visualizing plurality trends
Correlation between gestation period and babies' weight and count
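Each of the views above is a count and an average weight grouped by one column. A hedged sketch of how such a query could be generated is shown below; the function name, the output column names `num_babies` and `avg_weight`, and the `year > 2000` filter are assumptions for illustration, not taken from this write-up.

```python
# Columns of the natality table that the views above group by.
GROUPABLE_COLUMNS = ("is_male", "mother_age", "plurality", "gestation_weeks")


def build_summary_query(column_name, min_year=2000):
    """Return SQL giving the number of babies and average weight per column value."""
    if column_name not in GROUPABLE_COLUMNS:
        raise ValueError(f"unsupported column: {column_name}")
    return (
        f"SELECT {column_name},\n"
        "  COUNT(weight_pounds) AS num_babies,\n"
        "  AVG(weight_pounds) AS avg_weight\n"
        "FROM `bigquery-public-data.samples.natality`\n"
        f"WHERE year > {int(min_year)}\n"
        f"GROUP BY {column_name}\n"
        f"ORDER BY {column_name}"
    )
```

With the result loaded into a pandas DataFrame `df`, a call such as `df.plot(x="mother_age", y="avg_weight", kind="bar")` would produce a bar chart for one of these views.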
Preprocessing using Apache Beam
We modified the data to simulate what is known when no ultrasound has been performed. If no preprocessing had been needed, the web console would have sufficed. We also prefer scripting the work over running queries in the user interface, so we used Cloud Dataflow for the preprocessing.
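The idea of simulating a missing ultrasound can be sketched as a row-level function: without an ultrasound, the baby's sex is unknown and the plurality is only known as single or multiple. The code below is an illustration under stated assumptions, not the exact transformation run in the lab; the label strings, the function name, and the choice to emit both a "no ultrasound" and a "with ultrasound" record per input row are assumptions.

```python
import copy

# Natality columns written to each CSV line, in order.
CSV_COLUMNS = ["weight_pounds", "is_male", "mother_age", "plurality", "gestation_weeks"]
# Assumed labels for plurality values 1 through 5.
PLURALITY_LABELS = ["Single(1)", "Twins(2)", "Triplets(3)", "Quadruplets(4)", "Quintuplets(5)"]


def to_csv_rows(row):
    """Yield two CSV lines per input row: without and with ultrasound information."""
    # Without an ultrasound: sex is unknown, plurality is only single vs. multiple.
    no_ultrasound = copy.deepcopy(row)
    no_ultrasound["is_male"] = "Unknown"
    no_ultrasound["plurality"] = "Multiple(2+)" if row["plurality"] > 1 else "Single(1)"

    # With an ultrasound: keep the sex and use the exact plurality label.
    with_ultrasound = copy.deepcopy(row)
    with_ultrasound["plurality"] = PLURALITY_LABELS[row["plurality"] - 1]

    for record in (no_ultrasound, with_ultrasound):
        yield ",".join(str(record.get(col, "None")) for col in CSV_COLUMNS)
```

In an Apache Beam pipeline, a generator like this could be applied to each BigQuery row with `beam.FlatMap(to_csv_rows)` before the lines are written to Cloud Storage.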
Results
The job took around 48 minutes to finish.
Throughput Metrics
CPU Utilization
The data was loaded into the GCP bucket in CSV format.
Lessons Learned
- Learned to set up a project in Google Cloud and launch Datalab from the Cloud Shell Editor using shell commands.
- Explored data in Datalab, wrote BigQuery queries to load data into a dataframe, and visualized the results.
- Preprocessed data using Dataflow.