Welcome to the first homework assignment of the quarter!

Over the course of the quarter, we're going to be helping you develop tools that apply machine learning and data science to real-world problems. This first assignment starts that off by teaching the initial phases of developing a tool: exploring the data space of your problem and writing scripts to train models.

For this assignment, you work at Insurance Corp Incorporated, a company in the medical insurance space. You've just received an email from your boss:

From: [email protected]

To: [email protected]

Subject: New development direction

Hey, legal has just approved a new diabetes study for use in our tools. I'd like you to take a dig through it and try to make some classifiers to predict patient outcomes from the diagnostic data.

Since we're only prototyping for now, you can just load and save the data in CSV, manipulate it in NumPy, plot in Matplotlib, and crib the classifiers from scikit-learn.

I've attached the study data to this email. I've also pulled some files for patients who are due for tests before the due date I've set on this data exploration. See if you can predict how those tests are going to turn out early.

When you've got a handle on the modeling, get me a written report on which classifier model you think is our best bet. I hear legal's getting into a fight with the government over some of our models not being explainable, so try a DecisionTree and, if that's not good enough, get back to me on why.

Hope to see you in the happy hour Zoom on Friday!

We are not planning a real Zoom happy hour; the above email is fictional.

2 Dataset

We are using data from the Pima Indians Diabetes Database, a dataset of medical history metrics and diabetes outcomes (who got diabetes and who did not). We're making use of a subset, narrowed to females of at least 21 years of age.

The file is provided in a format known as Comma Separated Values (CSV). You can open it as raw text to take a look! The data columns we have are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome. All of these fields are numeric, but not every field has data for every patient. Where the data is missing, the field has been left at 0.

Outcome is the one you want to predict! If it's a 1, the patient had or developed diabetes during the period of the study. If it's a 0, they did not.
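Because missing values are stored as 0, it can help to count the zeros in each column before modeling. The sketch below uses NumPy on a few invented rows (not records from the study). The list of columns where a zero is treated as "missing" is an assumption made for illustration; a zero in Pregnancies, for example, may well be a real value.

```python
import numpy as np

# Column order taken from the assignment text. The rows are invented
# placeholder values, not records from the study.
column_headers = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                  "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
data = np.array([
    [2, 120, 70, 30,   0, 31.5, 0.40, 35, 0],
    [0, 150,  0,  0,   0, 28.0, 0.25, 50, 1],
    [5,  95, 64, 22, 110,  0.0, 0.60, 29, 0],
], dtype=float)

# Count zeros per feature column; Outcome is excluded because 0 is a real label.
features = data[:, :-1]
zero_counts = np.count_nonzero(features == 0, axis=0)
for name, count in zip(column_headers[:-1], zero_counts):
    print(f"{name}: {count} zero(s)")

# Assumption: zeros in these columns are not physically plausible, so they are
# marked as NaN here and can later be imputed or dropped.
implausible = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
cleaned = features.copy()
for name in implausible:
    col = column_headers.index(name)
    cleaned[cleaned[:, col] == 0, col] = np.nan
```

Whether to impute, drop, or keep these zeros is a modeling decision worth justifying in the report.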

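To load our data, we'll use Python's built-in csv module. The shortest snippet to load the whole file is:

    import csv

    # Open the study data and read every row into memory.
    with open("diabetes.csv", "r") as file:
        reader = csv.reader(file)
        column_headers = next(reader)  # first row holds the column names
        data_rows = list(reader)       # remaining rows, each a list of strings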

You're not required to use this snippet. If this is at all confusing, we discuss loading and preprocessing the data in more depth in the Getting Started guide.
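The csv module returns every field as a string, so the rows need converting to numbers and separating into samples and labels. A minimal sketch follows; the function name load_samples_and_labels is hypothetical, and the two demo rows are invented so the example runs without diabetes.csv.

```python
import csv
import io


def load_samples_and_labels(file):
    """Read a CSV with a header row; return (feature_names, samples, labels)."""
    reader = csv.reader(file)
    column_headers = next(reader)
    label_index = column_headers.index("Outcome")
    samples, labels = [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        values = [float(v) for v in row]
        labels.append(int(values.pop(label_index)))  # Outcome becomes the label
        samples.append(values)                       # the rest are features
    feature_names = [h for h in column_headers if h != "Outcome"]
    return feature_names, samples, labels


# Invented two-row stand-in for diabetes.csv.
demo = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "2,120,70,30,0,31.5,0.4,35,0\n"
    "0,150,0,0,0,28.0,0.25,50,1\n"
)
feature_names, samples, labels = load_samples_and_labels(demo)
print(feature_names)
print(samples[0], labels[0])
```

With the real file, the same function can be called inside a `with open("diabetes.csv", newline="") as file:` block, and the returned lists can be passed to `numpy.array` for further processing.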

3 Recommended Flow

Overall, we suggest that you:

Make a function to load your data and split it into samples and labels
Make a function to plot your data, to see what it looks like
Make a function to alter/preprocess your data (if you think you can feature engineer it!)
Make a function to split the given data between a set of training data and a set of validation data (See: train_test_split on scikit-learn)
Train a classifier (See: Supervised Learning on scikit-learn)
Measure its performance (See: Cross-Validation on scikit-learn)
Save plots/metrics of its performance
Repeat the previous three steps until satisfied
Predict the labels for the data in unknowns.csv using a trained classifier
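The split, train, and measure steps above can be sketched with scikit-learn as follows. This is only an illustration: the data is synthetic (generated with make_classification as a stand-in for the eight diagnostic features), and the 75/25 split, max_depth=3, and 5-fold cross-validation are arbitrary choices rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the samples (8 features) and the Outcome labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Hold out a validation set; stratify keeps the class balance in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Train a classifier on the training portion only.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Measure performance: accuracy on the held-out validation set...
val_accuracy = clf.score(X_val, y_val)

# ...and 5-fold cross-validated accuracy within the training set.
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X_train, y_train, cv=5)

print(f"validation accuracy: {val_accuracy:.3f}")
print(f"cross-validation accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

Swapping the synthetic X and y for the arrays loaded from diabetes.csv leaves the rest of the sketch unchanged.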

4 Problem Formulation

Input	Range	Description
Pregnancies	[0, ∞)	Number of pregnancies the patient has had
Glucose	[0, ∞)	Blood glucose level
BloodPressure	[0, ∞)	Blood pressure
SkinThickness	[0, ∞)	Thickness of the skin
Insulin	[0, ∞)	Blood insulin level
BMI	[0, ∞)	Body-mass index
DiabetesPedigreeFunction	[0, 1)	Familial history of diabetes
Age	[0, ∞)	Patient age

Output	Range	Description
Outcome	0 or 1	Prediction for whether the patient will get diabetes

Your task is to create a classifier that converts diagnostic and historic data about patients into a prediction for whether or not they will develop diabetes.

You are required to use the Python scikit-learn library to construct your models. You are required to use the DecisionTree classifier and at least two of the following others:

Linear Discriminant Analysis (LDA)
Naive Bayes
Nearest Neighbors
Support Vector Machine (SVM)

Not all of these methods can achieve good results! Links to the documentation for each of these classifiers are available on the Supervised Learning page of the scikit-learn documentation.
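One way to compare these classifiers side by side is to run each through the same cross-validation. The sketch below does this on synthetic data. The specific estimators chosen for each family (GaussianNB for Naive Bayes, KNeighborsClassifier for Nearest Neighbors, SVC for SVM), the feature scaling in front of the distance-based models, and every hyperparameter value are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the study data.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# Nearest Neighbors and SVM depend on distances, so they are paired with a
# scaler here; this pairing is a design choice, not an assignment requirement.
candidates = {
    "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
    "NaiveBayes": GaussianNB(),
    "NearestNeighbors": make_pipeline(StandardScaler(),
                                      KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Mean 5-fold cross-validated accuracy for every candidate.
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in candidates.items()}

for name, score in sorted(mean_scores.items(), key=lambda item: item[1],
                          reverse=True):
    print(f"{name}: {score:.3f}")
```

Rankings on synthetic data say nothing about the diabetes data; the point is only the shape of the comparison loop.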

After you have trained your algorithms and selected the one you think is best, train it on the whole training set, then predict on the data in unknowns.csv. Make a new CSV file, score.csv, with only one column, the predicted outcomes, and submit it to the autograded dropbox. Submit your report to the nonautograded dropbox.
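Writing the single-column file can be done with the same csv module used for loading. In this sketch the helper name write_predictions is hypothetical, the predictions are hard-coded placeholders, and the file is written to a temporary directory. The assignment text does not say whether the autograder expects a header row, so none is written here; check the submission instructions before relying on that.

```python
import csv
import os
import tempfile


def write_predictions(path, predictions):
    """Write one predicted outcome per row, with no header row."""
    with open(path, "w", newline="") as file:
        writer = csv.writer(file)
        for label in predictions:
            writer.writerow([int(label)])


# Placeholder predictions; in practice these come from clf.predict(...)
# applied to the samples loaded from unknowns.csv.
out_path = os.path.join(tempfile.mkdtemp(), "score.csv")
write_predictions(out_path, [0, 1, 1, 0])

# Read the file back to confirm its layout.
with open(out_path, newline="") as file:
    written = list(csv.reader(file))
print(written)
```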

You also must write a brief report answering the following questions:

Which algorithm did you decide was best?
Describe in your own words how each algorithm you used classifies patients.
Some models require setting hyperparameters (such as the SVM tolerance and kernel function, or the number of neighbors checked by Nearest Neighbors). Which hyperparameters did you have to tune? How did you decide on their values? Show at least one plot of a classifier's performance versus one of its hyperparameters.
When choosing your final model, what percentage split did you give between training and validation data? Why did you make that choice? Show at least one scatterplot marking mispredicted datapoints.
Show a diagram of your DecisionTree classifier's decision function. Does this decision function provide any hints about risk factors for diabetes?
Show all tested classifiers' results using confusion matrices over your validation set. Which models overfit to the data? Underfit? Which had the best accuracy?
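Several of these questions can be explored with a few scikit-learn calls. The sketch below sweeps one hyperparameter (the decision tree's max_depth, chosen only as an example), records training and validation accuracy at each value, then prints a confusion matrix and a text rendering of the chosen tree. The data is synthetic, and selecting the depth by validation accuracy is one possible approach, not a prescribed one.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Synthetic stand-in for the study data, with the same number of features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Sweep max_depth and record accuracy on both splits. A training score far
# above the validation score suggests overfitting; low scores on both
# suggest underfitting.
depths = list(range(1, 11))
train_scores, val_scores = [], []
for depth in depths:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_scores.append(tree.score(X_train, y_train))
    val_scores.append(tree.score(X_val, y_val))

# Refit at the depth with the best validation accuracy.
best_depth = depths[val_scores.index(max(val_scores))]
best_tree = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
best_tree.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_val, best_tree.predict(X_val))
print(cm)

# Text view of the learned decision rules.
print(export_text(best_tree, feature_names=feature_names))
```

Passing depths with train_scores and val_scores to Matplotlib's plot function gives a performance-versus-hyperparameter curve, and sklearn.tree.plot_tree draws the same tree as a diagram.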

The report need not be excessively long. If you're spending more time putting together the report than you spent playing with the algorithms and data, feel free to drop by TA office hours for clarification on what we're looking for!

[Solved] ECE272A Homework 1