
This is the third homework assignment to get you familiar with machine learning tools. We're introducing another type of machine learning: unsupervised learning. Specifically, we are performing outlier detection with unsupervised learning.

Recall that you went to the employee party last Friday and won the wine tasting competition. As a reward, you earned a week of paid vacation. But, due to COVID-19, you have nowhere to go, so you are just lying on the couch, watching Netflix. All of a sudden you receive a message from your NBA team manager friend, and the conversation goes like this:

1 Dataset

Ensure you've downloaded the latest version of the data set from Piazza.

We are using data from the Basketball Reference website: a data set describing NBA players in the 2020-2021 season. For more details, consult:

https://www.basketball-reference.com/leagues/NBA_2021_per_game.html

The file is provided in a format known as Comma Separated Values (CSV). You can open it as raw text to take a look! The data have 30 columns of features describing the players. The fields are strings, integers, and floating-point numbers. This data set does have missing values.

This dataset doesn't contain any labels because it is meant for unsupervised learning.

You can use any method to load in the CSV file. Refer to the Getting Started guide if you are not sure about this step.
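As one possible starting point, the sketch below loads a CSV and converts cells to floating-point numbers using only Python's standard library. The names `load_rows` and `to_float`, the made-up inline sample, and the choice to replace missing values with 0.0 are assumptions made for illustration, not requirements of the assignment.

```python
import csv
import io

def load_rows(csv_file):
    """Read an open CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(csv_file))

def to_float(value, missing=0.0):
    """Convert one CSV cell to float; empty or non-numeric cells become `missing`."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return missing

# Tiny made-up sample in the same layout as the real file (not real player data).
sample = io.StringIO(
    "Player,Pos,Age,PTS,3P%\n"
    "Player A,C,21,5.0,\n"        # trailing empty cell = missing 3P%
    "Player B,PG,30,27.4,.421\n"
)
rows = load_rows(sample)

# Keep only a few numeric columns; each row becomes a list of floats.
numeric = [[to_float(r[c]) for c in ("Age", "PTS", "3P%")] for r in rows]
print(numeric)
```

For the real file, the same function can be called as `load_rows(open(path, newline=""))`, where `path` is wherever you saved the download. Note that `to_float` maps any non-numeric string to the `missing` value, so categorical columns such as Pos or Tm would need their own encoding (for example, one number per distinct value) if you decide to keep them.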

2 Recommended Flow

1. Make a function to load the data.
2. Clean the dataset so that all string entries and missing values are converted to floating-point numbers.
3. Make a function to plot the data, to see the distribution of each feature. (Hint: Histogram and Cumulative distribution function (CDF).) Identify three important features. (Hint: If you don't know which features are important, you can try the 3 features mentioned in the introduction.) Approximate the percentage of outlier samples.
4. Make a function to alter/preprocess the data (delete the features that are not important, normalize the data, and so on).
5. Train an Outlier Detection model.
6. Score each player based on the model's outlier score. Identify the top three outliers.
7. Verify the top three outliers by plotting the players' feature values onto the graphs from step 3. Check the location of the top 3 outliers with a 3-D scatter plot (the axes of the plot should be the three selected features from step 3).
8. Save the outlier scores and the plots used to verify the top three outliers.
9. Repeat the previous steps until satisfied.
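The plotting and preprocessing steps in the flow above rely on two small computations: an empirical CDF for each feature and a normalization of the selected features. A minimal sketch with NumPy follows; the function names `empirical_cdf` and `standardize` and the sample values are hypothetical, and zero-mean/unit-variance scaling is only one of several reasonable normalizations.

```python
import numpy as np

def empirical_cdf(values):
    """Return the sorted values and, for each, the fraction of samples <= it."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

def standardize(X):
    """Scale each column to zero mean and unit variance.

    Constant columns have zero standard deviation; they are left at 0
    instead of dividing by zero.
    """
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0
    return (X - X.mean(axis=0)) / std

# Made-up points-per-game values, only to show the shape of the output.
pts = [2.0, 4.0, 4.0, 10.0, 30.0]
x, y = empirical_cdf(pts)
print(x, y)

Z = standardize([[1.0, 5.0], [3.0, 5.0]])
print(Z)
```

The `(x, y)` pairs can be drawn as a step plot; a long flat tail near the top or bottom of the curve is one visual cue for estimating what fraction of players sit far from the rest.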

3 Problem Formulation

Input	Data Type	Description
Rk	Integer	Rank Based on Last Name
Player	String	Name of the NBA Player
Pos	String	Position of the NBA Player
Age	Integer	Age of the NBA Player
Tm	String	Team of the NBA Player
G	Integer	Number of Games Played
GS	Integer	Number of Games Started with the Player on the Court
MP	Floating-Point	Minutes Played Per Game
FG	Floating-Point	Field Goals Per Game
FGA	Floating-Point	Field Goal Attempts Per Game
FG%	Floating-Point	Field Goal Percentage
3P	Floating-Point	3-Point Field Goals Per Game
3PA	Floating-Point	3-Point Field Goal Attempts Per Game
3P%	Floating-Point	3-Point Field Goal Percentage
2P	Floating-Point	2-Point Field Goals Per Game
2PA	Floating-Point	2-Point Field Goal Attempts Per Game
2P%	Floating-Point	2-Point Field Goal Percentage
eFG%	Floating-Point	Effective Field Goal Percentage
FT	Floating-Point	Free Throws Per Game
FTA	Floating-Point	Free Throw Attempts Per Game
FT%	Floating-Point	Free Throw Percentage
ORB	Floating-Point	Offensive Rebounds Per Game
DRB	Floating-Point	Defensive Rebounds Per Game
TRB	Floating-Point	Total Rebounds Per Game
AST	Floating-Point	Assists Per Game
STL	Floating-Point	Steals Per Game
BLK	Floating-Point	Blocks Per Game
TOV	Floating-Point	Turnovers Per Game
PF	Floating-Point	Personal Fouls Per Game
PTS	Floating-Point	Points Per Game

Your task is to build an outlier model to identify the outlier players in the given CSV. An outlier player can be an extremely good player or an extremely bad player, depending on which end of the distribution graphs the player's stats fall.

You are required to use the Python scikit-learn library to construct your models. You are required to use the following three methods:

One-Class Support Vector Machine (SVM)
Elliptic Envelope
Isolation Forest

However, you can experiment with any other algorithms you find interesting! More methods are described in the scikit-learn documentation (Outlier Detection in scikit-learn).

For every algorithm, train it on the data and give an outlier score to each player. Identify the outlier players based on the outlier model's scores. Then, pick the top three outliers and verify that they are indeed outliers by using the CDF plot and the 3D scatter plot. Make a new CSV file, MODEL NAME Scores.csv, with two columns: the player's name and the player's outlier score. The CSV should be sorted from min to max based on the player's outlier score. Submit the CSV files with your report.
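The scoring and export step could be sketched as follows with the three required scikit-learn estimators. Everything here beyond the estimator classes themselves is an assumption for illustration: the helper names `score_players` and `save_scores`, the synthetic data standing in for three selected features, the placeholder hyperparameters (`nu=0.05`, `contamination=0.05`), the column headers, and the use of a temporary output folder. In scikit-learn, `score_samples` returns lower values for more abnormal samples, so sorting from min to max puts the strongest outliers first.

```python
import csv
import os
import tempfile

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

def score_players(names, X, model):
    """Fit `model` on X and return (name, score) pairs sorted from min to max."""
    model.fit(X)
    scores = model.score_samples(X)   # lower score = more abnormal
    order = np.argsort(scores)        # ascending, so outliers come first
    return [(names[i], float(scores[i])) for i in order]

def save_scores(path, ranked):
    """Write the two-column CSV: player name and outlier score."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Player", "Outlier Score"])
        writer.writerows(ranked)

# Synthetic stand-in for three normalized features (not real player data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0] = [8.0, 8.0, 8.0]                # one planted, obviously extreme sample
names = [f"Player {i}" for i in range(len(X))]

# Placeholder hyperparameters; tune them from your own distribution plots.
models = {
    "OneClassSVM": OneClassSVM(nu=0.05, gamma="scale"),
    "EllipticEnvelope": EllipticEnvelope(contamination=0.05, random_state=0),
    "IsolationForest": IsolationForest(contamination=0.05, random_state=0),
}

out_dir = tempfile.mkdtemp()          # replace with your own output folder
ranked = {}
for model_name, model in models.items():
    ranked[model_name] = score_players(names, X, model)
    save_scores(os.path.join(out_dir, f"{model_name} Scores.csv"), ranked[model_name])
    print(model_name, ranked[model_name][:3])   # three lowest-scoring players
```

Keeping the fit-and-rank logic in one helper means any additional estimator that exposes `score_samples` can be added to the `models` dictionary without other changes.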

You must also write a brief report answering the following questions:

Describe in your own words what Outlier Detection and Novelty Detection are. How are they different? (Hint: training samples) Does our current problem belong to Outlier Detection or Novelty Detection, and why?

Explain the data preprocessing/transformation methods you applied.
Show the distribution of the feature values. Use a histogram (Number of Occurrences vs Feature Values) and a CDF (Percentage of Players vs Feature Values). Pick three features to train your models. Approximately what percentage of the players are outliers, and how did you come to that conclusion based on the plots?
Describe in your own words how each of the three algorithms creates a model. How does the model decide the boundary between inliers and outliers? How are the outlier scores calculated? Explain your choice of hyperparameters, if any.
For each algorithm, pick the top three outliers, verify that they are indeed outliers, and check whether each player is an outlier due to being bad or being good. Explain your reasoning with the model's outlier scores and the CDF plots.
For each algorithm, show a 3-D scatter plot marking the inliers, the outliers, and the top three outliers' data points. Describe how the inlier region differs from the outlier region.
Could this model be used to predict outlier players in the next season? Justify your answer and state your assumptions.
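For the 3-D scatter plot requested in the report, one possible sketch with Matplotlib is shown below. It takes a feature matrix and any array of outlier scores (lower = more abnormal) and marks inliers, outliers, and the top three outliers. The function name `plot_outliers_3d`, the 5% outlier fraction, the axis labels, and the distance-based stand-in scores are all assumptions for illustration; in practice the scores would come from your fitted models.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

def plot_outliers_3d(X, scores, feature_names, outlier_fraction=0.05, top_k=3):
    """Scatter three features in 3-D, marking inliers, outliers, and the top_k outliers.

    Returns the figure and the row indices of the top_k outliers.
    """
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)                  # most abnormal (lowest score) first
    n_out = max(top_k, int(round(outlier_fraction * len(scores))))
    top, rest, inliers = order[:top_k], order[top_k:n_out], order[n_out:]

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(*X[inliers].T, c="tab:blue", label="inliers")
    ax.scatter(*X[rest].T, c="tab:orange", label="outliers")
    ax.scatter(*X[top].T, c="tab:red", marker="*", s=120,
               label=f"top {top_k} outliers")
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    ax.set_zlabel(feature_names[2])
    ax.legend()
    return fig, top

# Synthetic stand-in data (not real player statistics).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[0] = [8.0, 8.0, 8.0]                          # one planted extreme sample

# Stand-in scores: negative distance from the mean, so far-away points score lowest.
scores = -np.linalg.norm(X - X.mean(axis=0), axis=1)

# Axis labels are placeholders; use the three features you selected.
fig, top = plot_outliers_3d(X, scores, ("PTS", "AST", "TRB"))
print(top)
```

Calling `fig.savefig(...)` with a file name of your choice stores the plot for the report.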

[Solved] ECE272A Homework 3