This is the third homework assignment, intended to get you familiar with machine learning tools. We're introducing another type of machine learning: unsupervised learning. Specifically, we will perform outlier detection with unsupervised learning.
Recall that you went to the employee party last Friday and won the wine tasting competition. As a reward, you earned a week of paid vacation. But due to COVID-19 you have nowhere to go, so you are just lying on the couch, watching Netflix. All of a sudden you receive a message from your friend, an NBA team manager, and the conversation goes like this:
1 Dataset
Ensure you've downloaded the latest version of the data set from Piazza.
We are using data from the Basketball Reference website: a data set describing NBA players in the 2020-2021 season. For more details, consult:
https://www.basketball-reference.com/leagues/NBA_2021_per_game.html.
The file is provided in a format known as Comma Separated Values (CSV). You can open it as raw text to take a look! The data have 30 columns of features describing the players. These fields come in strings, integers, and floating-point numbers. This data set does have missing values.
This data set doesn't contain any labels because it is meant for unsupervised learning.
You can use any method to load in the CSV file. Refer to the Getting Started guide if you are not sure about this step.
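As a minimal sketch of the loading and cleaning steps, assuming pandas is available: the helper names `load_data` and `clean_data` are hypothetical, the rows in `TOY_CSV` are made up for illustration (they are not taken from the real file), and both the integer encoding of strings and the choice to fill missing values with 0 are assumptions you may want to revisit.

```python
import io

import pandas as pd


def load_data(source):
    """Read the CSV (a file path or a file-like object) into a DataFrame."""
    return pd.read_csv(source)


def clean_data(df):
    """Return a copy in which every column is float and nothing is missing."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: map each distinct string to an integer code.
            codes, _ = pd.factorize(out[col])
            out[col] = codes
    # Assumption: a missing value (e.g. a percentage with zero attempts) becomes 0.
    return out.fillna(0.0).astype(float)


# Made-up rows that only mimic the layout of the real file.
TOY_CSV = """Player,Pos,Age,3P%,PTS
Player A,PG,25,0.4,20.1
Player B,C,30,,8.3
Player C,PG,22,0.35,12.0
"""

cleaned = clean_data(load_data(io.StringIO(TOY_CSV)))
print(cleaned)
```

For the real assignment you would pass the path of the downloaded CSV to `load_data` instead of the in-memory string.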
2 Recommended Flow
- Make a function to load the data.
- Clean the data set so that all string entries and missing values are converted to floating-point numbers.
- Make a function to plot the data, to see the distribution of each feature. (Hint: histogram and cumulative distribution function (CDF).) Identify three important features. (Hint: If you don't know which features are important, you can try the three features mentioned in the introduction.) Approximate the percentage of outlier samples.
- Make a function to alter/preprocess the data (delete the features that are not important, normalize the data, and so on).
- Train an Outlier Detection model.
- Score each player based on the model's outlier score. Identify the top three outliers.
- Verify the top three outliers by plotting the players' feature values onto the graphs from step 3. Check the location of the top three outliers with a 3-D scatter plot (the axes of the plot should be the three selected features from step 3).
- Save the outlier scores and the plots used to verify the top three outliers.
- Repeat the previous steps until satisfied.
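The plotting step in the flow above could be sketched as follows, assuming NumPy and Matplotlib. The function names `empirical_cdf` and `plot_feature` are hypothetical, the bin count of 20 is an arbitrary choice, and the sample values are synthetic random numbers rather than real player statistics.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def empirical_cdf(values):
    """Return the sorted values and the fraction of samples <= each value."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


def plot_feature(values, name):
    """Draw a histogram and an empirical CDF of one feature side by side."""
    fig, (ax_hist, ax_cdf) = plt.subplots(1, 2, figsize=(8, 3))
    ax_hist.hist(values, bins=20)
    ax_hist.set_xlabel(name)
    ax_hist.set_ylabel("Number of occurrences")
    x, y = empirical_cdf(values)
    ax_cdf.step(x, 100 * y, where="post")
    ax_cdf.set_xlabel(name)
    ax_cdf.set_ylabel("Percentage of players")
    fig.tight_layout()
    return fig


# Synthetic, skewed values standing in for one feature column.
rng = np.random.default_rng(0)
toy_values = rng.gamma(shape=2.0, scale=4.0, size=200)
fig = plot_feature(toy_values, "PTS")
```

Calling `fig.savefig(...)` on the returned figure stores the plot for the report. A long, sparsely populated tail in the histogram, or a CDF that flattens well before reaching 100%, is one visual cue for estimating the share of outliers.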
3 Problem Formulation
| Input | Data Type | Description |
| --- | --- | --- |
| Rk | Integer | Rank Based on Last Name |
| Player | String | Name of the NBA Player |
| Pos | String | Position of the NBA Player |
| Age | Integer | Age of the NBA Player |
| Tm | String | Team of the NBA Player |
| G | Integer | Number of Games Played |
| GS | Integer | Number of Games Started with the Player on the Court |
| MP | Floating-Point | Minutes Played Per Game |
| FG | Floating-Point | Field Goals Per Game |
| FGA | Floating-Point | Field Goal Attempts Per Game |
| FG% | Floating-Point | Field Goal Percentage |
| 3P | Floating-Point | 3-Point Field Goals Per Game |
| 3PA | Floating-Point | 3-Point Field Goal Attempts Per Game |
| 3P% | Floating-Point | 3-Point Field Goal Percentage |
| 2P | Floating-Point | 2-Point Field Goals Per Game |
| 2PA | Floating-Point | 2-Point Field Goal Attempts Per Game |
| 2P% | Floating-Point | 2-Point Field Goal Percentage |
| eFG% | Floating-Point | Effective Field Goal Percentage |
| FT | Floating-Point | Free Throws Per Game |
| FTA | Floating-Point | Free Throw Attempts Per Game |
| FT% | Floating-Point | Free Throw Percentage |
| ORB | Floating-Point | Offensive Rebounds Per Game |
| DRB | Floating-Point | Defensive Rebounds Per Game |
| TRB | Floating-Point | Total Rebounds Per Game |
| AST | Floating-Point | Assists Per Game |
| STL | Floating-Point | Steals Per Game |
| BLK | Floating-Point | Blocks Per Game |
| TOV | Floating-Point | Turnovers Per Game |
| PF | Floating-Point | Personal Fouls Per Game |
| PTS | Floating-Point | Points Per Game |
Your task is to build an outlier detection model to identify the outlier players in the given CSV. An outlier player can be an extremely good player or an extremely bad player, depending on which side of the distribution graphs the player's stats lie.
You are required to use the Python scikit-learn library to construct your models. You are required to use the following three methods:
- One Class Support Vector Machine (SVM)
- Elliptic Envelope
- Isolation Forest
However, you can experiment with any other algorithms you find interesting! Documentation for more methods is available in the scikit-learn documentation (Outlier Detection in SciKit Learn).
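The three required estimators share the same `fit` / `decision_function` interface in scikit-learn, so one loop can train and score all of them. The sketch below is only one possible setup: `score_players` is a hypothetical helper, the 5% contamination value and its reuse as the `nu` parameter of the One-Class SVM are assumptions, and the data is synthetic (a Gaussian cloud with one planted extreme point), not the NBA file.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM


def score_players(X, contamination=0.05, seed=0):
    """Fit the three detectors and return {model name: decision_function scores}.

    In scikit-learn, lower decision_function values mean "more abnormal",
    and negative values fall on the outlier side of the learned boundary.
    """
    X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance
    models = {
        # Assumption: reuse the contamination guess as nu for the SVM.
        "OneClassSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=contamination),
        "EllipticEnvelope": EllipticEnvelope(
            contamination=contamination, random_state=seed
        ),
        "IsolationForest": IsolationForest(
            contamination=contamination, random_state=seed
        ),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_scaled)
        scores[name] = model.decision_function(X_scaled)
    return scores


# Synthetic stand-in for three selected features: 200 ordinary rows
# plus one planted extreme row at index 200.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[10.0, 10.0, 10.0]]])
scores = score_players(X)

for name, s in scores.items():
    top3 = np.argsort(s)[:3]  # indices of the three lowest scores
    print(name, top3)
```

Because all three scores follow the same sign convention, sorting from the lowest to the highest value ranks players from most to least abnormal for each model.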
For every algorithm, train it on the data and give an outlier score to each player. Identify the outlier players based on the model's outlier scores. Then, pick the top three outliers and verify that they are indeed outliers by using the CDF plot and the 3-D scatter plot. Make a new CSV file, MODEL NAME Scores.csv, with two columns: the player's name and the player's outlier score. The CSV should be sorted from min to max based on the player's outlier score. Submit the CSV files with your report.
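Writing the score file could look like the sketch below, assuming pandas. The helper `save_scores`, the column headers, and the underscore-separated file name are assumptions (the text above writes the name as "MODEL NAME Scores.csv"), and the names and scores are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_scores(names, scores, model_name, directory="."):
    """Write a two-column CSV sorted from the lowest to the highest score."""
    table = pd.DataFrame({"Player": list(names), "Outlier Score": list(scores)})
    table = table.sort_values("Outlier Score", ascending=True)
    # Assumption: underscores in the file name, e.g. IsolationForest_Scores.csv.
    path = Path(directory) / f"{model_name}_Scores.csv"
    table.to_csv(path, index=False)
    return path


# Placeholder names and scores, written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = save_scores(
        ["Player A", "Player B", "Player C"],
        [0.12, -0.30, 0.05],
        "IsolationForest",
        tmp,
    )
    written = pd.read_csv(out_path)

print(written)
```

If the scores come from a scikit-learn `decision_function`, the min-to-max order places the most abnormal players at the top of the file.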
You also must write a brief report answering the following questions:
- Describe in your own words what Outlier Detection and Novelty Detection are. How are they different? (Hint: training samples.) Does our current problem belong to Outlier Detection or Novelty Detection, and why?
- Explain the data preprocessing/transformation methods you applied.
- Show the distribution of the feature values. Use a histogram (Number of Occurrences vs. Feature Values) and a CDF (Percentage of Players vs. Feature Values). Pick three features to train your models. Approximately what percentage of the players are outliers, and how did you come to that conclusion based on the plots?
- Describe in your own words how each of the three algorithms creates a model. How does the model decide the boundary between inliers and outliers? How are the outlier scores calculated? Explain your choice of hyperparameters, if any.
- For each algorithm, pick the top three outliers, verify that they are indeed outliers, and check whether each player is an outlier due to being bad or being good. Explain your reasoning with the model's outlier scores and the CDF plots.
- For each algorithm, show a 3-D scatter plot marking the inliers, the outliers, and the top three outliers' data points. Describe how the inlier region differs from the outlier region.
- Could this model be used to predict outlier players in the next season? Justify your answer and state your assumptions.
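The 3-D scatter plot requested in the report questions above might be drawn along these lines, assuming Matplotlib and scores where negative means "outlier" (the scikit-learn `decision_function` convention). The function name `plot_outliers_3d`, the marker choices, and the synthetic points and scores are illustrative assumptions only.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_outliers_3d(X, scores, feature_names, top_k=3):
    """Scatter three features, marking inliers, outliers, and the top_k outliers."""
    X = np.asarray(X, dtype=float)
    scores = np.asarray(scores, dtype=float)
    top = np.argsort(scores)[:top_k]  # lowest scores = most abnormal
    is_top = np.zeros(scores.size, dtype=bool)
    is_top[top] = True
    is_outlier = (scores < 0) & ~is_top
    is_inlier = (scores >= 0) & ~is_top

    fig = plt.figure(figsize=(6, 5))
    ax = fig.add_subplot(projection="3d")
    groups = [
        (is_inlier, "inliers", "o"),
        (is_outlier, "outliers", "^"),
        (is_top, f"top {top_k} outliers", "*"),
    ]
    for mask, label, marker in groups:
        ax.scatter(X[mask, 0], X[mask, 1], X[mask, 2], marker=marker, label=label)
    ax.set_xlabel(feature_names[0])
    ax.set_ylabel(feature_names[1])
    ax.set_zlabel(feature_names[2])
    ax.legend()
    return fig, top


# Synthetic points and scores, only to exercise the function.
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(50, 3))
toy_scores = np.linspace(-1.0, 1.0, 50)
fig3d, top_idx = plot_outliers_3d(toy_X, toy_scores, ["PTS", "AST", "TRB"])
```

The feature names passed here are just examples of columns from the table; substitute the three features you selected.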
