COMP5318 Assignment 1 - Classification

1. Dataset preparation

The data

The dataset for this assignment is the Breast Cancer Wisconsin dataset. It contains 699 examples described by 9 numeric attributes. There are two classes, class1 and class2. The features are computed from a digitized image of a fine needle aspirate of a breast mass of the subject. Benign breast cancer tumours correspond to class1 and malignant breast cancer tumours correspond to class2.

The dataset should be downloaded from Canvas: breast-cancer-wisconsin.csv. This file includes the attribute (feature) headings and each row corresponds to one individual. Missing attributes in the dataset are recorded with a ?.

Data pre-processing

You will need to pre-process the dataset before you can apply the classification algorithms. Four types of pre-processing are required:

  1. The missing attribute values should be replaced with the mean value of the column using impute.SimpleImputer.
  2. Normalisation of each attribute should be performed using a min-max scaler to normalise the values between [0,1] with preprocessing.MinMaxScaler.
  3. The classes class1 and class2 should be changed to 0 and 1 respectively.
  4. The value of each attribute should be formatted to 4 decimal places using .4f.

The correctness of your pre-processed data will be automatically tested by PASTA. Thus, you need to make sure that you follow the input and output instructions carefully.
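The four pre-processing steps could be sketched as follows. This is a minimal sketch: the helper name preprocess and the use of pandas are illustrative assumptions, not part of the spec.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

def preprocess(path):
    # '?' marks missing values in the raw file
    data = pd.read_csv(path, na_values='?')
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values

    # 1. Replace missing values with the column mean
    X = SimpleImputer(strategy='mean').fit_transform(X)
    # 2. Normalise each attribute to [0, 1]
    X = MinMaxScaler().fit_transform(X)
    # 3. Map class1 -> 0 and class2 -> 1
    y = (y == 'class2').astype(int)
    # 4. Formatting to 4 decimal places is applied at print time (see Output)
    return X, y
```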

2. Classification algorithms with 10-fold cross-validation

You will now apply multiple classifiers to the pre-processed dataset, in particular: Nearest Neighbour, Logistic Regression, Naïve Bayes, Decision Tree, Bagging, Ada Boost and Gradient Boosting. All classifiers should use the sklearn modules from the tutorials. All random states in the classifiers should be set to random_state=0.

You need to evaluate the performance of these classifiers using 10-fold cross-validation from sklearn.model_selection.StratifiedKFold with these options:

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

For each classifier, write a function that accepts the required input and outputs the per-fold scores and the average cross-validation score:

def exampleClassifier(X, y, [options]):
    return scores, scores.mean()

where X contains the attribute values and y contains the class (as in the tutorial exercises).
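Under these requirements, exampleClassifier might look like this; LogisticRegression stands in here only as a placeholder classifier.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def exampleClassifier(X, y):
    # Any of the required classifiers could be substituted here
    clf = LogisticRegression(random_state=0)
    scores = cross_val_score(clf, X, y, cv=cvKFold)
    return scores, scores.mean()
```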

More specifically, the headers of the functions for the classifiers are given below:

K-Nearest Neighbour

def kNNClassifier(X, y, K)

It should use KNeighborsClassifier from sklearn.neighbors (note that sklearn uses the American spelling).
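A minimal sketch of kNNClassifier under the conventions above:

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def kNNClassifier(X, y, K):
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # K is the number of neighbours, read from the parameter file
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=K), X, y, cv=cvKFold)
    return scores, scores.mean()
```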

Logistic Regression

def logregClassifier(X, y)

It should use LogisticRegression from sklearn.linear_model.

Naïve Bayes

def nbClassifier(X, y)

It should use GaussianNB from sklearn.naive_bayes.

Decision Tree

def dtClassifier(X, y)

It should use DecisionTreeClassifier from sklearn.tree, with information gain (the entropy criterion).
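A sketch of dtClassifier; logregClassifier and nbClassifier follow the same structure with LogisticRegression(random_state=0) and GaussianNB() substituted for the tree.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

def dtClassifier(X, y):
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # criterion='entropy' selects information gain
    clf = DecisionTreeClassifier(criterion='entropy', random_state=0)
    scores = cross_val_score(clf, X, y, cv=cvKFold)
    return scores, scores.mean()
```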

Ensembles

def bagDTClassifier(X, y, n_estimators, max_samples, max_depth)
def adaDTClassifier(X, y, n_estimators, learning_rate, max_depth)
def gbClassifier(X, y, n_estimators, learning_rate)

These functions should implement Bagging, Ada Boost and Gradient Boosting using BaggingClassifier, AdaBoostClassifier and GradientBoostingClassifier from sklearn.ensemble. All should combine Decision Trees with information gain.
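These three functions might be sketched as follows. Passing the base tree as the first positional argument keeps the sketch compatible with both older (base_estimator) and newer (estimator) scikit-learn parameter names; note that GradientBoostingClassifier grows its own trees internally and does not take an entropy criterion.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier)

cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

def bagDTClassifier(X, y, n_estimators, max_samples, max_depth):
    # Bagging over decision trees that use information gain (entropy)
    base = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth,
                                  random_state=0)
    clf = BaggingClassifier(base, n_estimators=n_estimators,
                            max_samples=max_samples, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cvKFold)
    return scores, scores.mean()

def adaDTClassifier(X, y, n_estimators, learning_rate, max_depth):
    base = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth,
                                  random_state=0)
    clf = AdaBoostClassifier(base, n_estimators=n_estimators,
                             learning_rate=learning_rate, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cvKFold)
    return scores, scores.mean()

def gbClassifier(X, y, n_estimators, learning_rate):
    clf = GradientBoostingClassifier(n_estimators=n_estimators,
                                     learning_rate=learning_rate,
                                     random_state=0)
    scores = cross_val_score(clf, X, y, cv=cvKFold)
    return scores, scores.mean()
```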

3. Parameter Tuning

For two other classifiers, Linear SVM and Random Forest, we would like to find the best parameters using grid search with 10-fold stratified cross validation (GridSearchCV in sklearn).

The split into training and test subsets should be done using train_test_split from sklearn.model_selection with stratification and random_state=0 (as in the tutorials but with random_state=0).

Hint: You need to pass the StratifiedKFold object as the cv argument to GridSearchCV, not cv=10 as in the tutorials. This ensures that the folds are shuffled with random_state=0.

Write the following functions:

Linear SVM

def bestLinClassifier(X,y)

It should use SVC from sklearn.svm.

The grid search should consider the following values for the parameters C and gamma:

C = {0.001, 0.01, 0.1, 1, 10, 100}
gamma = {0.001, 0.01, 0.1, 1, 10, 100}

The function should print the best parameters found, the best cross-validation accuracy score and the test set accuracy score; see Section 4.
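A sketch of bestLinClassifier under the instructions above. Note that gamma has no effect with a linear kernel, but the grid still searches over it as specified.

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

def bestLinClassifier(X, y):
    # Stratified split with random_state=0, as required
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)
    param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100],
                  'gamma': [0.001, 0.01, 0.1, 1, 10, 100]}
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    grid = GridSearchCV(SVC(kernel='linear'), param_grid, cv=cvKFold)
    grid.fit(X_train, y_train)
    # Exactly 4 output lines, as described in Section 4
    print(grid.best_params_['C'])
    print(grid.best_params_['gamma'])
    print("{:.4f}".format(grid.best_score_))
    print("{:.4f}".format(grid.score(X_test, y_test)))
```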

Random Forest

def bestRFClassifier(X,y)

It should use RandomForestClassifier from sklearn.ensemble with information gain and max_features set to sqrt.

The grid search should consider the following values for the parameters n_estimators and max_leaf_nodes:

n_estimators = {10, 30}
max_leaf_nodes = {4, 16}

The function should print the best parameters found, the best cross-validation accuracy score and the test set accuracy score; see Section 4.
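A corresponding sketch for bestRFClassifier:

```python
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.ensemble import RandomForestClassifier

def bestRFClassifier(X, y):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)
    cvKFold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    param_grid = {'n_estimators': [10, 30], 'max_leaf_nodes': [4, 16]}
    # Information gain (entropy) and max_features='sqrt', as required
    rf = RandomForestClassifier(criterion='entropy', max_features='sqrt',
                                random_state=0)
    grid = GridSearchCV(rf, param_grid, cv=cvKFold)
    grid.fit(X_train, y_train)
    print(grid.best_params_['n_estimators'])
    print(grid.best_params_['max_leaf_nodes'])
    print("{:.4f}".format(grid.best_score_))
    print("{:.4f}".format(grid.score(X_test, y_test)))
```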

4. Submission details: PASTA

This assignment is to be submitted electronically via the PASTA submission system.

How to submit to PASTA

Your program will need to be named MyClassifier. If you use Jupyter notebook to write your code, please do the following:

  1. Ensure that you have a main method that is able to parse 3 command line arguments. PASTA will run your code with 3 different arguments. Details about the 3 arguments are given below.
  2. Ensure that you export your Jupyter notebook as Python code and only submit the Python code to PASTA.

Before submitting to PASTA please make sure that all your Python programs (MyClassifier.py and all other supporting .py programs, if you have any) are zipped together. If you only have 1 Python program (MyClassifier.py), you do not need to zip it. Just submit that file to PASTA.

If you are zipping, please do not zip the folder containing your MyClassifier.py program (and your other supporting programs, if any). Zip only the .py files. To do this, select the MyClassifier.py file (and your other supporting scripts if any), right click, and compress/zip them. Then submit the resulting zip file to PASTA.

Detailed submission instructions:

  • Find Assignment 1 and press Submit (the red icon on the right). Then press Choose File and attach the file, then press I accept after reading the University policy on academic honesty and agreeing with it.
  • If you see a message indicating that your code is queued for testing, this means that your code has been uploaded successfully.
  • If you see a red error box, you need to fix any problems indicated before your submission will be accepted.
  • Once your program has been tested, the page will tell you to refresh for results, so refresh the page. You should see green and red boxes. Each box corresponds to a test; a red box indicates that your program has failed the respective test and a green box indicates that your program has passed the test.
  • If you have red boxes, you can see which tests were not passed by clicking on Assignment 1. Correct the errors (go back to Jupyter, re-write your code and test it carefully) and then submit again in PASTA.

Important:

  • Do not treat PASTA as your development environment! You must first test your code on your laptop or desktop computer to make sure it runs without errors, and only after that submit to PASTA. If you don't do this, there will be too many submissions in the queue and the waiting time to run your code will become very long. In this case, we may need to limit the number of submissions per student; currently it is unlimited.
  • Start working on the assignment as soon as possible and do not delay the submission till the last moment as PASTA will get very busy and you risk being late.

Input

Your program should take 3 command line arguments:

  1. The first argument is the path to the data file.
  2. The second argument is the name of the algorithm to be executed, or the option to print the pre-processed dataset:
    1. NN for Nearest Neighbour.
    2. LR for Logistic Regression.
    3. NB for Naïve Bayes.
    4. DT for Decision Tree.
    5. BAG for Ensemble Bagging DT.
    6. ADA for Ensemble ADA boosting DT.
    7. GB for Ensemble Gradient Boosting.
    8. RF for Random Forest.
    9. SVM for Linear SVM.
    10. P for printing the pre-processed dataset
  3. The third argument is optional, and should only be supplied to algorithms which require parameters, namely NN, BAG, ADA and GB. It is the path to the file containing the parameter values for the algorithm. The file should be formatted as a csv file like in the following examples:
    1. NN (note the capital K); an example for 5-Nearest Neighbour:

K

5

  2. BAG:

n_estimators,max_samples,max_depth

100,100,2

  3. ADA:

n_estimators,learning_rate,max_depth

100,0.2,3

  4. GB:

n_estimators,learning_rate

100,0.2

For algorithms which do not require any parameters (LR, NB, DT, RF, SVM, P), the third argument should not be supplied.

The file paths (the first and third arguments) represent files that will be supplied to your program for reading. You can test your submission using any files you like, but PASTA will provide your submission with its own files for testing, so do not assume any specific filenames.

Your program must be able to correctly infer X and y from the file. Please do not hard-code the number of features and examples. For the last few tests we will use a dataset with a different number of features and examples than the given breast cancer dataset. The new dataset will contain a header line, and the last column will correspond to the class, as in the given breast cancer dataset.
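The argument handling and the X/y inference might be sketched as follows; the helper names parse_args and load_data are illustrative, not required by the spec.

```python
import sys
import pandas as pd

def parse_args(argv):
    """Return (data_path, algorithm, params-dict-or-None) from the command line."""
    data_path, algorithm = argv[1], argv[2]
    params = None
    if len(argv) > 3:
        # Parameter file: one header row, one value row (e.g. "K\n5")
        params = pd.read_csv(argv[3]).iloc[0].to_dict()
    return data_path, algorithm, params

def load_data(path):
    # Header line present; last column is the class. No hard-coded shapes.
    data = pd.read_csv(path)
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values
    return X, y
```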

The following examples show how your program would be run:

  1. We want to run the k-Nearest Neighbour classifier, the data is in a file called breast-cancer-wisconsin-normalised.csv, and the parameters are stored in a file called param.csv:

python MyClassifier.py breast-cancer-wisconsin-normalised.csv NN param.csv

  2. We want to run Naïve Bayes and the data is in a file called breast-cancer-wisconsin-normalised.csv:

python MyClassifier.py breast-cancer-wisconsin-normalised.csv NB

  3. We want to run the data pre-processing task and the data is in a file called breast-cancer-wisconsin.csv:

python MyClassifier.py breast-cancer-wisconsin.csv P

Only option P assumes non-pre-processed input data (and it needs to pre-process and print this data). All other options (NN, LR, etc.) assume already pre-processed data (i.e. normalised, scaled, missing values replaced, etc.) and don't need to do any data pre-processing.

Output

Your program will output to standard output (i.e. the console).

Pre-processing task

For option P, you will need to output your pre-processed data onto the console using the print command.

Let's assume your normalised data looks like the following:

ClumpThickness  Uniformity of Cell Size  Uniformity of Cell Shape  Marginal Adhesion  Single Epithelial Cell Size  Bare Nuclei  Bland Chromatin  Normal Nucleoli  Mitoses  Class
0.1343  0.4333  0.5432  0.8589  0.3737  0.9485  0.4834  0.9456  0.4329  0
0.1345  0.4432  0.4567  0.4323  0.1111  0.3456  0.3213  0.8985  0.3456  1
0.4948  0.4798  0.2543  0.1876  0.9846  0.3345  0.4567  0.4983  0.2845  0

Then your program output should look like this:

0.1343,0.4333,0.5432,0.8589,0.3737,0.9485,0.4834,0.9456,0.4329,0
0.1345,0.4432,0.4567,0.4323,0.1111,0.3456,0.3213,0.8985,0.3456,1
0.4948,0.4798,0.2543,0.1876,0.9846,0.3345,0.4567,0.4983,0.2845,0

You must print the feature values to 4 decimal places as in the example above, e.g. 0.1 should be printed as 0.1000. The class value is an integer.

Note also that there is no newline character after the last line. To print the last line, pass an empty string as the end parameter of the print statement:

print(data, end='')
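Putting the formatting and the end parameter together, the printing for option P might look like this; print_preprocessed is an illustrative helper name.

```python
def print_preprocessed(X, y):
    # Each row: features to 4 decimal places, then the integer class label
    rows = []
    for features, label in zip(X, y):
        rows.append(','.join('{:.4f}'.format(v) for v in features)
                    + ',' + str(int(label)))
    # '\n'.join leaves no newline after the last line
    print('\n'.join(rows), end='')
```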

Classifiers with 10-fold cross validation task

Your classifier should only print the mean accuracy score with a precision of 4 decimal places using .4f. Note that .4f does rounding too. For example, if your mean accuracy is 0.987689, your program output should look as follows:

0.9877

Note: There is no new line character after the accuracy. See above how to print like this.

Parameter tuning task

For Linear SVM, your program should output exactly 4 lines: the first line contains the optimal C value, the second line the optimal gamma value, the third line the best cross-validation accuracy score formatted to 4 decimal places using .4f, and the fourth line the test set accuracy score, also formatted to 4 decimal places. For instance, if the optimal C and gamma values are 0.001 and 0.1 respectively, and the two accuracies are 0.999874 and 0.952512, your program output should look like this:

0.001

0.1

0.9999

0.9525

For Random Forest, your program should output exactly 4 lines: the first line contains the optimal n_estimators, the second line the optimal max_leaf_nodes, the third line the best cross-validation accuracy score formatted (rounded) to 4 decimal places using .4f, and the fourth line the test set accuracy score, also formatted to 4 decimal places. For instance, if the optimal n_estimators and max_leaf_nodes are 10 and 4 respectively, and the two accuracies are 0.997623 and 0.87243, your program output should look like this:

10

4

0.9976

0.8724
