1 Experiments with Machine Learning
1.1 scikit-learn
For this assignment, you will use the scikit-learn machine learning framework to experiment with different machine learning algorithms and different data sets. The focus of this assignment lies more on the experimentation and analysis than on the implementation.
scikit-learn is an open-source machine learning library for Python (see http://scikit-learn.org/stable/), which provides a uniform interface to a variety of learning algorithms as well as built-in datasets. Plenty of documentation and code examples are available online.
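All scikit-learn estimators share the same fit/predict interface, so code written for one classifier carries over to the others with almost no changes. The following minimal sketch illustrates this interface on one of the library's built-in datasets (the dataset and classifier here are illustrative, not the ones used in this assignment):

```python
# Minimal sketch of scikit-learn's common estimator interface:
# every classifier exposes fit(), predict() and score().
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)  # a built-in demo dataset, not the assignment data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB()                   # any other estimator can be swapped in here
clf.fit(X_train, y_train)            # train on the training split
print(clf.score(X_test, y_test))     # mean accuracy on the held-out split
```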
1.2 Data Sets
You must use the 2 datasets provided on Moodle (see the zip file DataSet-Release1). Both datasets are about the classification of black & white images of size 32 × 32, where each image represents a single character (for example, the character A).
Dataset 1 contains images of the 26 uppercase letters [A–Z].
Dataset 2 contains images of 10 Greek letters.
Each character in the datasets is represented by an index, as indicated in the table below:
Dataset 1:
index | char. | index | char. | index | char.
0 | A | 10 | K | 20 | U
1 | B | 11 | L | 21 | V
2 | C | 12 | M | 22 | W
3 | D | 13 | N | 23 | X
4 | E | 14 | O | 24 | Y
5 | F | 15 | P | 25 | Z
6 | G | 16 | Q | |
7 | H | 17 | R | |
8 | I | 18 | S | |
9 | J | 19 | T | |
Dataset 2:
index | char.
0 | π (pi)
1 | α (alpha)
2 | β (beta)
3 | σ (sigma)
4 | γ (gamma)
5 | δ (delta)
6 | λ (lambda)
7 | ω (omega)
8 | μ (mu)
9 | ξ (xi)
Each dataset is in .csv format, where each row is a data instance. Each instance is composed of 1024 binary features (the 32 × 32 pixel values) followed by its class (the index). Each dataset contains 3 splits:
- training: to be used for training your models.
- validation: to be used for validating/experimenting with your models.
- test: to be used to report your final output.
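As a concrete starting point, the sketch below shows one way to load a split, separate the 1024 features from the class label, and plot the class distribution requested in Section 2. The file name is a placeholder; adjust it to the actual split names inside DataSet-Release1:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file name -- use the actual split names from DataSet-Release1.
df = pd.read_csv("train_1.csv", header=None)

X = df.iloc[:, :1024].to_numpy()  # the 1024 binary pixel features
y = df.iloc[:, 1024].to_numpy()   # the class index (last column)

# Class distribution, i.e. the number of instances per class.
classes, counts = np.unique(y, return_counts=True)
plt.bar(classes, counts)
plt.xlabel("class index")
plt.ylabel("number of instances")
plt.show()
```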
2 Your Task
For each dataset, write the necessary code to:
- Plot the distribution of the number of instances in each class.
- Run 6 different ML models:
- GNB: a Gaussian Naive Bayes Classifier, with default parameter values.
- Base-DT: a baseline Decision Tree using entropy as the decision criterion and default values for the rest of the parameters.
- Best-DT: a better performing Decision Tree found by performing a grid search to find the best combination of hyper-parameters (a sketch of such a grid search is given at the end of this section). For this, you need to experiment with the following parameter values:
- splitting criterion: gini and entropy
- maximum depth of the tree: 10 and no maximum
- minimum number of samples to split an internal node: experiment with values of your choice
- minimum impurity decrease: experiment with values of your choice
- class weight: None and balanced
- PER: a Perceptron, with default parameter values.
- Base-MLP: a baseline Multi-Layered Perceptron with 1 hidden layer of 100 neurons, sigmoid/logistic as activation function, stochastic gradient descent as solver, and default values for the rest of the parameters.
- Best-MLP: a better performing Multi-Layered Perceptron found by performing grid search to find the best combination of hyper-parameters. For this, you need to experiment with the following parameter values:
- activation function: sigmoid, tanh, relu and identity
- 2 network architectures of your choice: e.g. 2 hidden layers with 30 + 50 nodes, or 3 hidden layers with 10 + 10 + 10 nodes
- solver: Adam and stochastic gradient descent
- For each model and each dataset, write the necessary code to generate a csv (comma-separated values) output file that contains the classification output and the performance of the model (a sketch of this step is given at the end of this section). This output file should be named [model name]-[dataset].csv. Therefore you should generate 12 files:
GNB-DS1 Base-DT-DS1 Best-DT-DS1 PER-DS1 Base-MLP-DS1 Best-MLP-DS1
GNB-DS2 Base-DT-DS2 Best-DT-DS2 PER-DS2 Base-MLP-DS2 Best-MLP-DS2
These files should contain:
- the row number of the instance, followed by a comma, followed by the index of the predicted class of that instance, as in:
1,24 // if your model's predicted class for instance 1 is 24 (Y)
2,25 // if your model's predicted class for instance 2 is 25 (Z)
3,4 // if your model's predicted class for instance 3 is 4 (E)
- a plot of the confusion matrix
- the precision, recall, and f1-measure for each class
- the accuracy, macro-average f1 and weighted-average f1 of the model
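As referenced above, here is a minimal sketch of what the Best-DT and Best-MLP grid searches could look like using scikit-learn's GridSearchCV. The candidate values for min_samples_split and min_impurity_decrease are illustrative placeholders (the assignment leaves the choice to you), and X_train/y_train are assumed to come from a loader such as the one sketched in Section 1.2:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Best-DT: the grid mirrors the hyper-parameters listed above; the
# min_samples_split and min_impurity_decrease candidates are examples only.
dt_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [10, None],
    "min_samples_split": [2, 5, 10],       # values of your choice
    "min_impurity_decrease": [0.0, 1e-3],  # values of your choice
    "class_weight": [None, "balanced"],
}
best_dt = GridSearchCV(DecisionTreeClassifier(), dt_grid)
best_dt.fit(X_train, y_train)              # X_train/y_train from your loader
print(best_dt.best_params_)

# Best-MLP: note that "sigmoid" corresponds to scikit-learn's "logistic"
# activation, and the two architectures here are examples of your own choice.
mlp_grid = {
    "activation": ["logistic", "tanh", "relu", "identity"],
    "hidden_layer_sizes": [(30, 50), (10, 10, 10)],
    "solver": ["adam", "sgd"],
}
best_mlp = GridSearchCV(MLPClassifier(), mlp_grid)
best_mlp.fit(X_train, y_train)
print(best_mlp.best_params_)
```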
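Likewise, here is one possible shape for the output-file generation and the requested metrics, using standard sklearn.metrics calls. The model, the test variables and the file names are placeholders, and the metrics are printed here for brevity; the assignment expects them inside the same output file:

```python
import csv
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_recall_fscore_support)

y_pred = model.predict(X_test)  # `model` is any one of the 6 trained classifiers

# Row number and predicted class index per line, e.g. "1,24".
with open("GNB-DS1.csv", "w", newline="") as f:  # placeholder file name
    writer = csv.writer(f)
    for i, pred in enumerate(y_pred, start=1):
        writer.writerow([i, pred])

# Plot of the confusion matrix.
cm = confusion_matrix(y_test, y_pred)
plt.imshow(cm)
plt.xlabel("predicted class")
plt.ylabel("true class")
plt.savefig("GNB-DS1-confusion.png")

# Per-class precision, recall and f1-measure...
precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred)
for c in range(len(precision)):
    print(c, precision[c], recall[c], f1[c])

# ...followed by accuracy, macro-average f1 and weighted-average f1.
print(accuracy_score(y_test, y_pred))
print(f1_score(y_test, y_pred, average="macro"))
print(f1_score(y_test, y_pred, average="weighted"))
```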
3 Deliverables
The submission of the assignment will consist of 2 deliverables:
- The code & output files
- The demo (8 min presentation & Q/A)
3.1 The Code & Output files
Submit all files necessary to run your code in addition to a readme.md which will contain specific and complete instructions on how to run your experiments. You do not need to submit the datasets. If the instructions in your readme file do not work, are incomplete or a file is missing, you will not be given the benefit of the doubt.
Generate one output file for each model and each dataset's test set, as indicated in Section 2.
3.2 The Demos
You will have to demo your assignment for 12 minutes. Regardless of the demo time, you will demo the program that was uploaded as the official submission on or before the due date. The schedule of the demos will be posted on Moodle. The demos will consist of 2 parts: a presentation (8 minutes) and a Q/A part (4 minutes). Note that the demos will be recorded.
3.2.1 The Presentation
Prepare an 8-minute presentation to analyse and compare the performance of your models. The intended audience of your presentation is your TAs. Hence there is no need to explain the theory behind the models. Your presentation should focus on your work and the comparison of the performance of the models when the hyper-parameters are modified.
Your presentation should contain at least the following:
- An analysis of the initial datasets given on Moodle. If there is anything particular about these datasets that might have an impact on the performance of some models, explain it.
- An analysis of the results of all the models with the data sets. In particular, compare and contrast the performance of each model with one another, and with the datasets. Please note that your presentation must be analytical. This means that in addition to stating the facts (e.g. the macro-F1 has this value), you should also analyse them (i.e. explain why some metric seems more appropriate than another, or why your model did not do as well as expected). Tables, graphs and contingency tables to back up your claims would be very welcome here.
- In the case of team work, a description of the responsibilities and contributions of each team member.
Any material used for the presentation (slides, etc.) must be uploaded on EAS before the due date.
3.2.2 Q/A
After your presentation, your TA will proceed with a 4-minute question period. Each student will be asked questions on the code/assignment, and he/she will be required to answer the TA satisfactorily. In particular, each member should know what each parameter that you experimented with represents and its effect on the performance. Hence every member of the team is expected to attend the demo.
In addition, your TA may give you a new dataset and ask you to train or run your models on this dataset. The output file generated by your program will have to be uploaded on EAS during your demo.
4 Evaluation Scheme
Students in teams can be assigned different grades based on their individual contribution to the project.
Individual grades will be based on:
- a peer-evaluation done after the submission.
- the contribution of each student as indicated on GitHub.
- the Q/A of each student during the demo.
The team grade will be based on:
Component | Criteria | Points
Code | functionality, proper use of the datasets, design, programming style | 6
Output with initial datasets | correctness and format | 1.5
Demo Presentation | depth of the analysis, clarity and conciseness, presentation, time-management | 4
Demo Q/A | correct and clear answers to questions, knowledge of the program | 2
Output with demo-dataset | correctness and format | 1.5
Total | | 15