[SOLVED] html math python network !

Department of Mathematics
MATH96007 MATH97019 MATH97097 Methods for Data Science
Years 3/4/5
Coursework 2 Random Forests, SVMs, Neural Networks
Deadline: Friday, 29 November 2019, 5 pm
General instructions
The goal of this project is to analyse a data set using some of the tools introduced in the lectures, while also following your own initiative. Coursework tasks are different from exams: they can be more open-ended and may require going beyond what we have covered explicitly in lectures. Initiative and creativity are important, as is the ability to pull together the course content, draw new links between subjects and back up your analysis with relevant computations. The quality of presentation and communication is very important, so use good combinations of tables and figures to present your results.
Submission instructions can be found at the end of this document. You must write a Jupyter notebook, execute it, and submit both the notebook and the html file exported from it. You can produce a notebook in Google Colab, or you are free to use a local Jupyter notebook through the Anaconda environment installed on your computer. Before you produce your html, make sure all cells are executed, so that their output appears in your html file.
To gain the marks for each part you are required to: (1) complete the task as described, (2) comment your code so that we can understand each step, and (3) provide a brief written introduction to the task explaining and discussing what you did.
This coursework is worth 35% of your total mark, and contains a mastery component for MSc and MSci students.
Overview
In this second coursework, you will work with a data set that contains various variables describing used cars. After your studious and successful time at Imperial College, you landed your dream job as a used car dealer. Actually, you do not know much about used cars, but, luckily, you have a record of the decisions made by your predecessor, in which she rated a series of cars as 'unacceptable', 'acceptable', 'good', or 'very good'. Fortunately, your predecessor also kept detailed records to support these decisions based on the following descriptors:
buying: buying price
maint: price of the maintenance
doors: number of doors
persons: capacity in terms of persons to carry
lug_boot: the size of luggage boot
safety: estimated safety of the car
In an effort to impress your manager, you will compare different classification methods that can work with such multi-dimensional data.
You can find the data on Blackboard.

Preliminaries
You should do some data exploration, perhaps reusing some of the code from CW1. This can help inform your explanations below but you do not need to report your data exploration in this CW.
1. You are expected to prepare and clean up the data sufficiently without our specific instructions. Just write down the Python commands you used for the preparation of the dataset.
2. A key step in your pre-processing is the standardisation of the descriptors of the data set, which you can do using sklearn. Explain briefly why this is needed and write down your commands.
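As an illustration of the standardisation step, the following is a minimal sketch using sklearn's `StandardScaler`. The arrays `X_train` and `X_test` are hypothetical stand-ins for descriptors that have already been encoded as numbers; they are not the coursework data. The scaler is fitted on the training set only and then reused on the test set, so no test-set information enters the pre-processing.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical, already numerically encoded descriptors (rows = cars).
X_train = np.array([[3.0, 2.0, 4.0],
                    [1.0, 0.0, 2.0],
                    [2.0, 1.0, 5.0],
                    [0.0, 3.0, 3.0]])
X_test = np.array([[1.0, 1.0, 4.0],
                   [3.0, 2.0, 2.0]])

scaler = StandardScaler()
# Learn the per-column mean and standard deviation from the training set only.
X_train_std = scaler.fit_transform(X_train)
# Apply the same training-set statistics to the test set.
X_test_std = scaler.transform(X_test)
```

After this step each training column has zero mean and unit variance, so descriptors measured on different scales contribute comparably to distance-based and gradient-based methods.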
On Blackboard you are provided with a training set (containing 80% of the data) and a test set (containing 20% of the data). Throughout the coursework below:
You should optimise hyperparameters using stratified k-fold cross validation of your training set. Make sure that the k-folds are properly balanced by using the correct sklearn options.
Do not use your test set for the optimisation of hyperparameters. The test set should only be used to test the final generalisability of your optimised model.
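One possible way to obtain balanced folds is sklearn's `StratifiedKFold`, sketched below on synthetic data. The feature matrix `X` and the imbalanced labels `y` are made up for illustration; only the training portion of the real data would be passed to the splitter.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for a training set with imbalanced classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = np.array([0] * 70 + [1] * 20 + [2] * 10)

# shuffle=True with a fixed seed gives reproducible, class-balanced folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_counts = []
for train_idx, val_idx in skf.split(X, y):
    # Count how many samples of each class land in this validation fold.
    fold_counts.append(np.bincount(y[val_idx], minlength=3))
```

Each validation fold keeps the class proportions of the full training set, which is what makes the cross-validated accuracy comparable across folds.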
Task 1: Random Forest Classifier (25 marks)
Train a random forest classifier on the training set to optimise the accuracy of your validation predictions using 5-fold stratified cross validation.
You should use the 5-fold cross validation to explore and optimise, over suitable ranges, the following hyperparameters: (i) number of decision trees; (ii) depth of trees; (iii) maximum number of descriptors (features) randomly chosen at each split.
Note: Although other measures of performance, such as precision, recall, etc., could (and should) be used for comparison, here we concentrate only on accuracy for simplicity.
Explain and document your choice for the best RF model. Explain which of the hyperparameters have a bigger impact on performance.
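A hyperparameter search of this kind could be sketched with sklearn's `GridSearchCV`, as below. The data come from `make_classification` and the grid values are arbitrary placeholders, not recommended ranges; both are assumptions made only so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 4-class data standing in for the training set.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# Placeholder ranges for the three hyperparameters named in the task.
param_grid = {
    "n_estimators": [10, 50],    # number of decision trees
    "max_depth": [2, 5, None],   # depth of trees (None = grow fully)
    "max_features": [1, 2, 4],   # descriptors considered at each split
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=cv)
search.fit(X, y)

best_params = search.best_params_   # best combination found on this grid
best_score = search.best_score_     # its mean validation accuracy
```

The fitted object also exposes `search.cv_results_`, whose mean validation scores per combination can be tabulated or plotted to see which hyperparameter moves the accuracy most.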
Task 2: Support Vector Machines (25 marks)
Train a support vector machine (SVM) on your training data using three different kernel functions: (i) linear; (ii) polynomial; (iii) RBF. Each of these kernels has a few hyperparameters, which you should explore and tune using 5-fold cross validation to find the kernel (and hyperparameters) with the highest validation accuracy.
Explain and document your choice for the best SVM model (kernel and hyperparameters) for our data.
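The kernel comparison could be organised as a single grid search over a list of per-kernel grids, sketched here with sklearn's `SVC`. The synthetic data and the candidate values for `C`, `degree`, `coef0` and `gamma` are illustrative assumptions, and the data are taken to be already standardised.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic 4-class data standing in for the standardised training set.
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# One sub-grid per kernel, each listing only the hyperparameters it uses.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["poly"], "C": [0.1, 1, 10],
     "degree": [2, 3], "coef0": [0.0, 1.0]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10],
     "gamma": ["scale", 0.1, 1.0]},
]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=cv)
search.fit(X, y)

best_kernel = search.best_params_["kernel"]
best_score = search.best_score_   # mean validation accuracy of the winner
```

Keeping the kernels in separate sub-grids avoids fitting meaningless combinations, such as a `degree` value for the RBF kernel.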
Task 3: Neural Networks (30 marks)
In Task 3, you will explore how a feed-forward neural network performs on the training set.
Using Pytorch, train a neural network to classify each car in the training set based on the given descriptors.
Setup of the network: Your network should have two hidden layers, each with 200 neurons. You should use ReLU as your activation function. Fix the optimisation method to be stochastic gradient descent (SGD), and define the loss function as cross-entropy. You should train on batches of 64 data points with a learning rate of 0.01 for 120 epochs.
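The setup described above might look roughly as follows in Pytorch. This is a sketch under assumptions: the tensors `X` and `y` are synthetic, the input size of 6 assumes one numeric feature per descriptor, and the helper names `make_model` and `train` are hypothetical.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic stand-in data: 6 features, 4 classes defined by two feature signs.
X = torch.randn(256, 6)
y = (X[:, 0] > 0).long() + 2 * (X[:, 1] > 0).long()

def make_model(n_in=6, n_hidden=200, n_out=4):
    # Two hidden layers of 200 neurons each, with ReLU activations.
    return nn.Sequential(
        nn.Linear(n_in, n_hidden), nn.ReLU(),
        nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        nn.Linear(n_hidden, n_out),
    )

def train(model, X, y, lr=0.01, batch_size=64, epochs=120):
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()  # expects raw logits and integer labels
    history = []
    for _ in range(epochs):
        running = 0.0
        for xb, yb in loader:
            optimiser.zero_grad()
            loss = loss_fn(model(xb), yb)
            loss.backward()
            optimiser.step()
            running += loss.item() * len(xb)
        history.append(running / len(X))  # mean training loss for this epoch
    return history

model = make_model()
history = train(model, X, y)
```

The learning rate and batch size are exposed as arguments so that the same loop can be rerun with other settings and the resulting loss curves in `history` compared.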
3.1 Show and document how changing the learning rate to: (i) 0.0005 and (ii) 0.95 leads to poor convergence.
3.2 Show and document how changing the batch size to: (i) 2 and (ii) 256 leads to poor convergence and performance. Explain the reasons.
3.3 In Pytorch, apply dropout regularisation to the second layer of the NN, and tune the dropout rate to optimise the validation accuracy of the NN.

Explain and document your choice for the best NN model (dropout rate) for the given architecture.
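One reading of dropout on the second layer is to place an `nn.Dropout` module after the second hidden layer's activation, as in the sketch below. The helper name `make_dropout_model`, the rate `p=0.5` and the random input are assumptions for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)

def make_dropout_model(p, n_in=6, n_hidden=200, n_out=4):
    return nn.Sequential(
        nn.Linear(n_in, n_hidden), nn.ReLU(),
        nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        nn.Dropout(p=p),  # randomly zeroes second-hidden-layer activations
        nn.Linear(n_hidden, n_out),
    )

model = make_dropout_model(p=0.5)
x = torch.randn(8, 6)

# In training mode a fresh random mask is drawn on every forward pass.
model.train()
out_a, out_b = model(x), model(x)

# In evaluation mode dropout is disabled, so the output is deterministic.
model.eval()
with torch.no_grad():
    out_c, out_d = model(x), model(x)
```

Because dropout behaves differently in the two modes, validation accuracy should be computed after calling `model.eval()`, and `model.train()` should be called again before the next training epoch.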

Task 4: Discussion (20 marks)
4.1 Compare the performance of the three classifiers you have obtained in Tasks 1, 2 and 3 by applying them to the test data set. You should report the accuracy, recall, precision, and F1 score and any other relevant score derived from the confusion matrix. Examine your results in relation to the performance obtained for the training set.
4.2 Discuss the suitability of each of the three methods for our task. Comment on their generalisability to the test data, computational cost, and the appropriateness given the dimensionality of the data and any other insights based on your study of the descriptors of the data set. You should base your evaluations on evidence and computations.
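The scores listed in 4.1 can all be computed with `sklearn.metrics`; a minimal sketch follows. The label lists `y_true` and `y_pred` are invented, and macro averaging is only one of several reasonable choices for multi-class precision, recall and F1.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Invented test labels and predictions for a 4-class problem.
y_true = [0, 0, 0, 1, 1, 2, 2, 3]
y_pred = [0, 0, 1, 1, 1, 2, 0, 3]

acc = accuracy_score(y_true, y_pred)

# Macro averaging gives every class equal weight, whatever its frequency.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
```

Running the same calls on training-set predictions gives the figures needed to compare training and test performance side by side.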
Task 5: (for MSc and MSci students only) (25 marks)
5.1 (20 marks) Consider the NN model you implemented in Task 3. Optimise a new NN where the architecture is changed to have 5 hidden layers with 80 neurons each. Compare this deep network to the shallow network in Task 3 in terms of performance, training and computational cost. Feel free to provide additional evidence and computations to support your analysis.
5.2 (5 marks) Consider the NN model you implemented in Task 3. Optimise an NN with the same architecture as in Task 3 but changing the activation units from ReLU to sigmoidal. Compare the performance of the ReLU and sigmoidal NNs and, specifically, their speed of convergence.
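The architectures in 5.1 and 5.2 differ from Task 3 only in depth, width and activation, so one configurable builder could cover all three, as sketched below. The helper names `make_mlp` and `n_params` are hypothetical, and the sizes `n_in=6` and `n_out=4` assume one numeric feature per descriptor and the four rating classes.

```python
import torch
from torch import nn

def make_mlp(hidden_sizes, activation=nn.ReLU, n_in=6, n_out=4):
    # Stack Linear + activation pairs, then a final linear output layer.
    layers, prev = [], n_in
    for h in hidden_sizes:
        layers += [nn.Linear(prev, h), activation()]
        prev = h
    layers.append(nn.Linear(prev, n_out))
    return nn.Sequential(*layers)

def n_params(model):
    # Total number of weights and biases, a rough proxy for model size.
    return sum(p.numel() for p in model.parameters())

shallow_relu = make_mlp([200, 200])                          # Task 3
deep_relu = make_mlp([80] * 5)                               # Task 5.1
shallow_sigmoid = make_mlp([200, 200], activation=nn.Sigmoid)  # Task 5.2

sizes = {"shallow": n_params(shallow_relu), "deep": n_params(deep_relu)}
```

Parameter counts are only one aspect of computational cost; wall-clock training time per epoch and the number of epochs needed to converge would have to be measured separately.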
Submission instructions
You will upload two documents to Blackboard, wrapped into a single zip file:
1) Your notebook as an ipynb file.
2) Your notebook exported as an html file.
You are also required to comply with these specific requirements:
Name your zip file SurnameCID.zip, e.g. Smith1234567.zip. Do not submit multiple files.
Your ipynb file must produce all plots that appear in your html file, i.e., make sure you have run all cells in the notebook before exporting the html.
Use clear headings in your html to indicate the answers to each question, e.g. Task 1.
To avoid last-minute problems with your online submission, we recommend that you upload versions of your coursework early, before the deadline. You will be able to update your CW until the deadline, but this provides you with a safety backup.
Note: There seem to be some issues with particular browsers (or settings like cookie or popup blockers) when submitting to Turnitin. If the submission hangs, please try another browser. You should also check that your files are not empty or corrupted after submission.
Needless to say, projects must be your own work.
You may discuss the analysis with your colleagues but the code, writing, figures and analysis must be your own. The Department may use code profiling and tools such as Turnitin to check for plagiarism, and plagiarism cannot be tolerated.
Copying and plagiarism, if they occur, may force the Department to stop offering project-based courses such as this one.
