School of Computer Science 2019-2020
Assessment Item 2
Title: CMP3749M Big Data
Indicative Weighting: 50%
Learning Outcomes
On successful completion of this component a student will have demonstrated competence in the following areas:
LO1 critically appraise and apply Big Data Analytics concepts, tools and techniques
LO2 apply data science toolkits in a range of applications and solve real-world problems
Overview
As data scientists, our main objective is to organize and analyse data regardless of how big or small the data set is, often employing typical data science software. The analysis made by a data scientist must be easy to understand for all stakeholders, including those who have no knowledge of data science. The objective of this assignment is to show that you are able to analyse a data set in a way that guides stakeholders to understand the data. The data can be downloaded from Blackboard. The data needs to be analysed using the data science tools and techniques you were taught in class and detailed in the Report Guidance (see below). You are required to write and submit a report in which you answer all questions, discuss how you completed the tasks, and provide snippets of the MATLAB code you've developed for these tasks. You are expected to go into sufficient depth to demonstrate knowledge and critical understanding of the relevant processes involved. 50% of available marks are through the completion of the written report, with clear and separate marking criteria for each required report section.
Report Guidance
You must supply a written report containing two distinct sections that provide a full and reflective account of the processes undertaken. You are expected to answer all questions in each step in each section in detail, perform all analysis on your own (i.e. individual work), and provide all MATLAB scripts in one ZIP file.
The data:
This data table contains clinical features measured for 64 patients with cancer and 52 healthy controls. The last column in the table indicates the presence or absence of cancer, where the number '1' represents healthy controls and '2' represents patients with cancer. All the other columns are clinical features. You are asked to analyse this data and discuss whether these clinical features could potentially be used as a biomarker of this particular cancer or, equivalently, whether these features could be used to predict the presence of the cancer.
Section I: Data Summary, Understanding and Visualisation (30%)
Download the data set named 'clinicalfeatures.xlsx' from Blackboard and save it anywhere on your computer. You need to write MATLAB code to accomplish the following tasks.
As a first step, you need to load the data from the file 'clinicalfeatures.xlsx' into the MATLAB Workspace.
Task 1: Before making any analysis, it is necessary to know whether there are missing values in the table. Are there any missing values in the table? Discuss how you will deal with missing values, even if there are no missing values in this data set.
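One possible way to approach the missing-value check in Task 1 is sketched below. It is only an illustration under stated assumptions: the table is a small synthetic stand-in (the column names F1 and F2 are hypothetical, and one NaN is injected on purpose), whereas with the real file one would load the data with readtable('clinicalfeatures.xlsx'). Dropping rows and median imputation are shown as two common options, not as the required approach.

```matlab
% Synthetic stand-in for the spreadsheet (ASSUMPTION: F1, F2 are made-up
% feature names). With the real file one would instead call:
%   T = readtable('clinicalfeatures.xlsx');
rng(1);                                   % reproducible random numbers
T = array2table(rand(10, 3), 'VariableNames', {'F1', 'F2', 'Class'});
T.Class = [ones(5, 1); 2 * ones(5, 1)];   % 1 = healthy control, 2 = cancer
T.F1(3) = NaN;                            % inject one missing value for the demo

% Locate missing entries.
missingMask   = ismissing(T);             % logical matrix, true where missing
missingPerCol = sum(missingMask, 1);      % number of missing entries per column
hasMissing    = any(missingMask(:));      % true if anything is missing

% Option A: drop every row that contains a missing value.
Tdropped = rmmissing(T);

% Option B: replace missing values in a feature with that feature's median.
Tfilled    = T;
Tfilled.F1 = fillmissing(T.F1, 'constant', median(T.F1, 'omitnan'));

fprintf('Missing values present: %d\n', hasMissing);
disp(array2table(missingPerCol, 'VariableNames', T.Properties.VariableNames));
```

Dropping rows is simple but shrinks an already small data set, while imputation keeps every subject at the cost of slightly distorting the feature's distribution; the report should justify whichever choice is made.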
Task 2: Before making an analysis, it is beneficial to understand the data by looking at the summary statistics. There are two groups of subjects in this clinical experiment. For each group, show the following summary statistics for each feature in a table: minimum, maximum, mean, median and variance. For each group, draw a box plot of each feature.
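The per-group summary statistics and box plots of Task 2 could be produced along the following lines. This is a minimal sketch, assuming the Statistics and Machine Learning Toolbox is available (for boxplot) and using synthetic data with hypothetical feature names F1 and F2; only the group sizes (52 healthy, 64 cancer) and the class coding come from the brief.

```matlab
% Synthetic stand-in data (ASSUMPTION: F1 and F2 are hypothetical features).
rng(1);
nH = 52;  nC = 64;                        % group sizes stated in the brief
T = table([randn(nH, 1); randn(nC, 1) + 1], ...
          [rand(nH, 1);  rand(nC, 1) * 2], ...
          [ones(nH, 1);  2 * ones(nC, 1)], ...
          'VariableNames', {'F1', 'F2', 'Class'});
featNames = {'F1', 'F2'};

% One row per group; one column per (statistic, feature) pair,
% named e.g. min_F1, mean_F2, var_F1.
S = groupsummary(T, 'Class', {'min', 'max', 'mean', 'median', 'var'}, featNames);
disp(S);

% One figure per group, with one box per feature.
for g = 1:2
    figure;
    boxplot(T{T.Class == g, featNames}, 'Labels', featNames);
    title(sprintf('Class %d', g));
    ylabel('Feature value');
end
```

If the real features have very different scales, a single axis per group can be hard to read; plotting each feature in its own subplot is a reasonable alternative.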
Task 3: We want to understand the relationship between features. If two features are highly correlated, using only one of them could be enough for our analysis. Show the correlation matrix of the features in a table, where each element in the matrix is the correlation coefficient of two features. Discuss your observations on the correlation matrix. Are there any features which are highly correlated? In any case, we will use all the features in the following tasks.
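A correlation matrix like the one requested in Task 3 can be computed with corrcoef and wrapped in a table for display. The sketch below uses synthetic data in which F2 is deliberately constructed to track F1; the feature names and the 0.8 threshold for "highly correlated" are assumptions chosen for illustration, not values taken from the brief.

```matlab
% Synthetic features (ASSUMPTION: names and data are made up for the demo).
rng(2);
F1 = randn(100, 1);
F2 = F1 + 0.1 * randn(100, 1);            % built to be strongly correlated with F1
F3 = randn(100, 1);                       % independent of the others
X  = [F1, F2, F3];
featNames = {'F1', 'F2', 'F3'};

% Pearson correlation coefficient for every pair of columns.
R = corrcoef(X);
corrTable = array2table(R, 'VariableNames', featNames, 'RowNames', featNames);
disp(corrTable);

% List each pair once (upper triangle, diagonal excluded) whose absolute
% correlation exceeds the threshold. 0.8 is an arbitrary illustrative cut-off.
threshold = 0.8;
[rowIdx, colIdx] = find(triu(abs(R) > threshold, 1));
for k = 1:numel(rowIdx)
    fprintf('%s and %s: r = %.3f\n', ...
            featNames{rowIdx(k)}, featNames{colIdx(k)}, R(rowIdx(k), colIdx(k)));
end
```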
Section II: Classification & Big data analysis (70%)
Having completed a preliminary analysis of the data, we now want to see whether the clinical features could be used to predict the presence of the cancer. This is treated as a classification problem.
Task 4: Shuffle the data samples and split them into a 70% training set and a 30% test set. How many examples of each group are in the training set? How many examples of each group are in the test set?
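The shuffle-and-split step of Task 4 might look like the sketch below, which uses randperm for the shuffle and counts the members of each group in both subsets. The single feature column F1 is hypothetical; the group sizes and class labels follow the brief. Rounding 70% of 116 rows gives 81 training and 35 test samples.

```matlab
% Synthetic stand-in with the group sizes from the brief
% (ASSUMPTION: F1 is a made-up feature column).
rng(3);
nH = 52;  nC = 64;
T = table(randn(nH + nC, 1), [ones(nH, 1); 2 * ones(nC, 1)], ...
          'VariableNames', {'F1', 'Class'});

% Shuffle the row indices, then cut at 70%.
n        = height(T);
shuffled = randperm(n);
nTrain   = round(0.7 * n);
trainT   = T(shuffled(1:nTrain), :);
testT    = T(shuffled(nTrain + 1:end), :);

% Group counts as [healthy, cancer] for each subset.
trainCounts = [sum(trainT.Class == 1), sum(trainT.Class == 2)];
testCounts  = [sum(testT.Class == 1),  sum(testT.Class == 2)];
fprintf('Training set: %d healthy, %d cancer\n', trainCounts(1), trainCounts(2));
fprintf('Test set:     %d healthy, %d cancer\n', testCounts(1),  testCounts(2));
```

A plain random split does not guarantee the same class proportions in both subsets. If that matters, cvpartition(T.Class, 'HoldOut', 0.3) from the Statistics and Machine Learning Toolbox produces a stratified split instead.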
Task 5: Train a decision tree and a support vector machine model using the training set, and then apply the trained classifiers to the test set to obtain the predicted labels. Evaluate each classifier by computing its error rate ('Misclassified Samples' divided by 'Classified Samples', i.e. one minus the accuracy), sensitivity and specificity. Discuss the error rate, sensitivity and specificity.
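The training and evaluation steps of Task 5 can be sketched as follows, assuming the Statistics and Machine Learning Toolbox (fitctree, fitcsvm, confusionmat). The data are synthetic and well separated, so the numbers mean nothing for the real data set. Two further assumptions are made and should be stated in any report: class 2 (cancer) is treated as the positive class, and the error rate is taken as misclassified test samples divided by all test samples.

```matlab
% Synthetic two-feature data (ASSUMPTION: made up for the demo).
rng(4);
nH = 52;  nC = 64;
X = [randn(nH, 2); randn(nC, 2) + 3];     % cancer group shifted away from healthy
y = [ones(nH, 1); 2 * ones(nC, 1)];       % 1 = healthy, 2 = cancer

% 70/30 shuffle split.
shuffled = randperm(numel(y));
nTrain   = round(0.7 * numel(y));
tr = shuffled(1:nTrain);
te = shuffled(nTrain + 1:end);

% Train both classifiers on the training rows only.
models     = {fitctree(X(tr, :), y(tr)), ...
              fitcsvm(X(tr, :), y(tr), 'Standardize', true)};
modelNames = {'DecisionTree', 'SVM'};

results = zeros(2, 3);                    % columns: error rate, sensitivity, specificity
for m = 1:2
    yPred = predict(models{m}, X(te, :));
    % Rows are true classes, columns are predicted classes, in the order [1 2].
    C  = confusionmat(y(te), yPred, 'Order', [1 2]);
    TN = C(1, 1);  FP = C(1, 2);
    FN = C(2, 1);  TP = C(2, 2);
    errRate     = (FP + FN) / sum(C(:));  % misclassified / all classified
    sensitivity = TP / (TP + FN);         % cancer cases correctly detected
    specificity = TN / (TN + FP);         % healthy controls correctly cleared
    results(m, :) = [errRate, sensitivity, specificity];
end

resultsTable = array2table(results, 'RowNames', modelNames, ...
    'VariableNames', {'ErrorRate', 'Sensitivity', 'Specificity'});
disp(resultsTable);
```

Because a single random split of a small data set gives noisy estimates, repeating the split several times and averaging the metrics is one way to make the comparison in the discussion more robust.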
Task 6: Compare the decision tree and the support vector machine based on the results obtained in Task 5. Which method would you prefer for classifying this data?
Task 7: Based on your analysis of this data, discuss whether these clinical features could potentially be used as a biomarker of this particular cancer.
Task 8: A larger dataset, 'clinicalfeatures1.csv', which contains many more clinical data samples from healthy people only, can be downloaded from Blackboard. Using this larger dataset, please use MapReduce to calculate the minimum, maximum, mean and variance of the HOMA feature.
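A MapReduce computation of this kind could be structured as in the sketch below. It is an illustration under assumptions: a small synthetic CSV with a single HOMA column is written to a temporary file in place of 'clinicalfeatures1.csv', the helper names homaMapper and homaReducer are hypothetical, and the variance is derived from running sums of x and x squared, which is simple but can lose precision when the mean is large relative to the spread. The local functions must sit at the end of the script file.

```matlab
mapreducer(0);                            % run serially in the local MATLAB session

% Synthetic stand-in for 'clinicalfeatures1.csv' (ASSUMPTION: values made up).
rng(5);
homa    = 1 + 2 * rand(1000, 1);
csvFile = [tempname '.csv'];
writetable(table(homa, 'VariableNames', {'HOMA'}), csvFile);

% The datastore reads the file chunk by chunk, keeping only the HOMA column.
ds = tabularTextDatastore(csvFile, 'SelectedVariableNames', {'HOMA'});

outds = mapreduce(ds, @homaMapper, @homaReducer, 'Display', 'off');
res   = readall(outds);                   % table with Key and Value columns
stats = res.Value{1};                     % struct: min, max, mean, variance
disp(stats);

function homaMapper(data, ~, intermKV)
    % Emit partial statistics for one chunk: [count, sum, sum of squares, min, max].
    x = data.HOMA(~isnan(data.HOMA));
    if isempty(x)
        return;                           % nothing to contribute from this chunk
    end
    add(intermKV, 'HOMA', [numel(x), sum(x), sum(x .^ 2), min(x), max(x)]);
end

function homaReducer(~, valIter, outKV)
    % Merge the partial statistics from all chunks into the final values.
    n = 0;  s = 0;  ss = 0;  mn = Inf;  mx = -Inf;
    while hasnext(valIter)
        v  = getnext(valIter);
        n  = n + v(1);
        s  = s + v(2);
        ss = ss + v(3);
        mn = min(mn, v(4));
        mx = max(mx, v(5));
    end
    mu       = s / n;
    variance = (ss - n * mu ^ 2) / (n - 1);   % sample variance, as var() computes
    add(outKV, 'HOMA', struct('min', mn, 'max', mx, 'mean', mu, 'variance', variance));
end
```

Emitting partial sums rather than raw values keeps the intermediate data small, since each chunk contributes only five numbers regardless of how many rows it holds.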
Submission Instructions
The submission deadline for this assignment is included in the School Submission dates on Blackboard. You must make an electronic submission of your assessment report to the Turnitin upload area for Assessment 2. The ZIP file containing your MATLAB scripts should be submitted to 'Assessment Item 2 Supporting Documentation Upload'.
The report must:
Contain your name, student number, student email address, and module name;
Be in PDF format and no more than 12 pages (including everything!);
Be formatted single-spaced with 11pt font size;
Not include this briefing document.
This assessment is an individually assessed component. Your work must be presented according to the School of Computer Science guidelines for the presentation of assessed written work. Please make sure you have a clear understanding of the grading principles for this component as detailed in the accompanying Criterion Reference Grid. Your citations and referencing should be in accordance with University guidelines.
If you are unsure about any aspect of this assessment component, please seek the advice of the module co-ordinator Miao Yu < [email protected]>