ECMM444 Fundamentals Of Data Science
Continuous Assessment 1
This continuous assessment (CA) comprises 40% of the overall module assessment. This is an individual exercise and your attention is drawn to the College and University guidelines on collaboration and plagiarism, which are available from the College website.
Question 1
[30 marks]
a) Create a 2D data matrix A containing 200 vectors normally distributed with zero mean and a standard deviation on the first axis equal to 1 and on the second axis equal to 3.
Create a second data matrix B containing 200 vectors normally distributed with mean in [2, 2] and unit standard deviation.
Consider the data matrix D that contains vectors from both A and B. If plotted you should obtain a figure similar to the following:
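A minimal sketch of how the matrices in part a) might be generated, assuming vectors are stored as rows and that NumPy's random generator is acceptable. The seed and the names `rng` and `n` are illustrative choices, not part of the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, used only for reproducibility

n = 200
# A: zero mean, standard deviation 1 on the first axis and 3 on the second
A = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(n, 2))
# B: mean [2, 2], unit standard deviation on both axes
B = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(n, 2))
# D: stack the rows of A and B into a single 400 x 2 data matrix
D = np.vstack([A, B])
```

Passing per-axis values for `loc` and `scale` relies on NumPy broadcasting them against the requested `size`.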
b) A vector v is dominated by a vector u if u[i] > v[i] for all components i.
The Pareto front of a set of vectors D is the set of vectors that are not dominated by any other vector in D.
Using the notion of a Pareto front write the code to produce the following type of output:
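The definition above can be sketched as a Boolean mask over the rows of D. This is one possible approach, assuming vectors are stored as rows and that a pairwise comparison of all rows fits in memory; the function name `pareto_front_mask` is hypothetical, and the sketch only identifies the front, it does not produce a plot.

```python
import numpy as np

def pareto_front_mask(D):
    """Return a Boolean vector that is True for rows of D not dominated by any other row."""
    D = np.asarray(D, dtype=float)
    # dominates[u, v] is True when row u is strictly greater than row v in every component
    dominates = (D[:, None, :] > D[None, :, :]).all(axis=2)
    # a row is on the front when no row dominates it
    return ~dominates.any(axis=0)

points = np.array([[1, 1], [2, 2], [3, 0], [0, 3], [2, 1]])
mask = pareto_front_mask(points)
front = points[mask]  # only [1, 1] is dominated (by [2, 2])
```

Because dominance is defined with a strict inequality on every component, [2, 1] is not dominated by [2, 2] in this example.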
(Total 30 marks)
Question 2
Acquire the Iris dataset using the following procedure:
    from sklearn.datasets import load_iris
    X, y = load_iris(return_X_y=True)
The data matrix contains 150 vectors (also called instances) with 4 attributes each (i.e. it is a 150 x 4 matrix).
The vector y contains the class information associated to each instance; in our case the class is an integer in {0, 1, 2}.
a) Transform the data matrix X into a 2D data matrix D. To do so, consider only the attributes with index 0 and index 1. Plot D using a different color for each class.
b) Define a function to compute the distance between two vectors (of arbitrary dimension) as the length of the difference vector.
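A minimal sketch of such a function, assuming that the length of a vector means its Euclidean norm; the name `distance` is an illustrative choice.

```python
import numpy as np

def distance(u, v):
    """Euclidean length of the difference vector u - v, for vectors of any dimension."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.sqrt(np.sum(diff ** 2))

d = distance([0, 0], [3, 4])  # a 3-4-5 right triangle
```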
c) The k-th nearest neighbor of a vector Di ∈ D is the vector Dj ∈ D that is in position k when all vectors in D (excluding Di) are arranged by increasing distance from Di.
A k-outlier is an instance Di that is not within the k nearest neighbors of any of its k nearest neighbors (i.e. if an instance a has neighbors b and c but a is not a 2-nearest neighbor of b nor of c, then a is a 2-outlier).
Define a function find_outliers that takes as input a value k and a data matrix D and returns a Boolean vector O with entries Oi that are True if the corresponding instance Di is a k-outlier.
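One way the definition of a k-outlier might be turned into code is sketched below. It assumes Euclidean distance, rows of D as instances, k smaller than the number of instances, and ties broken by the order `np.argsort` happens to return. The internal names (`knn`, `is_neighbor`, `mutual`) are illustrative.

```python
import numpy as np

def find_outliers(k, D):
    """Return a Boolean vector O where O[i] is True if row i of D is a k-outlier."""
    D = np.asarray(D, dtype=float)
    n = len(D)
    # Pairwise Euclidean distances between all rows of D
    diff = D[:, None, :] - D[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # An instance is never its own neighbor
    np.fill_diagonal(dist, np.inf)
    # Indices of the k nearest neighbors of each row
    knn = np.argsort(dist, axis=1)[:, :k]
    # is_neighbor[i, j] is True when j is among the k nearest neighbors of i
    is_neighbor = np.zeros((n, n), dtype=bool)
    is_neighbor[np.arange(n)[:, None], knn] = True
    # i is a k-outlier when none of its neighbors j has i among its own neighbors
    mutual = is_neighbor & is_neighbor.T
    return ~mutual.any(axis=1)

# Three points close together and one far away: only the last is a 2-outlier
O = find_outliers(2, [[0, 0], [1, 0], [0, 1], [10, 10]])
```

To treat each class independently, the same function could be called on the rows of D selected by a mask such as `y == c` for each class label c.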
d) Using the function find_outliers identify the k-outliers when k = 4 for the data in each class in D independently and plot a figure distinguishing them.
You should obtain a figure similar to the following:
(Total 40 marks)
Question 3
Use the 2D data matrix D from Question 2.
Use the function that computes the distance between two vectors (of arbitrary dimension) as the length of the difference vector.
a) Find the two vectors u,v in D that are most distant.
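A possible sketch for part a), assuming Euclidean distance and that computing the full pairwise distance matrix is affordable for a data matrix of this size. The function name `most_distant_pair` is hypothetical.

```python
import numpy as np

def most_distant_pair(D):
    """Return the two rows u, v of D separated by the largest Euclidean distance."""
    D = np.asarray(D, dtype=float)
    diff = D[:, None, :] - D[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Position (row, column) of the largest entry in the distance matrix
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    return D[i], D[j]

u, v = most_distant_pair([[0, 0], [1, 0], [5, 5], [-1, -1]])
```

If several pairs share the maximum distance, `np.argmax` returns the first one it encounters.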
b) Consider the difference vector m = u - v and from m build the corresponding versor (i.e. the unit vector) a.
c) Build a versor b that is orthogonal to a (hint: you should use the Gram-Schmidt procedure).
d) Build the new basis {a, b}. Express D in the new basis and plot the resulting data matrix using a different color for each class.
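Parts b) to d) can be sketched as follows for 2D data. The sketch assumes vectors are stored as rows, picks a standard basis vector as the starting point of the Gram-Schmidt step (an arbitrary choice), and does not translate the data before projecting. The names `new_basis` and `change_basis` are hypothetical.

```python
import numpy as np

def new_basis(u, v):
    """Build an orthonormal basis {a, b} of the plane from the difference m = u - v."""
    m = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    a = m / np.linalg.norm(m)  # versor along m
    # Gram-Schmidt step: start from the standard basis vector least aligned
    # with a, then remove its component along a and normalize the remainder
    e = np.array([1.0, 0.0]) if abs(a[0]) < abs(a[1]) else np.array([0.0, 1.0])
    w = e - (e @ a) * a
    b = w / np.linalg.norm(w)
    return a, b

def change_basis(D, a, b):
    """Coordinates of the rows of D in the orthonormal basis {a, b}."""
    basis = np.column_stack([a, b])
    # For an orthonormal basis, the new coordinates are the dot products with a and b
    return np.asarray(D, dtype=float) @ basis

a, b = new_basis([3, 4], [0, 0])
D_new = change_basis([[3, 4], [0, 0]], a, b)
```

Because {a, b} is orthonormal, distances between instances are unchanged by the transformation; only the orientation of the plotted data differs.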
(Total 30 marks)
Submitting your work
Please write your student ID in the first cell of the notebook. You should submit the Jupyter notebook containing the code with its output for all the questions. Make a separate cell for each point a), b), c), etc. of each question. In Turnitin, submit a single archive file (.zip or .tgz) containing both a PDF copy of your notebook and the source file with extension .ipynb. Markers will not be able to give feedback if you do not submit the PDF version of your code, and marks will be deducted if you fail to do so.
Marking criteria
Work will be marked against the following criteria. Although the weighting varies a little from question to question, the criteria have approximately equal weight.
Does your algorithm correctly solve the problem? In most of the questions the required code has been described, but not always in complete detail and some decisions are left to you.
Is the code syntactically correct? Is your program a legal Python program regardless of whether it implements the algorithm?
Is the code beautiful or ugly? Is the implementation clear and efficient or is it unclear and extremely inefficient (e.g. it takes more than a few minutes to execute)? Is the code well structured? Have you made good use of functions? Are you using Numpy functions on entire arrays when possible?
Is the code well laid out and commented? Is there a comment describing what the code does? Have you used space to make the code clear to human readers?
There are 10% penalties for:
Not submitting the PDF version of your programs.
Not creating functions as instructed in the questions.