- Problem 1
In this problem, develop code to analyze the Iris data set using the test statistics listed in Table 1.
Table 1: Data Analysis Statistics
Test Statistics | Statistical Function F() |
Standard Deviation |
The analysis should be done by feature followed by class of flower type. This analysis should provide insight into the Iris data set.
Note: The trimmed mean is a variation of the mean which is calculated by removing values from the beginning and end of a sorted set of data. The average is then taken using the remaining values. This allows any potential outliers to be removed when calculating the statistics of the data. Assuming the data in $x_s = [x_{1,s}, x_{2,s}, \ldots, x_{n,s}]$ is sorted, the resulting trimmed set is $x_{s,p} = [x_{1+p,s}, x_{2+p,s}, \ldots, x_{n-p,s}]$, where $p$ is the number of values removed from each end. The trimmed mean thus prevents extreme values from influencing the mean of the data.
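As a minimal sketch of the note above, the trimmed mean can be computed by sorting the values and dropping `p` entries from each end before averaging. The function names `trimmed_mean` and `describe` and the sample values are illustrative assumptions, not part of the assignment, and only the statistics named in the text (mean, trimmed mean, standard deviation) are included.

```python
from statistics import mean, stdev


def trimmed_mean(values, p):
    """Mean of `values` after dropping the p smallest and p largest entries."""
    xs = sorted(values)
    if p < 0 or 2 * p >= len(xs):
        raise ValueError("p must satisfy 0 <= p < len(values) / 2")
    return mean(xs[p:len(xs) - p])


def describe(values, p=1):
    """Summary statistics for one feature of one class (hypothetical helper)."""
    return {
        "mean": mean(values),
        "trimmed_mean": trimmed_mean(values, p),
        "std": stdev(values),  # sample standard deviation (n - 1 denominator)
    }


# Made-up placeholder values with one obvious outlier (not real Iris data).
sample = [1, 2, 3, 4, 100]
stats = describe(sample, p=1)
print(stats)
```

With `p=1` the outlier 100 and the smallest value 1 are dropped, so the trimmed mean is 3 while the plain mean is 22. To follow the assignment, `describe` would be called once per feature and then once per feature within each flower class.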
- Problem 2 Parts a and b
In this problem we will begin to analyze Iris data based on the class of flower type using linear discriminant analysis.
- Implement the two-class linear discriminant based on Fisher's Linear Discriminant (FLD) two-class separability (Fisher, 1936) described below. This is also shown in the two-class linear discriminant function presented in (Bishop, 2006) Section 4.1.1 Two classes. For this exercise you will want to separate your Iris data into three sets and focus on any two-class combination. For example, from the Iris data take the first 50 observations as class 1, the next 50 as class 2 and the final 50 as class 3. Using the two-class linear discriminant function, compare class 1 versus class 2, class 1 versus class 3 and finally class 2 versus class 3.
- For this problem you will want to expand the two class case from part a to a three class case as presented in (Bishop, 2006) from Section 4.1.2 Multiple classes.
Now that we have our statistics set up, let us look at the mean and standard deviation between the classes (Iris flower types) and within the classes. Let us consider Fisher's Linear Discriminant (FLD) to quantify the two-class separability of features (Fisher, 1936). FLD is a simple technique which measures the discrimination of sets of real numbers. Without going into all of the theory of the FLD, let us focus on the primary components, assuming we have a two-class problem, equal class sample sizes and a covariance matrix that is generated from normal distributions. The within-class scatter matrix is defined as
$$S_W = \sum_{C} P_C S_C \tag{1}$$
where $S_C$ is the covariance matrix for class $C \in \{-1, +1\}$,
$$S_C = \sum_{i=1,\, i \in C}^{k} (x_i - \mu_C)(x_i - \mu_C)^T \tag{2}$$
and $P_C$ is the a priori probability of class $C$. That is, $P_C \approx k_C/k$, where $k_C$ is the number of samples in class $C$, out of a total of $k$ samples. The between-class scatter matrix is defined as
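$$S_B = \sum_{C} (\mu_{-1} - \mu_{+1})(\mu_{-1} - \mu_{+1})^T \tag{3}$$
where $\mu$ is the global mean vector
$$\mu = \frac{1}{k} \sum_{i=1}^{k} x_i \tag{4}$$
and the class mean vector $\mu_C$ is defined as
$$\mu_C = \frac{1}{k_C} \sum_{i \in C} x_i \tag{5}$$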
Now let us look at the criterion function $J(\cdot)$, written as follows:
$$J(w) = \frac{w^T S_B w}{w^T S_W w} \tag{6}$$
where $w$ is calculated to optimize $J(\cdot)$ as follows:
$$w = S_W^{-1}(\mu_{+1} - \mu_{-1}) \tag{7}$$
With $w$ obtained for the Fisher Linear Discriminant, the linear function yields the maximum ratio of the between-class scatter to the within-class scatter. Now let us determine a threshold $b$ that lets us decide which class a new observation belongs to. The optimal decision boundary, assuming each class has the same number of samples, can be calculated as follows:
$$b = -0.5\,(w^T \mu_{-1} + w^T \mu_{+1}) \tag{8}$$
Now, if we have a new input observation $x$, we can determine which class it belongs to based on the following:
$$y = w^T x + b \tag{9}$$
where $y < 0$ is class $-1$ and $y \geq 0$ is class $+1$.
The previous discussion is based on the FLD and is simplified as a two class linear discriminant function presented in (Bishop, 2006) Section 4.1.1 Two classes. Credit is given to Fisher for his work in this area of linear discrimination.
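The two-class procedure described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than a reference solution: it takes the standard Fisher solution $w = S_W^{-1}(\mu_{+1} - \mu_{-1})$ and places the threshold at the midpoint of the projected class means, $b = -0.5\,(w^T\mu_{-1} + w^T\mu_{+1})$, so that $y < 0$ maps to class $-1$. The function names `fit_fld` and `predict` are hypothetical, and the two Gaussian blobs are made-up stand-ins for two Iris classes.

```python
import numpy as np


def fit_fld(x_neg, x_pos):
    """Return (w, b) for the two-class discriminant y = w @ x + b.

    x_neg, x_pos: arrays of shape (n_samples, n_features) for classes -1 and +1.
    """
    mu_neg, mu_pos = x_neg.mean(axis=0), x_pos.mean(axis=0)
    k_neg, k_pos = len(x_neg), len(x_pos)
    k = k_neg + k_pos
    # Per-class scatter matrices, then the prior-weighted within-class scatter.
    s_neg = (x_neg - mu_neg).T @ (x_neg - mu_neg)
    s_pos = (x_pos - mu_pos).T @ (x_pos - mu_pos)
    s_w = (k_neg / k) * s_neg + (k_pos / k) * s_pos
    # Assumed direction: w = S_W^{-1} (mu_pos - mu_neg), solved without an explicit inverse.
    w = np.linalg.solve(s_w, mu_pos - mu_neg)
    # Assumed threshold: midpoint of the projected class means.
    b = -0.5 * (w @ mu_neg + w @ mu_pos)
    return w, b


def predict(x, w, b):
    """Label each row of x as -1 (y < 0) or +1 (y >= 0)."""
    return np.where(x @ w + b < 0, -1, 1)


# Made-up stand-in data: two well-separated Gaussian blobs, 50 samples x 4 features each.
rng = np.random.default_rng(0)
x_neg = rng.normal(loc=0.0, scale=1.0, size=(50, 4))
x_pos = rng.normal(loc=4.0, scale=1.0, size=(50, 4))

w, b = fit_fld(x_neg, x_pos)
labels = np.concatenate([predict(x_neg, w, b), predict(x_pos, w, b)])
truth = np.concatenate([-np.ones(50, dtype=int), np.ones(50, dtype=int)])
accuracy = float(np.mean(labels == truth))
print(accuracy)
```

For the assignment, `x_neg` and `x_pos` would be replaced by the 50-row blocks of any two Iris classes, and the fit repeated for each of the three class pairings.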
- Problem 3 Note this is a Collaborative Problem
25 Points Total
In this problem the Iris data set is to be expanded with synthetic data so that 100 additional observations are generated for each flower class, resulting in 300 additional observations. Once the data is generated, make a figure similar to Figure 1 (a) for each set of paired features and classes.
As an example, take the first 50 observations and plot the first feature (sepal length) against the fourth feature (petal width), shown in red in Figure 1. The 100 additional observations generated are shown in blue. In this example the synthetic data has a covariance matrix, mean, minimum and maximum similar to those of the original data. The synthetic data was generated using the covariance matrix, mean, minimum and maximum of the data. Random data was generated that contained 100 observations and 4 features. The random data was multiplied by the covariance matrix and normalized to fit the original Iris data in terms of minimum and maximum values; the mean of the data was then set based on the Iris mean.
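One possible reading of the generation steps described above is sketched below. Several details are assumptions, since the text does not fix them: the random draws are standard normal, the per-feature rescaling is a min-max normalization, and a final clip keeps the shifted values inside the observed range (which can move the mean slightly away from the target). The function name `synthesize` is hypothetical, and the `real` array is a made-up placeholder for one class of Iris observations.

```python
import numpy as np


def synthesize(real, n_new=100, seed=0):
    """Generate n_new synthetic rows resembling `real` (shape: n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    cov = np.cov(real, rowvar=False)       # feature covariance matrix
    lo, hi = real.min(axis=0), real.max(axis=0)
    target_mean = real.mean(axis=0)

    # Step 1: random data (assumed standard normal) multiplied by the covariance matrix.
    z = rng.standard_normal((n_new, real.shape[1])) @ cov
    # Step 2: min-max normalize each feature into the observed [lo, hi] range.
    z = (z - z.min(axis=0)) / (z.max(axis=0) - z.min(axis=0))
    z = lo + z * (hi - lo)
    # Step 3: shift each feature so its mean matches the original mean.
    z = z + (target_mean - z.mean(axis=0))
    # Assumption: clip so the shift cannot push values outside the observed range.
    return np.clip(z, lo, hi)


# Made-up placeholder for one flower class (50 observations, 4 features); not real Iris data.
rng = np.random.default_rng(1)
real = rng.normal(loc=[5.0, 3.0, 1.5, 0.3], scale=[0.4, 0.4, 0.2, 0.1], size=(50, 4))

synthetic = synthesize(real, n_new=100)
print(synthetic.shape)
```

Multiplying by the covariance matrix follows the description in the text; multiplying by a Cholesky factor of the covariance is the more common way to reproduce a target covariance and may be worth comparing. Running `synthesize` once per class gives the 300 additional observations.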
- (a) Synthetic Data (blue) vs Iris Data (red)
- (b) Distributions
Figure 1: Synthetic Data vs Iris Data. (a) shows the synthetic data in blue and the original Iris data in red; (b) shows the distributions of the data for context.
- Problem 4 Note this is a Collaborative Problem
In some application areas of data science, data retrieval and data cleansing are critical to the entire analysis process. One example is portfolio analysis. Elsevier's Scopus (https://www.scopus.com) is the largest abstract and citation database of peer-reviewed literature: scientific journals, books and conference proceedings. It covers nearly 36,377 titles from approximately 11,678 publishers, of which 34,346 are peer-reviewed journals in top-level subject fields: life sciences, social sciences, physical sciences and health sciences.
- Go to the Scopus website and search for data science and machine learning related documents. Plot the distribution of the number of documents by year from at least the last 10 years. What is the story that the plot tells you?
- Limit the search to 2016 and 2017. List the possible data fields/columns you may need to export in order to answer the question of author and/or institution collaborations in this scientific area during this timeframe.
- Within the possible fields you suggest to export, which fields need data cleansing and why, in order to provide robust input for performing portfolio analysis?