---
title: High Dimensional Data Analysis
output: pdf_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, eval = TRUE, error = TRUE)
```
```{r, echo=FALSE, eval=TRUE, message=FALSE}
library(MASS)
library(ca)
library(knitr)
library(kableExtra)
library(dplyr)
library(stats)
library(broom)
library(tidyverse)
```
# Standardisation and Distance **(10 Marks)**
*The following questions require only the variables Income, Education Years and Age.*
*1. Standardise income, education_years and age by centering (subtracting the mean) and scaling (dividing by the standard deviation) using the `scale` function. Print out the first 5 observations.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*2. From your answer to Q1, what is the standardised value of Age for the first observation (Nichols) in your data?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*3. The government proposes a currency reform, introducing new dollars that are exchanged at a rate of 100 to 1. The effect of this is that every income should be divided by 100. Create a variable `NewIncome` which is equal to Income divided by 100 (**`NewIncome` is only to be used for question A**).* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*4. Find the Manhattan Distance between the first and second observation (Nichols and Fisher) using income, education_years and age as the variables. Do NOT standardise the data.* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*5. Find the Manhattan Distance between the first and second observation (Nichols and Fisher) using `NewIncome`, education_years and age as the variables. Do NOT standardise the data.* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*6. Are the answers to Question 4 and Question 5 the same? Why or why not?* **(2 Marks)**
INCLUDE YOUR ANSWER HERE
*7. How would your answer to question 6 change, if Euclidean distance were used in Question 4 and Question 5?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*8. Explain the role that distance plays in collaborative filtering for recommender systems.* **(2 Marks)**
INCLUDE YOUR ANSWER HERE
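
As a reminder of the mechanics only, the sketch below runs `scale` and two distance calculations on a small toy data frame. The values, the row labels `A`, `B`, `C` and the column `new_income` are all hypothetical and are not taken from the assignment data.

```{r}
# Toy data: hypothetical values, NOT the assignment data
toy <- data.frame(
  income          = c(52000, 61000, 48000),
  education_years = c(3, 5, 2),
  age             = c(34, 41, 29),
  row.names       = c("A", "B", "C")
)

# Centre and scale each column; scale() returns a numeric matrix
toy_std <- scale(toy)
head(toy_std, 5)

# Manhattan and Euclidean distance between rows A and B on the raw data
raw <- as.matrix(toy)
manhattan_ab <- sum(abs(raw["A", ] - raw["B", ]))
euclidean_ab <- sqrt(sum((raw["A", ] - raw["B", ])^2))
manhattan_ab
euclidean_ab

# Same distances after expressing income in different units (hypothetical column)
toy$new_income <- toy$income / 100
dist(toy[, c("new_income", "education_years", "age")], method = "manhattan")
dist(toy[, c("new_income", "education_years", "age")], method = "euclidean")
```

Comparing the two sets of distances shows how each variable's units feed into the result.
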
\newpage
# Clustering and multidimensional scaling **(5 Marks)**
*1. Use complete linkage to conduct Cluster Analysis for all the individuals in the Health `sector`. Use numeric variables only.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*2. Construct a dendrogram where the 3-cluster solution is highlighted.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*3. Provide the centroids of the 3-cluster solution.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*4. Use Multidimensional Scaling to provide a 2-dimensional representation of the same observations used in Question 1.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*5. Produce a scatter plot with the 2-dimensional MDS representation. Colour the points according to the cluster membership for the 3-cluster solution in Question 2.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
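
The general workflow in this section can be sketched on R's built-in `USArrests` data, which is not the assignment data. Standardising before computing distances is an assumption made for this illustration, and the object names `x`, `d`, `hc`, `cl` and `mds` are placeholders.

```{r}
# Illustration on the built-in USArrests data, NOT the assignment data
x <- scale(USArrests)                 # standardising is an assumption here
d <- dist(x, method = "euclidean")    # pairwise distances between rows

hc <- hclust(d, method = "complete")  # complete-linkage clustering
plot(hc, cex = 0.6)                   # dendrogram
rect.hclust(hc, k = 3)                # box the 3-cluster solution

cl <- cutree(hc, k = 3)               # cluster label for each row
aggregate(USArrests, by = list(cluster = cl), FUN = mean)  # cluster centroids

mds <- cmdscale(d, k = 2)             # classical MDS coordinates
plot(mds, col = cl, pch = 19, xlab = "Dim 1", ylab = "Dim 2")
```
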
\newpage
# Principal Components Analysis **(10 Marks)**
*1. Carry out Principal Components Analysis on the data using all numeric variables.* **(2 Marks)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*2. Did you standardise the variables? Why or why not?* **(2 Marks)**
INCLUDE YOUR ANSWER HERE
*3. What is the weight on age for the 3rd principal component?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*4. What is the variance of the 4th principal component?* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
INCLUDE YOUR ANSWER HERE
*5. Make a Scree plot.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*6. According to the Scree plot, how many principal components should be used in the analysis? How did you choose this value?* **(2 Marks)**
INCLUDE YOUR ANSWER HERE
*7. How many principal components would be used in the analysis if Kaiser's rule were used?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
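
For reference, the pieces of a principal components fit that these questions refer to can be located as sketched below. The example uses the built-in `USArrests` data rather than the assignment data, and the choice of `prcomp` with `scale. = TRUE` is an assumption of this illustration, not a required approach.

```{r}
# Illustration on the built-in USArrests data, NOT the assignment data
pca <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardises first

pca$rotation          # weights (loadings): one column per component
pc_var <- pca$sdev^2  # variance of each principal component
pc_var

screeplot(pca, type = "lines")  # scree plot of the component variances

# Kaiser's rule on standardised data: keep components with variance above 1
sum(pc_var > 1)
```
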
\newpage
# Multidimensional Scaling **(15 Marks)**
*1. Using only those observations for which `second_language` is Chinese, carry out classical multidimensional scaling. Find a two-dimensional representation, using the standardised values of `income`, `experience`, `age`, `education_years` and `siblings` as the variables.* **(4 Marks)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*2. Plot a 2-dimensional representation of this data. Rather than plotting the observations as points, use the individuals' surnames.* **(3 Marks)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*3. Name two individuals (by surname) who are similar according to your plot in Question 2, and two individuals (by surname) who are different. If you were unable to generate the plot in Question 2, then describe how you would answer this question.* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*4. Produce the same plot as in Question 2 using Kruskal's algorithm.* **(3 Marks)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*5. Are your conclusions in Question 3 robust to using a different multidimensional scaling method? If you were unable to generate the plot in Question 2 and/or Question 4, then describe how you would answer this question.* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*6. Describe the differences between classical multidimensional scaling and Kruskal's algorithm.* **(3 Marks)**
INCLUDE YOUR ANSWER HERE
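
The two scaling methods named in this section are available as `cmdscale` (base R) and `isoMDS` (package `MASS`). The sketch below calls both on the built-in `USArrests` data, which is not the assignment data, and labels the points with row names in place of surnames. The object names are placeholders.

```{r}
library(MASS)  # provides isoMDS()

# Illustration on the built-in USArrests data, NOT the assignment data
d <- dist(scale(USArrests))

classical <- cmdscale(d, k = 2)   # classical (metric) MDS
kruskal   <- isoMDS(d, k = 2)     # Kruskal's non-metric MDS; prints progress

# Plot labels instead of points: type = "n" draws an empty frame first
plot(classical, type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(classical, labels = rownames(USArrests), cex = 0.6)

plot(kruskal$points, type = "n", xlab = "Dim 1", ylab = "Dim 2")
text(kruskal$points, labels = rownames(USArrests), cex = 0.6)

kruskal$stress                    # final stress, reported in percent
```
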
\newpage
# Correspondence analysis (ETF3500 students only) **(10 Marks)**
*1. Construct a contingency table between the `sector` and `second_language` variables. Only consider those individuals with at most three `education_years`.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*2. Using the contingency table in point 1, perform correspondence analysis on the `sector` and `second_language` variables and visualise the results.* **(2 Marks)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*3. Based on the results in point 2, which language is most similar to Greek?* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
INCLUDE YOUR ANSWER HERE
*4. Based on the results in point 2, how much inertia is explained by the first dimension?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*5. Repeat point 2, but this time, only consider those individuals whose income is less than $100000 or whose number of siblings is more than one.* **(2 Marks)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*6. Based on the results in point 5, which sector is the most similar to Construction?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two exercises CA helps explain a larger amount of inertia.* **(2 Marks)**
INCLUDE YOUR ANSWER HERE
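
To make the inertia figures in this section concrete, the sketch below computes the principal inertias of a toy contingency table by hand in base R, as the squared singular values of the standardised residuals. The counts and the labels `S1` to `S3` and `L1` to `L3` are hypothetical, and this manual route is only an illustration; the `ca` package loaded above reports comparable figures for a fitted model.

```{r}
# Toy contingency table: hypothetical counts, NOT the assignment data
tab <- matrix(c(20, 10,  5,
                 8, 15, 12,
                 4,  9, 17),
              nrow = 3, byrow = TRUE,
              dimnames = list(sector   = c("S1", "S2", "S3"),
                              language = c("L1", "L2", "L3")))

P      <- tab / sum(tab)   # correspondence matrix (relative frequencies)
r_mass <- rowSums(P)       # row masses
c_mass <- colSums(P)       # column masses

# Standardised residuals from the independence model
S <- (P - outer(r_mass, c_mass)) / sqrt(outer(r_mass, c_mass))

inertias <- svd(S)$d^2     # principal inertias, one per dimension
total_inertia <- sum(inertias)

total_inertia
inertias / total_inertia   # share of inertia explained by each dimension
```
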
\newpage
# Correspondence analysis (ETF5500 students only) **(10 Marks)**
*1. Using only individuals whose `gender` is Male and whose `income` is at least $100000, construct a contingency table between the `sector` and `second_language` variables.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*2. Using the contingency table in point 1, perform correspondence analysis on the `sector` and `second_language` variables and visualise the results.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*3. Based on the results in point 2, which sector is most associated with people who speak Spanish as a second language?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*4. Based on the results in point 2, how much inertia is explained by the first dimension?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*5. Repeat point 2, but this time, only consider those individuals whose `gender` is Male and whose `income` is less than $100000.* **(1 Mark)**
```{r}
#INCLUDE YOUR R CODE HERE
```
*6. Based on the results in point 5, which language is most associated with the **Health** sector?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two exercises CA helps explain a larger amount of inertia.* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*8. Discuss the differences or similarities between the results obtained in points 2 and 5. For example, are the associations between `sector` and `second_language` consistent?* **(1 Mark)**
INCLUDE YOUR ANSWER HERE
*9. In your own words, describe the connection between the chi-square statistic and correspondence analysis.* **(2 Marks)**
INCLUDE YOUR ANSWER HERE