Semester Two 2020
Supplementary Exam – Alternative Assessment Task
High Dimensional Data Analysis
2 hours 40 minutes (includes reading, downloading, and uploading time)
• This is an individual assessment task.
• This is an open book exam.
• All responses must be included in the RMARKDOWN template document available on Moodle, and then rendered into a pdf document.
• ALL STUDENTS are required to answer questions A, B, C and D. ETF3500 students are required to answer question E. ETF5500 students are required to answer question F.
• This assessment accounts for 50% of the total in the unit.
• Any model of calculator is allowed.
• Upon completion of this assessment task, please upload the pdf document to Moodle using the assignment submission link.
• Your submission must occur within 2 hours and 40 minutes of the official commencement of this assessment task (Australian Eastern Daylight Time).
Please read the next page carefully and sign and date the Student Statement before commencing the assessment task.
The exam
This exam uses simulated data that emulates SOME features of the Australian labour market. By now, you should have access to your dataset, which is generated according to your student ID. The dataset provides 9 attributes on 500 individuals who speak English as their first language. The following variables are provided in the dataset:
surname: Surname of the individual.
income: Yearly income (dollars).
experience: Work experience (years).
age: Age of the individual (years).
gender: Gender of the individual.
sector: Industry of work.
second_language: Second language spoken by the individual.
education_years: Total number of tertiary education years.
siblings: Total number of siblings.
Based on this information you must answer the questions below. Code to produce each of the R outputs in your answers must be provided.
A Standardisation and Distance (10 Marks)
This question requires you to use only the variables income, education_years and age.
1. Standardise income, education_years and age by centering (subtracting the mean) and scaling (dividing by the standard deviation) using the scale function. Print out the first 5 observations. (1 Mark)
2. From your answer to Q1, what is the standardised value of age for the first observation (Nichols) in your data? (1 Mark)
3. The government proposes a currency reform, introducing new dollars that are exchanged at a rate of 100 to 1. The effect of this is that every income should be divided by 100. Create a variable NewIncome which is equal to income divided by 100 (NewIncome is only to be used for question A). (1 Mark)
4. Find the Manhattan Distance between the first and second observation (Nichols and Fisher) using income, education_years and age as the variables. Do NOT standardise the data. (1 Mark)
5. Find the Manhattan Distance between the first and second observation (Nichols and Fisher) using NewIncome, education_years and age as the variables. Do NOT standardise the data. (1 Mark)
6. Are the answers to Question 4 and Question 5 the same? Why or why not? (2 Marks)
7. How would your answer to Question 6 change if Euclidean distance were used in Question 4 and Question 5? (1 Mark)
8. Explain the role that distance plays in collaborative filtering for recommender systems. (2 Marks)
B Clustering and multidimensional scaling (5 Marks)
1. Use complete linkage to conduct Cluster Analysis for all the individuals in the Health sector. Use numeric variables only. (1 Mark)
2. Construct a dendrogram where the 3-cluster solution is highlighted. (1 Mark)
3. Provide the centroids of the 3-cluster solution. (1 Mark)
4. Use Multidimensional Scaling to provide a 2-dimensional representation of the same observations used in Question 1. (1 Mark)
5. Produce a scatter plot with the 2-dimensional MDS representation. Colour the points according to the cluster membership for the 3-cluster solution in Question 2. (1 Mark)
C Principal Components Analysis (10 Marks)
1. Carry out Principal Components Analysis on the data using all numeric variables. (2 Marks)
2. Did you standardise the variables? Why or why not? (2 Marks)
3. What is the weight on age for the 3rd principal component? (1 Mark)
4. What is the variance of the 4th principal component? (1 Mark)
5. Make a Scree plot. (1 Mark)
6. According to the Scree plot, how many principal components should be used in the analysis? How did you choose this value? (2 Marks)
7. How many principal components would be used in the analysis if Kaiser’s rule were used? (1 Mark)
D Multidimensional Scaling (15 Marks)
1. Using only those observations for which second_language is Chinese, carry out classical multidimensional scaling. Find a two-dimensional representation, using the standardised values of income, experience, age, education_years and siblings as the variables. (4 Marks)
2. Plot a 2-dimensional representation of this data. Rather than plot the observations as points, use the individuals’ surnames. (3 Marks)
3. Name two individuals (by surname) who are similar according to your plot in Question 2, and two individuals (by surname) who are different. If you were unable to generate the plot in Question 2, then describe how you would answer this question. (1 Mark)
4. Produce the same plot as in Question 2 using Kruskal’s algorithm. (3 Marks)
5. Are your conclusions in Question 3 robust to using a different multidimensional scaling method? If you were unable to generate the plot in Question 2 and/or Question 4, then describe how you would answer this question. (1 Mark)
6. Describe the differences between classical multidimensional scaling and Kruskal’s algorithm. (3 Marks)
E Correspondence analysis (ETF3500 students only) (10 Marks)
1. Construct a contingency table between the sector and second_language variables. Only consider those individuals with at most three education_years. (1 Mark)
2. Using the contingency table in point 1, perform correspondence analysis on the sector and second_language variables and visualise the results. (2 Marks)
3. Based on the results in point 2, which language is most similar to Greek? (1 Mark)
4. Based on the results in point 2, how much inertia is explained by the first dimension? (1 Mark)
5. Repeat point 2, but this time, only consider those individuals whose income is less than $100000 or whose number of siblings is more than one. (2 Marks)
6. Based on the results in point 5, which sector is the most similar to Construction? (1 Mark)
7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two exercises CA helps explain a larger amount of inertia. (2 Marks)
F Correspondence analysis (ETF5500 students only) (10 Marks)
1. Using only individuals whose gender is Male and whose income is at least $100000, construct a contingency table between the sector and second_language variables. (1 Mark)
2. Using the contingency table in point 1, perform correspondence analysis on the sector and second_language variables and visualise the results. (1 Mark)
3. Based on the results in point 2, which sector is most associated with people who speak Spanish as a second language? (1 Mark)
4. Based on the results in point 2, how much inertia is explained by the first dimension? (1 Mark)
5. Repeat point 2, but this time, only consider those individuals whose gender is Male and whose income is less than $100000. (1 Mark)
6. Based on the results in point 5, which language is most associated with the Health sector? (1 Mark)
7. Compute how much inertia is explained overall by the figures in points 2 and 5. Discuss in which of these two exercises CA helps explain a larger amount of inertia. (1 Mark)
8. Discuss the differences or similarities between the results obtained in points 2 and 5. For example, are the associations between sector and second_language consistent? (1 Mark)
9. In your own words, describe the connection between the chi-square statistic and correspondence analysis. (2 Marks)
