THIS PAPER IS FOR STUDENTS STUDYING AT: (tick where applicable)
[x] Caulfield [ ] Clayton [ ] Parkville [ ] Peninsula [ ] Monash Extension [ ] Off Campus Learning [ ] Malaysia [ ] Sth Africa [ ] Other (specify)
Office Use Only
2018/2019 Summer Semester Examination Period (February 2019)
Faculty of Business and Economics
EXAM CODES: ETX2250 / ETF5922
TITLE OF PAPER: Data Visualisation and Analytics
EXAM DURATION: 2 hours writing time
READING TIME: 10 minutes
During an exam, you must not have in your possession any item/material that has not been authorised for your exam. This includes books, notes, paper, electronic device/s, mobile phone, smart watch/device, calculator, pencil case, or writing on any part of your body. Any authorised items are listed below. Items/materials on your desk, chair, in your clothing or otherwise on your person will be deemed to be in your possession.
No examination materials are to be removed from the room. This includes retaining, copying, memorising or noting down content of exam material for personal use or to share with any other person by any means following your exam.
Failure to comply with the above instructions, or attempting to cheat or cheating in an exam is a discipline offence under Part 7 of the Monash University (Council) Regulations, or a breach of instructions under Part 3 of the Monash University (Academic Board) Regulations.
AUTHORISED MATERIALS
OPEN BOOK: [ ] YES [x] NO
CALCULATORS: [x] YES [ ] NO
(If YES, only a HP 10bII+ calculator is permitted, except at Malaysia and South Africa campuses, where a calculator with an "approved for use" Faculty label is permitted)
SPECIFICALLY PERMITTED ITEMS: [x] YES [ ] NO
If yes, items permitted are: Ruler
Candidates must complete this section if required to write answers within this paper
STUDENT ID: __ __ __ __ __ __ __ __ DESK NUMBER: __ __ __ __ __
Page 1 of 12
Background for Questions 1 and 2.
The following scenario is used in Questions 1 and 2.
A researcher is interested in comparing economic and health indicators across countries in Africa, Asia and the Middle East based on data from the World Bank. The data used consists of a data frame countries.df:
GDP: Per capita Gross Domestic Product, in adjusted 2011 U.S. Dollars.
LaborRate: Labor force participation rate.
HealthExp: Health expenditures in U.S. Dollars.
InfMortality: Infant mortality per 1000 live births.
RegionName: Region, taking values Africa, Asia, Middle East.
Name: Name of country.
Question 1 [3 + 3 + 4 + 5 + 5 + 5 = 25 marks]
Use Figures 1.1 through 1.3 as input to answer the following questions.
a) What is the correlation between Infant mortality per 1000 live births and GDP in Asia?
a. -0.66 (3)
b) What is the approximate median Infant mortality per 1000 live births in Asia?
a. Just above 25 (3)
c) Discuss a prominent outlier in the Africa data which is apparent in Figure 1.3. Explain what you can determine about this outlier using information from any relevant graphs.
a. Figure 1.3 shows one country in Africa with a very high GDP compared with the rest of Africa (above 20,000). This makes it an outlier within the Africa data alone.
However, it also has remarkably high infant mortality for a country with such a high GDP. High infant mortality is common in Africa, but the combination of high GDP and high infant mortality makes this country an outlier even when all countries are included, as in the lower left-hand corner of Figure 1.1. No other country has both such a high GDP and such high infant mortality. Figure 1.2 shows that, according to the box plots, there are no outliers in infant mortality for Africa, which agrees with the other graphs. (4)
Note: Students might also point out that Figure 1.2 shows an outlier for the Middle East.
d) Using Figure 1.1, discuss the relationship between Health Expenditure and Infant Mortality.
a. In Figure 1.1, we see that in each region, the correlation between Health Expenditure and Infant mortality is negative as expected, but not very strong. (1)
b. From the scatterplot of Health Expenditure against Infant Mortality, we see that Health Expenditure is uniformly very low in Africa. (2)
c. In Asia and the Middle East there are several countries with very high health expenditure and these have very low Infant Mortality. (2)
e) Which graph would you use to highlight the difference in infant mortality between Africa and the other two regions? Discuss your chosen graph in detail.
a. Boxplot: It makes the differences in infant mortality between the regions easy to quantify. To highlight just this variable, this graphic is probably the best choice, as it shows this variable alone.
The median value of Infant Mortality in Africa is around 62, whereas the median for Asia is around 25 and for the Middle East around 12.
The boxplot makes it clear that the bulk of the countries in Africa have infant mortality rates well above those in the other two regions. (5)
f) Write down the ggplot command for creating Figure 1.2 by using the variable names from the background of this section:
a.
library(ggplot2)
ggplot(data = countries.df, aes(x = RegionName, y = InfMortality)) +
  geom_boxplot() +
  labs(title = "Figure 1.2",
       x = "Region Name",
       y = "Infant mortality per 1000 live births")
Note: Full marks (5) for correct code; deduct 1 mark for each missing title or axis label.
Question 2 [5 + 5 = 10 marks]
a) Table 2.1 represents a subset of the data for 3 countries in Africa. Suppose subtab.df is the data frame containing the columns of Table 2.1. Implement by hand the following command, thus rewriting this data in long form:
gather(data = subtab.df, key = Measurement, value = Quantvalue, -Name )
(5)
b) Suppose you have Table 2.2 as input. Write down the dplyr commands for calculating the average value for each measurement across the regions. The input data frame is called subtab.df:
subtab.df %>%
  group_by(region, measurement) %>%
  summarise(average = mean(value))
(5)
c) Suppose you have Table 2.2 as input in the database under the table name public.subset. Write down the SQL commands for calculating the average value for each measurement across the regions.
SELECT region, measurement, AVG(value) AS avg
FROM public.subset
GROUP BY region, measurement;
(5)
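As a rough illustration of what both answers compute, the sketch below reproduces the grouped average in base R with aggregate(). Table 2.2 is not reproduced in this paper, so the rows of subtab.df here are made-up values used only to show the mechanics; the column names region, measurement and value are the ones used in the answers above.

```r
# Hypothetical long-form data: these values are invented for illustration
# and are NOT the contents of Table 2.2.
subtab.df <- data.frame(
  region      = c("Africa", "Africa", "Asia", "Asia"),
  measurement = c("GDP", "GDP", "GDP", "GDP"),
  value       = c(1000, 3000, 5000, 7000)
)

# One row per (region, measurement) pair, holding the mean of value.
avg.df <- aggregate(value ~ region + measurement, data = subtab.df, FUN = mean)
print(avg.df)
```

Each output row corresponds to one group of the group_by() or GROUP BY clause.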
Table 2.1
Table 2.2
Question 3 [(5 + 2 + 3) + 5 + 5 + 5 = 25 marks]
a) The data set europe.csv provides the values of economic indicators in Europe, as shown in Table 3.1. Answer the following questions about the code in Figure 3.1.
i) Describe the results of implementing the code in Figure 3.1.
a. A CSV file is read. (1)
b. The column Country is converted into row labels. (1)
c. A data frame is initialised with 20 zeros. (1)
d. A loop is run 20 times; in each iteration, kmeans is run with a different number of clusters. (1)
e. The result is a table with triplets of values for k, ss and tot. (1)
ii) Why is the scale command used? What does it do?
a. The scale command standardises all the variables, in the usual statistical sense. For each variable, the mean and standard deviation are calculated; then the variable's mean is subtracted from each of its values and the result is divided by the standard deviation, so that the resulting variable has mean 0 and standard deviation 1. (1)
b. Numerical variables are standardised for cluster analysis because the clusters are formed based on how different the variable values are, and this should not depend on the scale on which each variable is measured. (1)
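The standardisation described in this answer can be checked with a small sketch. The two columns below are hypothetical values chosen for illustration, not data from europe.csv; the point is only that scale() agrees with the by-hand calculation (subtract the mean, divide by the standard deviation).

```r
# Hypothetical values, for illustration only (not taken from europe.csv).
x <- data.frame(gdp = c(10, 20, 30, 40), unemployment = c(2, 4, 4, 6))

# scale() centres each column on its mean and divides by its standard deviation.
z <- scale(x)

# The same calculation done by hand for one column.
z_manual <- (x$gdp - mean(x$gdp)) / sd(x$gdp)

print(colMeans(z))       # column means after scaling
print(apply(z, 2, sd))   # column standard deviations after scaling
```

After scaling, both columns contribute on the same footing to any distance calculation, whatever units they were originally measured in.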
iii) Figure 3.2 is a plot of ss.df$ss against ss.df$k. What does this graph tell you? Which k would you choose and why?
a. It shows how much of the variation is explained as the number of clusters increases; at k = 20, almost 100% is reached. (1)
b. For k: a value between 6 and 8. (1)
c. Possible explanations: (1)
i. Around 75% of the variation is explained.
ii. Elbow rule: each further increase in k adds less to the variation explained.
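Figure 3.1 is not reproduced in this text, so the following is only a sketch of a loop of the kind the answers to i) and iii) describe, not the exam's actual code. It assumes that ss holds the between-cluster share of the total sum of squares and that tot holds the total within-cluster sum of squares; the built-in USArrests data set stands in for europe.csv.

```r
# Assumption: USArrests (shipped with R) stands in for europe.csv.
set.seed(1)                # k-means starts from random centres
x <- scale(USArrests)      # standardise before clustering

# One row per candidate k; ss and tot start as 20 zeros each.
ss.df <- data.frame(k = 1:20, ss = 0, tot = 0)

for (k in 1:20) {
  fit <- kmeans(x, centers = k, nstart = 10)
  ss.df$ss[k]  <- fit$betweenss / fit$totss   # share of variation explained
  ss.df$tot[k] <- fit$tot.withinss            # remaining within-cluster SS
}

print(ss.df)
```

Calling plot(ss.df$k, ss.df$ss, type = "b") on the result draws a curve of the same kind as Figure 3.2, on which the elbow can be read off.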
Figure 3.1
Table 3.1
Figure 3.2
b) In the context of hierarchical cluster analysis, explain what a linkage method measures. Explain the two linkage methods single and complete.
Linkage methods are ways of defining the distance between clusters. They are calculated from the distances between pairs of points, one in each cluster. For example: (3)
a. Single linkage: for two clusters C1 and C2, find the pair of points, one in C1 and one in C2, that are the shortest distance apart. That distance is the distance between the clusters. (1)
b. Complete linkage: the same idea, but find the pair of points, one in each cluster, that are the furthest apart. (1)
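To make the two definitions concrete, the sketch below applies them to the clusters {A, B} and {C, D}, using the four pairwise distances between those points given in the distance table of part c). The choice of these two clusters is only for illustration.

```r
# Pairwise distances between cluster {A, B} (rows) and cluster {C, D} (columns),
# copied from the distance table in part c).
d <- matrix(c(9, 12,
              7, 10),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("C", "D")))

single_linkage   <- min(d)   # closest pair across the two clusters
complete_linkage <- max(d)   # farthest pair across the two clusters

print(c(single = single_linkage, complete = complete_linkage))
```

Single linkage uses the pair B and C, while complete linkage uses the pair A and D.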
c) The Euclidean distances between points A, B, C, D, E and F are shown in Table 3.4. Draw a reasonably accurate dendrogram, including vertical scale, that corresponds to a hierarchical collection of clusters for these points, using single linkage.
Table 3.2
       A      B      C      D      E      F
A      0      2      9     12      6    8.5
B      2      0      7     10      5      7
C      9      7      0      3      6      4
D     12     10      3      0      8      6
E      6      5      6      8      0      2
F    8.5      7      4      6      2      0
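One way to check a hand-drawn dendrogram for part c) is to run single-linkage clustering on the distance table above. This is a sketch using hclust() from base R, not a model answer from the marking scheme. With these distances, A and B merge at height 2, E and F at 2, C and D at 3, then {C, D} joins {E, F} at 4, and {A, B} joins the rest at 5.

```r
pts <- c("A", "B", "C", "D", "E", "F")

# Distance matrix copied from the table above.
m <- matrix(c(0,   2,  9, 12, 6, 8.5,
              2,   0,  7, 10, 5, 7,
              9,   7,  0,  3, 6, 4,
              12,  10, 3,  0, 8, 6,
              6,   5,  6,  8, 0, 2,
              8.5, 7,  4,  6, 2, 0),
            nrow = 6, byrow = TRUE, dimnames = list(pts, pts))

# Single linkage: cluster distance is the smallest pairwise distance.
hc <- hclust(as.dist(m), method = "single")

print(hc$height)   # merge heights, in the order the merges happen
```

Calling plot(hc) draws the dendrogram, with the merge heights on the vertical scale.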
d) Suppose you are performing a cluster analysis using a data set consisting of ten binary variables V1 to V10. Two of the cases are shown in rows C1 and C2 of Table 3.2.
a. Calculate the simple matching and Jaccard measures of similarity for these two cases.
i. Simple matching: 5/10 = 0.5 (1)
Jaccard: 3/(10 - 2) = 0.375 ≈ 0.38 (1)
b. Also explain how you would decide which of these two measures to use.
i. We choose Jaccard if, for most of the variables, two cases both having zero for a variable doesn't indicate significant similarity. For example, if 0 means Does not own
c. How do the similarity measures relate to measures of distance between data points?
i. d = 1 - s (1)
(5)
      V1   V2   V3   V4   V5   V6   V7   V8   V9   V10
C1     1    1    1    0    1    1    1    1    0     0
C2     0    1    0    0    1    0    1    0    1     0
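The two similarity measures can be verified directly from the rows C1 and C2 above. The sketch counts the 1-1 matches (3) and the 0-0 matches (2); simple matching counts both kinds of match, while Jaccard drops the 0-0 matches from numerator and denominator. The variable names are chosen here for illustration.

```r
c1 <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 0)
c2 <- c(0, 1, 0, 0, 1, 0, 1, 0, 1, 0)

both_one  <- sum(c1 == 1 & c2 == 1)   # variables where both cases are 1
both_zero <- sum(c1 == 0 & c2 == 0)   # variables where both cases are 0
n <- length(c1)

simple_matching <- (both_one + both_zero) / n   # all matches count
jaccard <- both_one / (n - both_zero)           # 0-0 matches are ignored

# Corresponding distances, using d = 1 - s.
print(c(simple_matching = 1 - simple_matching, jaccard = 1 - jaccard))
```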
Question 4 [(3 + 3 + 3) + (3 + 3 + 5 + 5) = 25 marks]
We consider data from the U.S. Bureau of Transportation Statistics to predict whether an accident will result in injuries, based on the five initial factors that are recorded in the emergency call. The goal is to optimize the decision of when to send an ambulance and when to send only the fire brigade.
The variables from the codebook are:
vehl_invl: Number of vehicles involved
alchl_i: Alcohol involved = 1, not involved = 2
mancol_i_r: 0 = no collision, 1 = head-on, 2 = other form of collision
rel_rwy_r: 1 = accident on roadway, 0 = not on roadway
spd_lim: Speed limit, miles per hour
Half the data was randomly selected as training data, and the tree in Figure 4.1 was constructed:
Figure 4.1
(a) Consider the process of building the initial tree.
i) At the root node, for example, the classification tree algorithm splits the data based on whether the number of cars involved is below or above 3. How is this choice made?
i. For each choice of variable, and for each value of that variable that is a possible splitting point, the deviance of the resulting tree is calculated; the variable and splitting point that minimise the deviance are chosen. The deviance may be defined in terms of the Gini, entropy or misclassification index. (2)
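The three indices named in this answer can be written as small functions of the class proportions in a node. This is a sketch of the standard definitions, not the internals of any particular tree package; the function names are chosen here for illustration, and the root proportions (0.49, 0.51) are the ones printed in the decision rules of part (b).

```r
# Impurity of a node, given the vector p of class proportions in that node.
gini     <- function(p) 1 - sum(p^2)
entropy  <- function(p) { p <- p[p > 0]; -sum(p * log2(p)) }
misclass <- function(p) 1 - max(p)

root <- c(0.49, 0.51)   # class proportions at the root node

print(c(gini = gini(root), entropy = entropy(root), misclass = misclass(root)))
```

A candidate split is scored by the weighted average impurity of the two child nodes, with weights given by the share of cases sent to each child; the split with the lowest score is chosen.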
ii) Explain how we choose the value of the target variable to assign to a leaf.
i. The training data is run through the tree. For a categorical target variable, the leaf is assigned the category that has the most representatives in that leaf. (2)
iii) What does this mean for any accident report with at least 5 cars involved?
i. It will always be predicted as TRUE, regardless of the other attributes. (2)
(b) Now consider the completed trees.
i) How many terminal nodes are in the tree?
i. 6 (2)
ii) How would you reach the second last terminal node from the right? (TRUE .41 .59)
i. 0 to 2 cars and alcohol is involved (3)
iii) What are the variables and their values needed to visit the decision node spd_lim >= 48?
i. veh_invl < 3 and alchl_i >= 2 and mancol_i_r < 2 and rel_rwy_r = 1 (3)
iv) Write down the decision rules for this tree.
1) root TRUE (0.49 0.51)
  2) veh_invl< 3 FALSE (0.51 0.49)
    4) alchl_i>=1 FALSE (0.51 0.48)
      8) mancol_i_r>=1 FALSE (0.54 0.46) *
      9) mancol_i_r< 2 TRUE (0.48 0.52)
        18) rel_rwy_r< 0.5 FALSE (0.53 0.47) *
        19) rel_rwy_r>=0.5 TRUE (0.38 0.62)
          38) spd_lim>=48 FALSE (0.65 0.35) *
          39) spd_lim< 48 TRUE (0.21 0.79) *
    5) alchl_i< 2 TRUE (0.40 0.59) *
  3) veh_invl>=3 TRUE (0.35 0.64) *
Note: there needs to be some kind of indentation or else statement (3)
Calculate the accuracy for the confusion matrix on the test set. Is it a good classifier? Explain why:
                   Predicted
                   False    True
Actual   False       414      84
         True        347     155
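The accuracy asked for here is the share of test cases on the diagonal of the confusion matrix. The sketch below assumes that rows are actual classes and columns are predicted classes; the accuracy itself is the same under either orientation, since only the diagonal and the total are used. No model answer for this part appears in the text, so the calculation is offered only as a worked check.

```r
# Confusion matrix on the test set (assumed layout: rows actual, columns predicted).
cm <- matrix(c(414,  84,
               347, 155),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual    = c("False", "True"),
                             predicted = c("False", "True")))

accuracy <- sum(diag(cm)) / sum(cm)   # (414 + 155) / 1000
print(accuracy)
```

An accuracy of 0.569 is only slightly better than guessing when the two classes are roughly balanced, as the root node proportions (0.49, 0.51) suggest for the training data.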
Question 5 [6 + 6 + 3 = 15 marks]
An online provider of statistics courses is interested in assessing alternative sequencing and combinations of courses, and therefore wishes to conduct association analysis on its data for past students. Table 5.1 shows a sample of their data, with each row representing an individual student and each column representing a statistics course that they offer as identified by the column headings.
Table 5.1
ID   Intro   Expt design   StatWrite   Survey   DataMining   Cat Data   Regression   Forecast
 1     1          0            0          0          1           0           0           0
 2     0          1            0          1          0           1           0           0
 3     0          1            1          1          1           1           1           0
 4     1          0            0          0          0           0           0           0
 5     1          0            0          0          1           0           0           0
 6     0          0            0          0          1           0           1           1
 7     1          0            0          0          0           0           0           0
 8     0          1            1          0          0           1           0           1
 9     1          0            0          0          0           0           0           0
10     0          0            0          0          0           1           1           0
11     1          0            0          0          0           0           0           0
12     0          0            0          0          1           0           0           0
13     0          0            0          0          1           0           0           0
14     0          0            0          1          1           0           0           1
15     0          0            0          1          1           0           1           1
16     1          1            1          1          0           1           0           1
17     1          0            0          0          0           0           1           0
18     1          0            0          0          0           0           0           1
19     1          0            0          0          0           0           0           0
20     0          0            0          0          0           0           1           0
21     0          0            1          1          0           1           0           0
22     0          0            0          0          1           0           1           1
23     1          0            1          1          0           1           0           0
24     1          0            0          0          0           0           1           1
25     1          0            1          0          0           0           0           0
26     0          1            1          0          1           0           1           1
27     0          1            1          0          1           1           1           0
28     1          0            0          0          1           0           1           1
Consider the association rule {Forecast, Regression} → {DataMining}.
(a) Based on the sample provided in Table 5.1, calculate for this association rule:
The support of the association rule itemset: 5/28 = 0.18 (2)
The confidence of the association rule: 5/6 = 0.83 (2)
The lift of the association rule: (5/6)/(12/28) = 1.94 (2)
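The three figures can be reproduced from the Forecast, Regression and DataMining columns of Table 5.1. The sketch below copies those three columns for students 1 to 28 and applies the usual definitions; the variable names are chosen here for illustration.

```r
# Columns copied from Table 5.1, students 1 to 28.
forecast   <- c(0,0,0,0,0,1,0,1,0,0,0,0,0,1,1,1,0,1,0,0,0,1,0,1,0,1,0,1)
regression <- c(0,0,1,0,0,1,0,0,0,1,0,0,0,0,1,0,1,0,0,1,0,1,0,1,0,1,1,1)
datamining <- c(1,0,1,0,1,1,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,1,0,0,0,1,1,1)

n          <- length(forecast)
antecedent <- forecast == 1 & regression == 1   # took both antecedent courses
all_three  <- antecedent & datamining == 1      # took all three courses

support    <- sum(all_three) / n                   # 5 / 28
confidence <- sum(all_three) / sum(antecedent)     # 5 / 6
lift       <- confidence / (sum(datamining) / n)   # (5/6) / (12/28)

print(round(c(support = support, confidence = confidence, lift = lift), 2))
```

The lift divides the confidence by the overall share of students who took DataMining, so a value above 1 means the antecedent raises the chance of the consequent.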
(b) Interpret each of the numbers calculated in (a) in relation to the present application, and explain any role they may have in assessing the usefulness of the association rule.
a. Support: The support of the association rule is the proportion of the 28 past students who undertook all of Forecast, Regression and DataMining. Only five of them did, so this combination may not be the highest priority when making the scheduling convenient, and in that sense it is not a particularly useful association rule. However, these are units that go well together, so perhaps scheduling should be arranged to encourage taking them together. The small numbers could be due to bad scheduling; if this association rule made us aware of bad scheduling, then it is of use after all. (2)
b. Confidence: Among the students who took both Regression and Forecast, 5 out of 6 took DataMining. The association rule therefore has strong confidence: if this continues to hold for a larger sample, we might expect most students who take Forecast and Regression to go on to take DataMining, again indicating that scheduling should make this progression easier. (2)
c. Lift: The lift is high, which usually means the association rule is useful, as the presence of the antecedents makes the consequent considerably more likely. This again indicates that scheduling should be arranged to suit those who take all three. (2)
(c) If you find out that a student has taken the Forecast and Regression courses, does this make it more likely that they will take the DataMining course? If so, how much more likely?
a. Yes, it is a factor of 1.94 more likely; that is, nearly twice as likely. (3)
*** END OF EXAMINATION ***