COMP60711: Data Engineering, Part 2, Coursework 2 (with thanks to Dr Paris Yiapanis)
Make sure you justify your answers with technical evidence; when in doubt, give details!
1. Clustering (12 marks)
This part looks at clustering, an (unsupervised) learning technique not covered in class. Your goal is to understand the basics of a simple clustering algorithm, apply it to a sample dataset, and draw conclusions about your findings.
Begin by reading and studying the material on clustering found in the slide set dm13-clustering-abbreviated.pdf, in the assignment area on Blackboard.
1.1. Consider the questions found on slides 17 and 18:
i. Results obtained using K-Means can vary significantly depending on the initial choice of seeds (number and position). Specifically, the algorithm can get trapped in a local minimum. Explain your understanding of this statement, i.e. why does this problem occur? Use a graphical example (not the one from the slides) and a technical argument to support your answer. (4 marks)
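For intuition, the effect is easy to reproduce outside Weka. The sketch below (scikit-learn rather than Weka; the synthetic data and all parameter choices are illustrative assumptions, not part of the coursework) runs K-Means several times with a single random initialisation and shows that the final sum of squared errors depends on where the seeds start:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 4 well-separated groups (purely illustrative).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

# k matches the true number of groups, but each run uses a single random
# initialisation, so the final sum of squared errors depends on the seeds.
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed)
    km.fit(X)
    print(f"seed={seed}  SSE={km.inertia_:.1f}")

# Runs whose seeds all land inside the same true group typically end with one
# group split in two and two other groups merged; no single reassignment of
# points or recomputation of centroids can then lower the SSE, so K-Means
# stops at a local (not global) minimum.
```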
1.2. Use Weka to cluster the following dataset, baskball.arff (available from Moodle), using the SimpleKMeans method.
ii: Incrementally increase the number of clusters K, from 2 up to 50. What do you observe regarding the sum-of-squared-errors metric? How would you explain your observation? (3 marks)
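The same sweep can be reproduced outside Weka as a point of reference. The sketch below (scikit-learn on synthetic data, an illustrative stand-in for baskball.arff) prints the within-cluster SSE for each K from 2 to 50:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for baskball.arff (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=1)

# Sweep k from 2 to 50 and record the within-cluster sum of squared errors,
# the quantity Weka reports as "Within cluster sum of squared errors".
for k in range(2, 51):
    sse = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    print(k, round(sse, 1))
```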
iii: Generate a cluster model with k=3 clusters. What can you observe regarding players that have played for up to 25 minutes? (Hint: after generating the model, visualise the cluster assignments by right-clicking on the model and plotting Cluster vs time_played.) (2 marks)
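If you want to cross-check the Weka visualisation, the following sketch does the same outside Weka (scikit-learn and matplotlib; the file path and the use of raw, unnormalised attributes are assumptions, and only time_played is an attribute name taken from the hint above):

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.io import arff
from sklearn.cluster import KMeans

# Load the dataset (path assumed) and keep the numeric attributes.
data, _ = arff.loadarff("baskball.arff")
df = pd.DataFrame(data).select_dtypes("number")

# Fit k=3 clusters on the raw attributes and plot the assignments
# against time_played, mirroring the Weka plot suggested in the hint.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(df)
plt.scatter(df["time_played"], labels, c=labels)
plt.xlabel("time_played")
plt.ylabel("cluster")
plt.show()
```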
iv: Using the cluster information you generated in the previous step, you are asked, as a basketball coach, to select your preferred cluster by considering the elements within the clusters. Use the cluster visualisation facility to justify your selection. In addition, please provide the Clusterer output given by Weka (i.e. all details about all clusters; this information appears in the text area of Weka after you generate the cluster model). (3 marks)
2. Association rules: Mining a real-world dataset (8 marks)
Consider a real-world dataset, vote.arff, which gives the votes of 435 U.S. congressmen on 16 key issues gathered in the mid-1980s, and also includes their party affiliation as a binary attribute. This is a purely nominal dataset with some missing values (corresponding to abstentions). It is normally treated as a classification problem, the task being to predict party affiliation based on voting patterns. However, association-rule mining can also be applied to this data to seek interesting associations.
i. Run Apriori on this data with default settings. Comment on the rules that are generated. Comment also on their support, confidence, and lift. (5 marks)
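For reference, the sketch below shows where the support, confidence and lift figures in the Weka output come from, using pandas and mlxtend rather than Weka's Apriori (the fixed thresholds, file path and one-hot encoding are assumptions; Weka's defaults instead lower the support threshold step by step until 10 rules with confidence >= 0.9 are found):

```python
import pandas as pd
from scipy.io import arff
from mlxtend.frequent_patterns import apriori, association_rules

# Load the dataset (path assumed); nominal values come back as bytes.
data, _ = arff.loadarff("vote.arff")
df = pd.DataFrame(data).applymap(lambda b: b.decode())

# One boolean column per attribute=value pair, the item representation
# Apriori works on (missing votes become their own '?' items here).
onehot = pd.get_dummies(df).astype(bool)

frequent = apriori(onehot, min_support=0.4, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("confidence", ascending=False).head(10))

# For a rule A => B over N instances:
#   support(A => B)    = count(A and B) / N
#   confidence(A => B) = count(A and B) / count(A)
#   lift(A => B)       = confidence(A => B) / (count(B) / N)
```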
ii. It is interesting to see that none of the rules in the default output involve Class = republican. Why do you think that is? (3 marks)
NOTES
- More information on the data appears in the comments in the ARFF file.
- If you wish, you could also use the Visualize tab in the Weka Explorer to visualise how the instances are distributed (e.g. plot Class against one of the important attributes, or compare two important attributes and see how they are distributed across the classes). You might find it helpful to increase the jitter to expose hidden instances.
- You could even cluster the data and visualize the clusters vs class.
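As an illustration of that last note, the sketch below (scikit-learn/pandas rather than the Weka Explorer; the file path, k=2 and the encoding are assumptions) clusters the votes and cross-tabulates cluster membership against Class:

```python
import pandas as pd
from scipy.io import arff
from sklearn.cluster import KMeans

# Load the dataset (path assumed); nominal values come back as bytes.
data, _ = arff.loadarff("vote.arff")
df = pd.DataFrame(data).applymap(lambda b: b.decode())

# Encode the 16 votes numerically (one column per attribute=value pair)
# and leave the class attribute out of the clustering.
X = pd.get_dummies(df.drop(columns=["Class"]))

# Two clusters, then a contingency table of cluster membership vs party.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(pd.crosstab(clusters, df["Class"]))
```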