MASTER OF COMPUTER AND INFORMATION SCIENCES
COMP809
Data Mining & Machine Learning
ASSIGNMENT TWO
DATA MINING PROJECT
Semester 2, 2019
Due: Monday 22 October at 12 midnight
Weighting: 70%
This assignment represents the major piece of work on this course and accordingly carries 70% of your course mark. The main requirements are: experimentation with one or more Mining packages (not necessarily restricted to the ones used in this course) and post-experiment analysis. Some basic programming may be needed to pre-/post-process the data.
The results and analysis of your investigation will be presented in written form. The written report will need to include the following:
Full description of the application domain, including a clear statement of the overall purpose of the Mining exercise that you have performed.
A description of the data set that you targeted, together with any transformations that were required to pre-process the data for the Mining exercise. Ensure that you present samples of both raw data and pre-processed data.
An explanation of how your selected Mining algorithms work (no more than half a page for each algorithm). Ensure that you draw your material from a variety of sources. Your description of the algorithms should not just be a passive account of how they work, but should also include an examination of their strengths and limitations.
Detailed experimental study of the performance (might be any performance dimension, e.g. computational complexity, usability, visualization of the outcome) of the selected Mining algorithms on your selected data set. Include actual outputs from your Mining software as supporting evidence. You will need to give your own definition of performance which may include a variety of factors. Your performance measures must take into account aspects of accuracy and time.
In your experimental study you should run several different experiments consisting of various combinations of algorithms and pre-processing methods (e.g. different methods of feature selection, different combinations of parameter values). For example, if you experiment with a different algorithms, f different feature selection methods, p different sets of parameter values and b different boosting methods, then you will run a total of t = a*f*p*b different experiments (e.g. a = 3, f = 2, p = 4 and b = 2 gives t = 48 experiments). The number t can be quite large depending on your choice of a, f, p and b. Thus, as part of your report you will need to produce an experimental plan describing the strategy you used to keep the total number of experiments down to a manageable number whilst not sacrificing performance.
An analytical comparison (this can include statistical methods) of the performance of the algorithms, together with an explanation of the superior performance of the winner. You may use the Experimenter module in Weka for this purpose. Your analysis should also include suitable visualizations (model diagrams, ROC curves, whatever is appropriate) that compare the performances of your winner and runner-up algorithms. Your winner should then be compared to any significant (data mining) work previously undertaken on the data set you selected (if any). In your experimental study you will have defined a number of different performance measures; these measures should (a) be used on their own and (b) be combined into a single measure. To combine several measures into one, use a linear weighted model with weights chosen by yourself and backed up by suitable justification.
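To give a concrete feel for the linear weighted model, the minimal sketch below combines several measures into a single score. The metric names, values and weights are hypothetical examples only; your own measures must cover at least accuracy and time, and your own weights must be justified in the report.

# Minimal sketch of a linear weighted performance score.
# The metric names, values and weights below are hypothetical examples.

def combined_score(metrics, weights):
    """Combine several normalised performance measures into one score."""
    assert set(metrics) == set(weights), "every metric needs a weight"
    return sum(weights[name] * value for name, value in metrics.items())

# Accuracy and F1 are already in [0, 1]; training time (here 12.4 s) is
# rescaled against an assumed 60 s budget so that faster models score higher.
metrics = {"accuracy": 0.91, "f1": 0.88, "speed": 1 - 12.4 / 60.0}
weights = {"accuracy": 0.5, "f1": 0.3, "speed": 0.2}

print(f"combined score = {combined_score(metrics, weights):.3f}")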
Your report should not exceed 15 pages (with standard single spacing and a font size of 12). Summarize results in the form of tables, graphs and other methods of visualization. There is no need to include detailed Weka screenshots.
Choice of Datasets
You are given a choice of two datasets. Use only one of them. No other dataset should be used.
Internet Advertisements (3279 samples, 1558 features)
https://archive.ics.uci.edu/ml/datasets/Internet+Advertisements
This dataset represents a set of possible advertisements on Internet pages. The features encode the geometry of the image (if available) as well as phrases occurring in the URL, the image's URL and alt text, the anchor text, and words occurring near the anchor text. The task is to predict whether an image is an advertisement (ad) or not (nonad).
There are two main challenges in mining this dataset: high dimensionality and class imbalance. Both of these challenges will need to be overcome through suitable pre-processing of the dataset.
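As an illustration only (you are free to do all of this inside Weka with its attribute selection and sampling/cost-sensitive filters), the following scikit-learn sketch shows one way to address both challenges in a single pipeline. The synthetic data merely stands in for the UCI dataset, and all parameter values (number of selected features, classifier, scoring) are assumptions that you would need to tune and justify.

# Illustrative sketch only: feature selection plus a class-weighted classifier.
# The synthetic data below stands in for the real UCI Internet Advertisements file.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for the 3279-sample, 1558-feature, imbalanced ad/nonad data (assumption).
X, y = make_classification(n_samples=3279, n_features=1558,
                           weights=[0.86, 0.14], random_state=0)
X = abs(X)  # chi2 requires non-negative feature values

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=100)),                 # tackles high dimensionality
    ("clf", LogisticRegression(class_weight="balanced",   # tackles class imbalance
                               max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())

Placing the feature selector inside the pipeline ensures it is re-fitted within each cross-validation fold, so no information leaks from the test folds into the selection step.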
Annotation Face Data Set (606 samples)
The Annotation Face database contains the ground truth for a data set of images in table format, where the columns are values for face position, right eye position, left eye position, nose position, right corner of mouth and left corner of mouth.
Apply three clustering approaches (K-means, DBSCAN and EM) to cluster the data set and compare their performance using the Dunn index (DI) and cluster Silhouette measures. Evaluate the clusters with suitable visualization tools and identify the best cluster configuration for each clustering algorithm on the basis of your visualization. In your post-processing analysis, explain the contribution of the features to each cluster.
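If you choose to work outside Weka, the sketch below shows the three clusterers and the Silhouette measure in scikit-learn; the Dunn index is not provided by scikit-learn, so a small helper implements its usual definition. The random data, number of clusters and DBSCAN parameters are placeholders for the face-annotation coordinates and must be replaced and tuned.

# Minimal sketch: K-means, DBSCAN and EM with Silhouette and a Dunn index helper.
# The random data and all parameter values are placeholders (assumptions).
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

def dunn_index(X, labels):
    """Dunn index: minimum inter-cluster distance / maximum intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels) if k != -1]
    diameters = [cdist(c, c).max() for c in clusters]
    separations = [cdist(a, b).min() for i, a in enumerate(clusters)
                   for b in clusters[i + 1:]]
    return min(separations) / max(diameters)

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(606, 12)))  # placeholder data

for name, model in [("K-means", KMeans(n_clusters=3, n_init=10, random_state=0)),
                    ("DBSCAN", DBSCAN(eps=0.9, min_samples=5)),
                    ("EM", GaussianMixture(n_components=3, random_state=0))]:
    labels = model.fit_predict(X)
    mask = labels != -1                       # drop DBSCAN noise points
    if len(set(labels[mask])) > 1:
        print(name, silhouette_score(X[mask], labels[mask]),
              dunn_index(X[mask], labels[mask]))
    else:
        print(name, "produced fewer than 2 clusters -- tune its parameters")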
Some Notes on Experimentation
Although Weka is capable of using both .arff and .csv files, I recommend that you use .arff files only. To convert a .txt file into .arff, two steps are involved. First read the .txt file into MS Excel, format it into columns, and then save the formatted version using the .csv option. Read the .csv file into Weka and then save it as an .arff file. Now open the .arff file and start working.
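If you prefer to script the first step rather than going through Excel, a minimal pandas sketch such as the one below (the file names and delimiter are assumptions) will turn a delimited .txt file into a .csv that Weka can open and re-save as .arff.

# Minimal sketch: convert a whitespace/tab-delimited .txt file into a .csv
# that Weka can load and save as .arff. File names are hypothetical.
import pandas as pd

df = pd.read_csv("raw_data.txt", sep=r"\s+", header=None)
df.columns = [f"attr{i}" for i in range(df.shape[1])]  # Weka expects a header row
df.to_csv("raw_data.csv", index=False)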
Make sure that you read the Exemplars for Assessment 2 that are available on Blackboard. These can be found in the Assessments/Assessment 2 folder in AUTonline.
Make use of Weka resources as much as possible: the Weka Explorer and Experimenter guides are available in AUTonline, and there is also a Weka forum online. Ask me for help by email if you require assistance, but plan well in advance. Because of the number of experiments and the written report required, you must not leave this too late. Start work in the week that the assessment is handed out (Week 7) and do some initial experimentation as soon as possible to get a feel for the datasets and/or algorithms that you would like to work with.
Read through the accompanying document (Experimental Framework) for suggestions on a generic experimental plan. The experimental plan is the key to successfully completing this project on time. The suggested plan needs to be adapted to your chosen dataset, as not all of its activities may be applicable to the dataset you have selected to mine.
Marking Scheme
The marking for this assessment will be done against four major headings, namely Overall Document Quality, Description of the Mining schemes used, Experimental Study, and Post-Processing Analysis. The detailed breakdown of the mark is as follows:
1. Overall Document Quality (10%)
Structure
Clarity of presentation
Referencing
2. Mining Schemes used (20%)
Justification for using schemes
Description of the Mining algorithms used, including the necessary theory, algorithm strengths and limitations
3. Experimental Study (50%)
Overall Experimental Plan
Pre-Processing (Feature Selection, Missing Value Estimation, Normalization, etc, as appropriate)
Parameter Tuning
Definition of Performance metric
Use of performance boosting techniques (if applicable to your project)
4. Post Processing Analysis (20%)
Analytical techniques used for performance comparison (between the algorithms you used)
Your interpretation of why your winning algorithm was better than your runner-up. You must include justification based on appropriate performance metrics
The success (or otherwise) of the application of performance boosting methods
Comparison of your results with previous research on the same dataset (if appropriate).