
COMP60711: Data Engineering Part 2, Coursework 1 (20 marks), with thanks to Dr Paris Yiapanis

Make sure you justify your answers with technical evidence; when in doubt, give details!

1. Classifier: Behaviour (8 marks)
Note: this is the same as assignment #2 from the Data Mining course available at: http://www.kdnuggets.com/data_mining_course/assignments/assignment-5.html

Start with the genes-leukemia.csv dataset used in lab #1 (the file is on Moodle).
Use the field TREATMENT_RESPONSE as the target (predicted) field; it has the values Success, Failure, or ? (missing).

Step 1. Examine the records where TREATMENT_RESPONSE is non-missing.
i: Count the number of such records, and describe these records using other sample fields (e.g. Year from XXXX to YYYY, or Gender = X, etc.) (1 mark)
ii: Explain why it is not correct to build predictive models for TREATMENT_RESPONSE using records where it is missing. (1 mark)
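
If you want to double-check your counts outside the Explorer, a minimal sketch using the Weka Java API is given below. The attribute name TREATMENT_RESPONSE and the file genes-leukemia.csv come from the assignment; the Java class name and everything else is illustrative only.

    import weka.core.Attribute;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CountResponses {
        public static void main(String[] args) throws Exception {
            // DataSource picks a loader from the file extension, so it can read the CSV directly
            Instances data = DataSource.read("genes-leukemia.csv");
            Attribute response = data.attribute("TREATMENT_RESPONSE");

            int nonMissing = 0;
            for (int i = 0; i < data.numInstances(); i++) {
                if (!data.instance(i).isMissing(response)) {
                    nonMissing++;
                }
            }
            System.out.println("Records with a non-missing TREATMENT_RESPONSE: " + nonMissing);
        }
    }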

Step 2. Select only the records with non-missing TREATMENT_RESPONSE. Keep SNUM (the sample number) but remove sample fields whose values are all the same or all missing. Call the reduced dataset genes-reduced.csv.
iii: Which sample fields should you keep? (1 mark)
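
As a cross-check on your manual selection (not a replacement for it), a rough sketch of Step 2 with the Weka Java API might look like the following. Note that RemoveUseless drops attributes that never vary, but depending on its variance threshold it can also drop near-unique identifiers such as SNUM, so inspect the output and keep SNUM as the assignment requires.

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.CSVSaver;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.RemoveUseless;

    public class ReduceDataset {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("genes-leukemia.csv");
            data.setClass(data.attribute("TREATMENT_RESPONSE"));

            // Keep only the records where the class value is present
            data.deleteWithMissingClass();

            // Drop attributes whose values never vary; check that SNUM survives this step
            RemoveUseless filter = new RemoveUseless();
            filter.setInputFormat(data);
            Instances reduced = Filter.useFilter(data, filter);

            // Write the reduced dataset back out as CSV
            CSVSaver saver = new CSVSaver();
            saver.setInstances(reduced);
            saver.setFile(new File("genes-reduced.csv"));
            saver.writeBatch();
        }
    }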

Step 3. Build a J48 model using 10-fold cross-validation.
iv: Show the tree and compute the expected error rate. (1 mark)
v: What are the important variables and their relative importance, according to J48? (1 mark)
vi: Remove the top predictor and re-run J48. What do you get, and why? Report the new tree and error rates. (1 mark)
vii: Based on the results in (vi), do you think the tree that you found with the original data is significant? Justify your answer. (2 marks)
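
The Explorer is the intended tool here, but if you prefer to reproduce the numbers programmatically, a sketch of the same experiment with the Weka Java API is shown below. "TOP_GENE" is a placeholder for whatever attribute appears at the root of your tree, not an actual field name.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TreeExperiment {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("genes-reduced.csv");
            data.setClass(data.attribute("TREATMENT_RESPONSE"));

            // Fit on all of the data so the tree itself can be printed (iv)
            J48 tree = new J48();
            tree.buildClassifier(data);
            System.out.println(tree);

            // 10-fold cross-validation gives the expected error rate (iv)
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("Expected error rate: %.2f %%%n", eval.pctIncorrect());

            // For (vi): remove the top predictor and repeat the two steps above.
            // "TOP_GENE" is a placeholder for the attribute at the root of your tree.
            Instances without = new Instances(data);
            without.deleteAttributeAt(without.attribute("TOP_GENE").index());
        }
    }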

2. Classifiers: Accuracy (4 marks)
i. Compare classifiers ZeroR and OneR against J48 on multiple datasets in terms of accuracy: (2 marks)

  1. Start the Experimenter.
  2. Add datasets: iris, breast-cancer, credit-g, diabetes, glass, ionosphere, segment-challenge.
  3. Add classifiers J48, ZeroR, OneR (in this order).
  4. Leave the settings as 10-fold cross-validation and 10 repetitions.
  5. DO NOT write the results to a file.
  6. Run the experiment and analyse the results.
  7. Show the results table produced by Weka and discuss how ZeroR and OneR compare against J48 on each dataset used in this experiment (an informal programmatic cross-check is sketched after this list).
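
The Experimenter, with its 10 repetitions and corrected paired t-test, is the required route for the marks. If you just want an informal sanity check of the accuracies, a single 10-fold cross-validation per dataset can be run from the Java API, roughly as sketched below; it assumes the standard .arff files from Weka's data directory are in the working directory and that the class attribute is the last one in each file.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.OneR;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AccuracyCheck {
        public static void main(String[] args) throws Exception {
            String[] files = { "iris.arff", "breast-cancer.arff", "credit-g.arff",
                               "diabetes.arff", "glass.arff", "ionosphere.arff",
                               "segment-challenge.arff" };
            for (String file : files) {
                Instances data = DataSource.read(file);
                data.setClassIndex(data.numAttributes() - 1);   // class is the last attribute in these files
                System.out.println(file);
                for (Classifier model : new Classifier[] { new J48(), new ZeroR(), new OneR() }) {
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(model, data, 10, new Random(1));
                    System.out.printf("  %-5s %6.2f %% correct%n",
                            model.getClass().getSimpleName(), eval.pctCorrect());
                }
            }
        }
    }

This single run will not match the Experimenter's significance testing, but it should put the accuracies in the same ballpark.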

ii. Compare the OneR classifier against ZeroR in terms of accuracy: (2 marks)

  1. Go back to the Experimenter.
  2. On the Analyse tab find the Test base option and select OneR.
  3. Now the other two classifiers will be compared against OneR.
  4. Click Perform Test.
  5. Show the results table produced by Weka and discuss how OneR compares against ZeroR on each dataset.

3. Classifiers: Training Time Comparison (4 marks)
Generate artificial datasets of different sizes:

  1. Open Weka GUI chooser.
  2. Click on Explorer.
  3. Under the Preprocess tab, click the Generate button.
  4. Click Choose and select classifiers > classification > LED24.
  5. Once LED24 is selected, click next to (to the right of) the Choose button to configure the parameters of the generator.

  6. In the numExamples field insert 100000.
  7. Click Generate. It may take a few seconds to generate the file.
  8. You have just generated an artificial file for classification with 100K instances.
  9. Click the Save button and save the file on your disk under the name led100K.arff
  10. Repeat the same process and generate datasets with 200000, 300000, 400000, and 500000 instances, named led200K.arff, led300K.arff, led400K.arff, and led500K.arff respectively (a programmatic alternative is sketched after this list).

  11. Close the Explorer.
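
Generating the five files by hand in the Explorer works fine, but the repetition is tedious. The sketch below does the same thing programmatically; it assumes Weka's DataGenerator.makeData helper and the standard generator options -n (number of examples) and -o (output file) behave this way in your Weka version, so treat it as illustrative rather than definitive.

    import weka.datagenerators.DataGenerator;
    import weka.datagenerators.classifiers.classification.LED24;

    public class MakeLedData {
        public static void main(String[] args) throws Exception {
            int[] sizes = { 100, 200, 300, 400, 500 };   // thousands of instances
            for (int k : sizes) {
                // -n sets numExamples, -o names the output file (standard DataGenerator options)
                DataGenerator.makeData(new LED24(),
                        new String[] { "-n", String.valueOf(k * 1000),
                                       "-o", "led" + k + "K.arff" });
            }
        }
    }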

Run the classifiers on all the datasets:

  1. Start the Experimenter.
  2. Click New.
  3. For the experiment type choose Train/Test Percentage Split (data randomized). DO NOT choose cross-validation, otherwise it will take too long to complete.

  4. Also for number of repetitions choose 1.
  5. Add the algorithms J48, NaiveBayesSimple.
  6. Add the five datasets generated in the previous step.
  7. For the results destination choose CSV and the name & destination of the output file.
  8. Run the experiment. This may take a few minutes.
  9. Examine the file with the results.

(i) Plot the training time for each classifier (the Elapsed_Time_training column from the file) against data size (i.e. the number of instances used). Explain what you observe and your understanding of the relationship between training time and data size (include the graph in your answer). (2 marks)
(ii) What do you think would happen if we continued increasing the number of instances? Which algorithm would be more suitable for very large numbers of instances, and why? (2 marks)
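
The Elapsed_Time_training column from the Experimenter's CSV output is what you should plot, but you can cross-check those numbers by timing buildClassifier() directly, as in the rough sketch below. It uses NaiveBayes rather than NaiveBayesSimple, since the latter may not be installed by default in recent Weka releases, and single-run timings like these are noisy.

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TimingCheck {
        public static void main(String[] args) throws Exception {
            String[] files = { "led100K.arff", "led200K.arff", "led300K.arff",
                               "led400K.arff", "led500K.arff" };
            for (String file : files) {
                Instances data = DataSource.read(file);
                data.setClassIndex(data.numAttributes() - 1);   // LED24 puts the class attribute last
                for (Classifier model : new Classifier[] { new J48(), new NaiveBayes() }) {
                    long start = System.nanoTime();
                    model.buildClassifier(data);                // training only, no evaluation
                    double seconds = (System.nanoTime() - start) / 1e9;
                    System.out.printf("%s  %s  %.2f s%n",
                            file, model.getClass().getSimpleName(), seconds);
                }
            }
        }
    }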

4. Classifiers: Memory Usage Comparison (4 marks)
Copy the file J48MemTest.jar to the directory that contains the datasets created in Q3. J48MemTest.jar creates a classification model using J48 and measures memory consumption during that process. Open a terminal and move to the directory containing your datasets. Run the following command to get the memory usage for the dataset led24_100_000.arff:

java -jar J48MemTest.jar led24_100_000.arff
Record the memory usage in MB. Repeat the experiment for the remaining datasets, so that you have measurements for all five (100K-500K). Plot the memory usage of J48 against the data size (i.e. the number of instances used). Explain the memory usage (include the graph in your answer). (4 marks)
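
J48MemTest.jar is the tool you should use for the marks, and its internals are not shown here. A very rough, JVM-dependent way to approximate the same kind of measurement with standard Java calls is sketched below; heap readings depend on when garbage collection happens, so treat the numbers as indicative only.

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RoughMemCheck {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read(args[0]);          // e.g. led100K.arff
            data.setClassIndex(data.numAttributes() - 1);

            Runtime rt = Runtime.getRuntime();
            rt.gc();                                            // encourage a collection before measuring
            long before = rt.totalMemory() - rt.freeMemory();

            J48 tree = new J48();
            tree.buildClassifier(data);

            long after = rt.totalMemory() - rt.freeMemory();
            System.out.printf("Approximate extra heap used by J48: %.1f MB%n",
                    (after - before) / (1024.0 * 1024.0));

            // Touch the model afterwards so it stays reachable while the heap is read
            System.out.println("Model built, description length: " + tree.toString().length());
        }
    }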

Further Reading: In the big data context, an issue is that many algorithms try to fit all of the data in memory to create the classification model. For example, try to generate 1M instances of the LED24 dataset and run J48; the algorithm crashes (out of memory, using Weka's default 1 GB memory settings). A solution is to use incremental (a.k.a. updatable) algorithms that process one instance at a time rather than loading the entire dataset into memory. If you are interested, play with this using Weka's command-line interface (SimpleCLI) and run the incremental versions of the algorithms the tool provides.
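
If you do explore the incremental route, the usual Weka pattern is to pair an updateable classifier with the ArffLoader's instance-by-instance reading, roughly as sketched below. NaiveBayesUpdateable is used purely as an example of an updateable scheme, and the file name is a placeholder.

    import java.io.File;
    import weka.classifiers.bayes.NaiveBayesUpdateable;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader;

    public class IncrementalTrain {
        public static void main(String[] args) throws Exception {
            ArffLoader loader = new ArffLoader();
            loader.setFile(new File("led500K.arff"));           // placeholder file name

            // Read only the header, then stream instances one at a time
            Instances structure = loader.getStructure();
            structure.setClassIndex(structure.numAttributes() - 1);

            NaiveBayesUpdateable model = new NaiveBayesUpdateable();
            model.buildClassifier(structure);                   // initialise with the header only

            Instance current;
            while ((current = loader.getNextInstance(structure)) != null) {
                model.updateClassifier(current);                // never holds the full dataset in memory
            }
            System.out.println(model);
        }
    }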
