COMP723 Data Mining and Knowledge Engineering
Assignment 2 Text Classification (50%)

Objective
To develop a broad understanding of text mining by performing a representative task, Text Classification.
Collaborative Learning Requirement

As part of the assignment you will be required to demonstrate a SUSTAINED collaborative approach to completing the assignment in pairs. This requires you to find a partner to work with in the 7th week of the semester and register in one of the groups created for collaboration on Blackboard. After this you will have to use the group's discussion board, email and other tools to communicate, discuss, strategize, distribute work and share documents in order to produce the final assignment. This forum will be used to evaluate your contribution and the research work carried out for the assignment. Note that your activity is time stamped, hence this will be used as evidence of sustained, collaborative learning.

The first page of your report should include a half page summary of the activities of each partner through the second half of the semester, up to the submission of the assignment as illustrated on the Blackboard Discussion Board. This should have the title Contributions group name.

Task Specification
This assignment requires you to extend your data mining skills and knowledge from the structured context to the unstructured context, where the items to be classified are free text snippets. You are required to use a chosen text mining tool to train three different classification algorithms on the given dataset, analyse the results, and present a report of your findings.
Due Dates and Submission
This assignment is to be done in pairs. Only one person from the pair should submit the assignment. The report should clearly state the name and student ID of both members of the team. Furthermore, the contributions made by each team member must be clearly stated in the section Contributions group name at the beginning of the report.
The written part of your assignment is due on 22 October at midnight.
You are required to submit only an electronic copy of the assignment via the Turnitin assignment Submission tab (on the course homepage) on Blackboard. Only one member from the pair needs to submit the assignment.
Marking
This assignment will be marked out of 100 marks and is worth 50% of the overall mark for the paper.
To pass this module you must pass each assessment separately, and gain at least 50% in total. The minimum pass mark for this assignment is 40%.
Assignment Details
The objective of the assignment is to classify text into two categories. The tasks described are generic to text classification so that you are able to use R, Weka or Python to get the results. The class labs will all be done using Python, so if you want to use either R or Weka, you will have to learn the tools on your own.

Dataset

The data set to be used for this assignment is available from Blackboard as part of the assignment package. The data set is a large corpus of emails organised into 5 folders named enron1, enron2, enron3, enron4 and enron5. Each of these folders contains two folders named ham and spam containing emails belonging to each of the two categories. The package also contains two papers which give you a background on the dataset and examples of its use for text classification using Naive Bayes and Support Vector Machine. You will need to acknowledge the use of this dataset appropriately in your report.
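As an illustration only, the folder layout described above could be read into memory with a short Python helper like the one below. The function name load_corpus, the (folder, label, text) record shape and the latin-1 encoding are assumptions made for this sketch, not part of the assignment package.

```python
from pathlib import Path


def load_corpus(root):
    """Return a list of (folder, label, text) records for every email under root.

    Assumes the layout root/enronN/{ham,spam}/<email files>.
    """
    records = []
    for folder in sorted(Path(root).glob("enron[1-5]")):
        for label in ("ham", "spam"):
            for path in sorted((folder / label).glob("*")):
                if path.is_file():
                    # Assumption: latin-1 is used because it decodes any byte
                    # sequence without raising; the true encoding may differ.
                    text = path.read_text(encoding="latin-1")
                    records.append((folder.name, label, text))
    return records
```

Calling load_corpus on your working folder would give one record per email. Keeping the folder name in each record makes it possible to split the data by folder later.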

Assignment Tasks

Find a partner to work with and enrol in a group on Blackboard. This should be done in week 7 of the semester.

Download the zipped file containing the dataset from Blackboard under the Assignment 2 folder. Unzip it into a working folder which you will use for this assignment. The zipped file contains a total of 5 folders as described above. The files represent 5 sets of data consisting of emails classified into ham and spam.

Choose whether you want to use Python, R or Weka for this project.

The objective of this assignment is to compare the performance of three classification algorithms for the task of text classification. You will compare two given algorithms and a third one of your choice. The three algorithms are:
Naive Bayes
Neural Network
Your choice. Some examples are SVM, CRF, J45, etc.
Your task is to conclude which of the three algorithms is the best for text classification.
To do this you can use any combination of the pre-processing tasks in order to build features to be used for the machine learning algorithms. They don't need to be consistent across the algorithms.
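One possible combination of pre-processing tasks is sketched below: lower-casing, tokenising, stop-word removal and bag-of-words counting. This is a minimal illustration using only the Python standard library. The function names, the tiny stop-word list and the min_df parameter are assumptions for the example, and you may well prefer a library implementation.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (an assumption); a real experiment
# would use a fuller list or none at all.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}


def tokenize(text, remove_stop_words=True):
    """Lower-case the text and split it into alphabetic tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def build_vocabulary(token_lists, min_df=1):
    """Map each token found in at least min_df documents to a column index."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each token once per document
    kept = sorted(t for t, df in doc_freq.items() if df >= min_df)
    return {token: i for i, token in enumerate(kept)}


def vectorize(tokens, vocabulary):
    """Turn one token list into a term-count vector over the vocabulary."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokens).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector
```

The vocabulary should be built from the training emails only and then reused to vectorize the test emails, so that no information from the test set leaks into the features.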

To produce a valid conclusion, you should test by slicing the data for your experiments in the following two ways:
Conflate the data from the 5 folders and make them into one dataset. Then split the conflated dataset into a 70% training set and a 30% test set while maintaining the ham:spam ratio. Use these for training and testing for all the algorithms.

Use enron1, enron3 and enron5 for training and enron2 and enron4 for testing. Use these for training and testing for all the algorithms.
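The two ways of slicing the data could be sketched in Python as follows. The sketch assumes each email is held as a (folder, label, text) tuple, which is a convention chosen for this example. The function names and the fixed random seed are likewise assumptions.

```python
import random
from collections import defaultdict


def stratified_split(records, train_fraction=0.7, seed=0):
    """Split records so that each label keeps its share in both parts."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record[1]].append(record)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def folder_split(records,
                 train_folders=("enron1", "enron3", "enron5"),
                 test_folders=("enron2", "enron4")):
    """Split records by the folder they came from."""
    train = [r for r in records if r[0] in train_folders]
    test = [r for r in records if r[0] in test_folders]
    return train, test
```

Splitting each class separately is what preserves the ham:spam ratio in the first slicing. The second slicing ignores the ratio and depends only on the source folder.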

You should also experiment with various forms and combinations of features as covered in lectures and in your own online research. Your strategies and decisions should be backed by systematic testing, and hence by a rationale. This should be discussed within the group using the Group Discussion Board tools so that it can be evaluated.

You should report all performances in terms of Precision, Recall and F-values.
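For reference, the three measures can be computed from the true and predicted labels as in the minimal sketch below. Treating spam as the positive class is an assumption made for this example, as is the function name. The F-value shown is the balanced F1 score, the harmonic mean of precision and recall.

```python
def precision_recall_f1(true_labels, predicted_labels, positive="spam"):
    """Return (precision, recall, F1) for the chosen positive class."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive and truth == positive:
            tp += 1  # correctly flagged as positive
        elif predicted == positive:
            fp += 1  # flagged as positive but actually negative
        elif truth == positive:
            fn += 1  # positive email that was missed
    # Each ratio falls back to 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1
```

Reporting the measures for ham as well (by passing positive="ham") gives a fuller picture when the two classes are unbalanced.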

Written Report
You will write a report of a minimum of 6 and a maximum of 12 pages (excluding the references and appendix) describing the results of your experiment.
You are required to write a coherent report describing all aspects of the experiment as an attempt to prove or disprove the hypothesis. Any screenshots or large result outputs that do not directly contribute to your argument should be included in the appendix, rather than as part of the main report.
You are also required to submit well documented code as part of the appendix.
You are not required to have a table of contents or executive summary for this report.
There is no fixed format for the report. You can format it close to an academic paper containing the usual sections such as Abstract, Introduction, Data Description, Results, Discussion, Conclusion and a bibliography.
As a minimum your report should contain a discussion of the following points
A brief introductory discussion of applications of text classification.
A description of the dataset and its characteristics.
A discussion of the similarities and differences between the three classifiers that you are comparing, as applicable to text classification.
The differences in the manner in which classifiers are applied in a structured data scenario (such as what you did for the first assignment) and a non-structured text mining scenario.
Presentation and discussion of the results obtained. You should use the correct evaluation metrics in your discussion. This part of your write-up should include:
The effect of the variations of the dataset used.
Your perception of the possible rationale for doing the tasks.
A thorough discussion of the comparison of the results leading to the conclusion that answers the hypothesis.

A reflection on what you learnt from this assignment and what you would do differently if you were to do the assignment again.

Marking Scheme
The following approximate matrix will be used to grade your assignment.

Written Report

Formatting, Language and Presentation: 10%
Discussion to demonstrate an understanding of the experimental tasks in the context of text mining: 25%
Satisfactory completion of the tasks for the hypothesis: 25%
Discussion and presentation of the results leading to the conclusion: 30%
Use of collaboration to accomplish the task: 10%

**********************End of Assignment Specification**********************
