This project aims to familiarize you with the data mining concepts learnt in the course and to give you some insight into a topic of interest. You will use available software such as Weka (Java-based, http://www.cs.waikato.ac.nz/ml/weka/) and others to analyze datasets obtained from public databases.
Another aim is to promote teamwork and self-learning. Working in a team pays big dividends: it is less stressful, and peer support makes learning much easier. Furthermore, in line with the NTU vision of "Teach less, learn more", you are encouraged to take the basic skills and principles and couple them with self-reading to handle real-world problems.
- GENERAL GUIDELINES:
- This project is a required part of the course. It shall account for the main coursework component in your final grade.
- This project is to be completed in a group of at most six (6) students. A supportive and conducive environment within each group is beneficial; hence you are free to choose your group members. While the work must genuinely belong to your group, you are free to have discussions with me and, most importantly, with your classmates. Use of the discussion board in NTULEARN is strongly encouraged. Bonus marks will be awarded to students who have taken a contributing role in the course discussions, who have been interactive, and who have demonstrated generosity in helping others understand.
This is a special thanks to these outstanding individuals and an encouragement to others.
- Project Topics for Data Analytics and Mining
Project Objective
Students are expected to practise hands-on skills in performing a real-world data analytics task from the beginning (data pre-processing: data collection, cleaning, etc.) to the final stage (data post-processing: evaluation, presentation, etc.).
You are encouraged to test your approach using existing tools such as Weka, R, a Matlab toolbox, etc. After familiarizing yourself with the basic steps and seeing the results, you are encouraged to write your own scripts, which you may submit together with your report. If you do not write any scripts, you can provide the detailed steps and commands used to derive your answers.
Before You Start
Plenty of tools are available for data mining tasks; they use artificial intelligence, machine learning and other techniques to extract information from data. It is recommended that you choose one of the many freely available tools for data analysis. A simple internet search will return many lists of popular open-source data mining tools, for example:
- https://thenewstack.io/six-of-the-best-open-source-data-mining-tools/
- https://www.softwaretestinghelp.com/data-mining-tools/
As a start, you may try the Data Mining and Predictive Analytics training course, which uses the open-source Weka tool. The tool is sophisticated and used in many different applications; it includes visualization tools and algorithms for data analysis and predictive modeling. It is free under the GNU General Public License. The videos are produced by the University of Waikato, New Zealand, whose members also authored the book: Ian H. Witten, Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2005.
Weka Predictive Analytics Tutorial
https://www.youtube.com/watch?v=Fg2x_zM3YTo&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7
Pay attention to the limitations of free and online tools in handling datasets. These will affect your selection and increase the challenge and complexity: algorithms will run very slowly on a large dataset, while tasks on a very small dataset may not reflect your capability.
IV. Suggested Sources
The datasets for your consideration are listed in the Excel spreadsheet.
As groups may have chosen different data mining tools and datasets, you may consult my TAs, who have been assigned to each dataset. The TAs will be able to help you with generic problems; data-specific and problem-specific issues may need time to resolve, so please be kind to my TAs.
Reminders
- You are NOT allowed to COPY code or reports directly from others or from the Internet (unless specified for special cases). Any case of plagiarism will be severely punished!
V. Suggested Approaches
What do you do when you have a large dataset?
Using sampling to select a subset of the data reduces the amount of data to be analyzed. Obtaining the entire set of data of interest is often too expensive, and analyzing it too time-consuming. And if you find that your approach does not work, you will not have wasted too much time.
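As an illustration of the sampling idea above, here is a minimal sketch using only the Python standard library. The function names (`reservoir_sample`, `stratified_sample`), the `label_of` callback and the toy records are assumptions made for this example, not part of any assigned tool. Reservoir sampling draws a uniform subset from a file or stream too large to load at once; stratified sampling keeps each class at roughly the same proportion as in the full data.

```python
import random
from collections import defaultdict


def reservoir_sample(stream, k, seed=0):
    """Uniformly sample up to k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample the same fraction from every class (at least one row per class)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical toy data: (record id, class label), with a 10% minority class.
records = [(i, "pos" if i % 10 == 0 else "neg") for i in range(1000)]
subset = stratified_sample(records, label_of=lambda r: r[1], fraction=0.1)
print(len(subset), "rows kept out of", len(records))
```

Fixing the seed makes the subset reproducible, which helps when you report results; for a real project you would normally read the rows from your dataset file rather than build them in memory.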
Preparation
When you have a large dataset, you could search for recently published articles that cite or relate to it. This helps you gather information such as the problems commonly posed on this dataset, the evaluation metrics, the state-of-the-art performance, comparable benchmarks, and possible solutions.
Data mining
After you have finished analyzing your chosen dataset, you can also consider other related datasets to find more interesting relations.
Result evaluation
During result evaluation, besides the commonly used performance tables or curves (e.g. an accuracy curve), consider more intuitive visualizations, such as scatter plots of the predictions and the input data. For high-dimensional datasets, this can be achieved with dimensionality reduction techniques such as PCA and t-SNE.
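To make the PCA idea concrete, the following is a minimal sketch in pure Python that projects data onto its first two principal components, giving 2-D coordinates suitable for a scatter plot. It is an illustrative assumption, not the method of any particular tool: the helper names (`center`, `covariance`, `top_eigenvector`, `pca`) are made up for this example, and it finds eigenvectors of the covariance matrix by power iteration with deflation, which is adequate for small feature counts. In practice you would more likely rely on the PCA or t-SNE implementation in your chosen tool.

```python
import math
import random


def center(rows):
    """Subtract the per-column mean so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[a] * row[b] for row in centered) / (n - 1)
             for b in range(d)] for a in range(d)]


def mat_vec(matrix, v):
    """Multiply a square matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]


def top_eigenvector(matrix, iterations=500, seed=0):
    """Power iteration: (eigenvalue, unit eigenvector) of largest magnitude."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # zero matrix: no direction carries variance
            break
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v


def pca(rows, n_components=2):
    """Return (components, projected rows) for the leading components."""
    centered = center(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = top_eigenvector(cov)
        components.append(v)
        # Deflation: remove the variance already explained by v.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    projected = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
                 for row in centered]
    return components, projected


# Hypothetical 3-feature data: two correlated features plus a small third one.
data = [[float(t), 2.0 * t, 0.5 * (t % 2)] for t in range(20)]
components, points_2d = pca(data, n_components=2)
print(points_2d[:3])  # 2-D coordinates, ready for a scatter plot
```

Each row of `points_2d` can be plotted as one point, coloured by its true or predicted class, to see whether the classes separate. Unlike PCA, t-SNE is non-linear and its output depends on parameters such as perplexity, so its plots should be read qualitatively.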