Lab 1
This is the first lab. It covers both classification and regression and is composed of four tasks, each consisting of several subtasks, described below. The lab will test your problem-solving skills using classification and regression techniques, as well as your ability to use different platforms to achieve similar outcomes.
What you should already know
KNIME/Python: If you have not worked with KNIME or Python before, it is important that you finish the tools exercise and do the second and third exercises at your earliest convenience.
What you will learn in this lab
This lab will introduce you to a number of new things:
Exploring and preprocessing datasets
Using aggregations and graphs to get to know your data
Solving classification and regression tasks using different machine learning techniques
Solving classification and regression tasks using different platforms
Using a methodological approach to evaluation
Classification
The data set can be found and retrieved here or here.
Introduction to the dataset
We will use the classic Titanic dataset. The data consists of demographic and travel information for 1,309 Titanic passengers, and the goal is to predict whether these passengers survived. The full Titanic dataset is available from the Department of Biostatistics at the Vanderbilt University School of Medicine. The Encyclopedia Titanica website (https://www.encyclopedia-titanica.org) is the reference website regarding the Titanic: it contains the facts, history, and data surrounding the Titanic, including a full list of passengers and crew members. The Titanic dataset is also the subject of the introductory competition on Kaggle.com (https://www.kaggle.com/c/titanic), which requires opening an account with Kaggle and does not contain target data for all instances.
The Titanic data contains a mix of textual, Boolean, continuous, and categorical variables. It exhibits interesting characteristics such as missing values, outliers, and text variables ripe for text mining: a rich dataset that will allow us to demonstrate data transformations.
Here's a brief summary of the 14 attributes:
pclass: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival: A Boolean indicating whether the passenger survived (0 = No; 1 = Yes); this is our target
name: A field rich in information, as it contains titles and family names
sex: male/female
age: Age; a significant portion of the values are missing
sibsp: Number of siblings/spouses aboard
parch: Number of parents/children aboard
ticket: Ticket number
fare: Passenger fare (in British pounds)
cabin: Does the location of the cabin influence the chances of survival?
embarked: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Lifeboat; many missing values
body: Body identification number
home.dest: Home/destination
Take a look at http://campus.lakeforest.edu/frank/FILES/MLFfiles/Bio150/Titanic/TitanicMETA.pdf for more details on these variables.
We have 1,309 records and 14 attributes, three of which we will discard: home.dest has too few existing values, boat is only present for passengers who survived, and body is only present for passengers who did not survive. You will have to begin by removing these attributes.
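In Python, a minimal sketch of this first step might look as follows (the file name titanic3.csv is a placeholder for wherever you saved the dataset):

```python
import pandas as pd

# Load the Titanic data; adjust the path to your local copy.
df = pd.read_csv("titanic3.csv")

# Discard the three attributes discussed above before any modelling.
df = df.drop(columns=["boat", "body", "home.dest"])
print(df.shape)  # expected: (1309, 11)
```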
Subtasks to be performed in KNIME and Python
Before solving the tasks, make sure you have done the following:
Read up on the data so that you understand the problem
Use the discussion forum on Kaggle for some further input
Look at the kernels at Kaggle for suggestions on how to get to know the data and solve part of the subtasks
Load all necessary data into KNIME/Python
Get to know your data using aggregations and graphs (see the sketch after this list)
You may use the Python notebooks available at Kaggle as your Python testbed.
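As a sketch of the aggregations-and-graphs step in Python, assuming the column names of the Vanderbilt file (pclass, sex, age, and survived as the target):

```python
import matplotlib.pyplot as plt

# Aggregation: survival rate per passenger class and sex.
print(df.groupby(["pclass", "sex"])["survived"].mean())

# Graph: compare the age distributions of survivors and non-survivors.
df.loc[df["survived"] == 1, "age"].plot.hist(alpha=0.5, label="survived")
df.loc[df["survived"] == 0, "age"].plot.hist(alpha=0.5, label="did not survive")
plt.xlabel("age")
plt.legend()
plt.show()
```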
Perform the following subtasks in both KNIME and Python and report the results (illustrative Python sketches for each subtask follow after the list):
1. Build a transparent classifier (e.g., a decision tree) using all data and identify the most important attributes.
a. List the top 5 attributes that you identify as most important
b. Motivate your selection
2. Evaluate three different kinds of classifiers and compare the results using both accuracy and AUC (Area Under the ROC Curve).
a. Use a proper evaluation methodology
b. Motivate the setup used for evaluation
c. Identify the most appropriate classifier for the problem
3. Optimize the parameters of the classifier identified in 2.c, using AUC as the optimization criterion.
4. Handle the class imbalance problem on the training set, train a classifier using the setup found in 3 with both the original and the manipulated data, and compare the performance using precision and recall.
a. Motivate which setup is most suitable for the task
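For subtask 1, a minimal scikit-learn sketch might look as follows; the feature selection, imputation choices, and tree depth are illustrative assumptions rather than a required setup:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Example feature preparation: one-hot encode categoricals, impute missing values.
features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
X = pd.get_dummies(df[features], columns=["sex", "embarked"])
X["age"] = X["age"].fillna(X["age"].median())
X["fare"] = X["fare"].fillna(X["fare"].median())
y = df["survived"]

# A shallow tree stays transparent while still ranking the attributes.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
for name, importance in sorted(zip(X.columns, tree.feature_importances_),
                               key=lambda t: t[1], reverse=True):
    print(f"{name}: {importance:.3f}")
```

Note that one-hot encoding splits an attribute over several columns, so aggregate the importances per original attribute before listing your top 5.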
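For subtask 2, one possible setup is stratified 10-fold cross-validation scored on both accuracy and AUC; the three classifier families below are examples, not a prescribed choice:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Stratified folds keep the class ratio constant across all folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "roc_auc"])
    print(f"{name}: accuracy={scores['test_accuracy'].mean():.3f}, "
          f"AUC={scores['test_roc_auc'].mean():.3f}")
```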
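For subtask 3, GridSearchCV in scikit-learn can optimize parameters with AUC as the scoring criterion; the random forest and its parameter grid below are placeholders for the classifier you actually identified in 2.c:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical grid; replace with the parameters of your chosen classifier.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 5, 10]},
    scoring="roc_auc",
    cv=10,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```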
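For subtask 4, one simple way to manipulate the training set is random oversampling of the minority class (libraries such as imbalanced-learn offer more refined methods); a sketch, reusing X and y from above:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Oversample the minority class on the training set only; the test set stays untouched.
train = pd.concat([X_train, y_train], axis=1)
majority = train[train["survived"] == 0]
minority = train[train["survived"] == 1]
balanced = pd.concat(
    [majority, minority.sample(len(majority), replace=True, random_state=0)])

for label, data in [("original", train), ("oversampled", balanced)]:
    clf = RandomForestClassifier(random_state=0)  # placeholder for your setup from 3
    clf.fit(data.drop(columns="survived"), data["survived"])
    pred = clf.predict(X_test)
    print(f"{label}: precision={precision_score(y_test, pred):.3f}, "
          f"recall={recall_score(y_test, pred):.3f}")
```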
Regression
Datasets
In this part, you will practice algorithm evaluation on a larger scale. When discussing the evaluation and comparison of classifiers, there are three major questions:
How should the future error rate (i.e., on novel data) of a specific classifier be estimated using only results on available data?
What performance can we expect on new data?
How should the results of two classifiers (or two different algorithms) be compared against each other on a specific data set?
What algorithm works best on my problem?
How should the results of several classifiers or algorithms be compared against each other over several data sets?
Valuable for research and method development purposes
You will use datasets from the Delve repository. Only use datasets whose task type is set to R, i.e., only regression datasets, for this part. You can also use the set of datasets made available on Canvas.
For more information on evaluation and statistical comparisons, read the paper by Demšar (2006).
Subtasks to be performed in KNIME and Python
Perform the following subtasks and report the results (illustrative Python sketches for each subtask follow after the list):
1. Select a dataset from the repository, select a suitable algorithm to evaluate, and use the holdout method to estimate the future performance.
a. What performance can be expected on your problem?
b. How confident can you be that your estimate is close to the true performance?
c. How does the size of the training/test sets affect the reliability?
2. Select a dataset from the repository, select a suitable algorithm to evaluate, and use cross-validation to estimate the future performance.
a. What performance can be expected on your problem?
b. How confident can you be that your estimate is close to the true performance?
c. Which result is most reliable, the result from 1 (using the holdout method) or 2 (using cross-validation)? Why?
3. Select a dataset from the repository and compare the performance of two different algorithms on the same dataset.
a. Which algorithm works best?
b. Is the difference significant in a statistical sense?
4. Select at least 10 different datasets from the repository and compare the performance of at least two different algorithms on all 10 datasets.
a. Which algorithm works best?
b. Is the difference significant in a statistical sense?
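For subtask 1, a minimal holdout sketch, assuming your chosen Delve dataset has already been loaded into a feature matrix X and target vector y (linear regression and the 70/30 split are illustrative choices):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Holdout: a single split into training and test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)
mse = mean_squared_error(y_test, model.predict(X_test))
print(f"estimated future MSE: {mse:.3f}")
```

Rerunning with different test_size and random_state values shows how the split size affects the variability of the estimate.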
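For subtask 2, the same estimate via 10-fold cross-validation:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Every instance is used for testing exactly once across the ten folds.
mse_scores = -cross_val_score(LinearRegression(), X, y, cv=10,
                              scoring="neg_mean_squared_error")
print(f"mean MSE: {mse_scores.mean():.3f} (std: {mse_scores.std():.3f})")
```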
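For subtask 3, one common approach (though see Demšar 2006 for its caveats) is a paired t-test over per-fold scores obtained by running both algorithms on identical folds; the two algorithms below are illustrative:

```python
from scipy.stats import ttest_rel
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Identical folds for both algorithms make the per-fold scores paired.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
mse_a = -cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
mse_b = -cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
stat, p = ttest_rel(mse_a, mse_b)
print(f"mean MSE: {mse_a.mean():.3f} vs {mse_b.mean():.3f}, p={p:.3f}")
```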
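For subtask 4, Demšar (2006) recommends the Wilcoxon signed-rank test for comparing two algorithms over multiple datasets; the per-dataset scores below are hypothetical placeholders for your measured results:

```python
from scipy.stats import wilcoxon

# Hypothetical per-dataset MSE for two algorithms on ten datasets;
# replace with the results you actually measured.
mse_a = [1.2, 0.8, 3.1, 2.4, 0.9, 1.7, 2.2, 0.5, 1.1, 2.9]
mse_b = [1.4, 0.9, 2.8, 2.9, 1.0, 1.6, 2.5, 0.7, 1.3, 3.2]

stat, p = wilcoxon(mse_a, mse_b)
print(f"Wilcoxon statistic={stat}, p={p:.3f}")
```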
Submission
Upload your KNIME solutions as well as your Python solutions for both the classification and the regression tasks. Submit a PM with your answers and motivations for the questions asked above. If your answer is the same for the KNIME and Python solutions, you do not have to comment on both; if they differ, reflect on how and why.
Reference
Demšar, Janez. "Statistical comparisons of classifiers over multiple data sets." Journal of Machine Learning Research 7 (Jan 2006): 1-30.