Task description. In the Optical Character Recognition (OCR) we seek to predict a number (between 0 to 9) for a given image of handwritten digit. In this assignment we simplify the OCR into a binary classification task. Specifically, we consider predictions for only two numbers 3 and 5. The goal in this assignment is to develop variations of the perceptron algorithm to classify handwritten digits of numbers 3 or 5.
Data. All the handwritten digits are originally taken from http://www.kaggle.com/c/digit-recognizer/data The original dataset contains the sample digits suitable for OCR. We extract samples with only labels 3 and 5. Following a little pre-processings we produce three datasets for this assignment as follows:
- Train Set (pa2 csv): Includes 4888 rows (samples). Each sample is in fact a list of 785 values. The first number is the digits label which is 3 or 5. The other 784 floating values are the the flattened gray-scale values from a 2d digital handwritten image with shape 28 28.
- Validation Set (pa2 valid.csv): Includes 1629 rows. Each row obeys the same format given for the train set. This set will be used to select your best trained model.
- Test Set (pa2 test.csv): Includes 1629 rows. Each row contains only 784 numbers. The label column is omitted from each row.
Important Guidelines. For all parts of this assignment:
- Please assign labels +1 to number 3 and -1 to label 5. In your produced predictions, please use only +1 and -1 as labels not 3 and 5.
- Please do not shuffle the given data. While in practice shuffling should be used to improve training convergence, for this assignment we ask you not to shuffle the data to ensure determinstic results for assessment purpose.
- To simplify the notation in this assignment, your load function which loads train, validation and test dataset should add a bias feature to the dataset. The bias feature is a feature with value of 1.0 for all of the samples. Therefore the feature size for each samples will become 785.
Part 1 (20 pts) : Online Perceptron. In the online perceptron algorithm we train a linear classifier with parameter w to predict the label of a sample with equation:
y = sign(wTx) (1)
Where y {1,1}. Algorithm 1 describes the online perceptron.
Algorithm 1 Online Perceptron
1: procedure OnlinePerceptron2: w0 0 | |
3: | t 0 |
4: | while iter < iters: |
5: | for all sample xt in train set: // no shuffling |
6: 7: | if ytut 0: |
8: | wt+1 wt + ytxt |
9: | t t + 1 |
Where xt is the sample at time step t and yt is its correct label. As we can see, the weight at time step t+1 is equivalent to below summation:
wt+1 = X yixi (2)
xiSt
where St is a list containing all the previous samples that have been incorrectly classified by the model (some example may appear multiple times). Therefore the prediction at time (t+1) can also be given by:
yt+1 = sign(( X yixi)Txt+1) (3)
xiSt
In this part we are interested in the following experiments for the online perceptron algorithm:
- Implement the online perceptron model with algorithm described in Algorithm 1. Set the iters = 15. During the training, at the end of each iteration use the current w to make prediction on the validation samples. Record the accuracies for the train and validation at the end of each iteration. Plot the recorded train and validation accuracies versus the iteration number.
- Does the train accuracy reach to 100%? Why?
- Use the validation accuracy to decide the test number for iters. Apply the resulting model to make predictions for the samples in the test set. Generate the prediction file oplabel.csv. Please note that
your file should only contain +1 (for 3) and -1 (for 5) and the number of rows should be the same as pa2 test.csv.
Part 2 (20 pts) : Average Perceptron. In this part we are interested to utilize average perceptron to deal with some issues regarding the online perceptron. Algorithm 2 describes the average perceptron.
Algorithm 2 Average Perceptron
1: procedure AveragePerceptron
2: w 0
3: c 0
4: w 0 // keeps running average weight
5: s 0 // keeps sum of cs 6: while iter < iters:
7: for all sample xt in the train set: // no shuffling
8: ut sign(wTxt) 9: if ytut 0:
10: if s + c > 0 :
11: w w
12: s s + c
13: ww + ytxt
14: c 0
15: else: c c + 1
16: if c > 0:
w
17:
As shown in the Algorithm 2, we compute a running average w which is used to predict the label of any sample xi as follows:
y(x) = sign(wTxi) (4)
We are interested in below experiments for average perceptron:
- Please implement the average perceptron described in Algorithm 2.
- Plot the train and validation accuracies versus the iteration number for iters = 1,,
- How average model has affected the validation accuracy comparing to the online perceptron?
Part 3 (40 pts). Polynomial Kernel Perceptron. The online/average perceptron in Algorithm( 1 and 2) are linear models. In order to increase the models complexity, one can project the feature vectors into a high (or even infinite) dimensions by applying a projection function . In this case the prediction at time (t+1) is given by:
yt+1 = sign((wt+1)T(xt+1)) Or equivalently by: | (5) |
yt+1 = sign(( X yi(xi))T(xt+1)) | (6) |
xiSt
Where St is a list containing all the previous sample vectors for which the model has a wrong predictions (there can be repetitions for a sample). Simplifying above equation yields:
yt+1 = sign( X yi(xt+1)T(xi))) (7)
xiSt
As we see in the equation 7, the prediction at each time steps utilizes a similarity terms i.e. (xt+1)T(xi) between the current projected sample (xt+1) and all the previous samples which we have incorrect predictions.
Lets define a kernel function with parameter vectors x and y to be k(x,y) = (x)T(y). Therefore the equation is simplified to:
yt+1 = sign( X yik(xt+1,xi)) (8)
xiSt
There are different kernel functions. For example to compute the similarity between vectors x and y in the polynomial space (with polynomial degree p) we utilize below kernel:
kp(x,y) = (1 + xTy)p (9)
For p = 1 we have a linear kernel. In this part we are interested in polynomial kernel perceptron as described in Algorithm 3:
Algorithm 3 Kernel (polynomial) Perceptron
1: procedure KernelPerceptron2: N : number of train samples,F : number of features | ||
3: | XNF train set | |
4: | N1 0 | |
5: | yN1 train labels | |
6: | kp(x,y) = (1 + xTy)p | |
7: | KNN(i,j) = kp(xi,xj) | xi,xj X |
8: 9: | while iter < iters: | |
10: | for all sample xi X: // no shuffling | |
11: | u = sign(Pj K[j,i][j]y[j]) | |
12: | if yiu 0: | |
13: | [i] [i] + 1 |
Please consider below experiments:
- Implement the polynomial kernel function kp in the Algorithm 3. This function takes two vectors x1 and x2 and an integer p for the polynomial degree, and returns a real value.
- Define a Gram matrix K with size N N where N is the number of training samples. Fill matrix K(i,j) = kp(xi,xj) for all of the pairs in the training set.
- Implement the rest of the kernel perceptron in Algorithm 3. For each p in [1, 2, 3, 7, 15]:
- Run the algorithm to compute .
- At the end of each iteration use the current to predict validation set.
- Record the train and validation accuracy for each iteration and plot the train and validation accuracies versus the iteration number.
- Record the best validation accuracy achieved for each p over all iterations.
- Plot the recorded best validation accuracies versus degrees. Please explain how p is affecting the train and validation performance.
- Use your best (the best you found over all d and iterations above) to predict the test data-set. Please name the predicted file as kplabel.csv.
Reviews
There are no reviews yet.