Individual Project: PCA + kNN Approach in Supervised Handwritten Digit Recognition
Data: A file, project.RData, is provided. It contains the following three R objects:
trainD : A 2000 x 784 matrix with each row storing the 28 x 28 pixels of a handwritten digit. Hereafter, we call a matrix with this structure a digit data matrix.
TDigit : A vector of size 2000 with the i-th element being the digit corresponding to the i-th row of trainD. Hereafter, we call a vector with this structure a digit vector.
printDigit(v,d=NA) : A function to print a digit image. Argument v is a vector of length 784 for a handwritten digit, and d is the digit for the vector v (default is NA). For example, printDigit(trainD[3,],TDigit[3]) displays the digit image in the 3rd row of trainD.
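The actual printDigit implementation ships inside project.RData. As an illustration only, the following sketch (printDigitSketch is a hypothetical name, not part of the project) shows one plausible way a 784-element pixel vector could be rendered as a 28 x 28 image in base R:

```r
# Hypothetical sketch only: the real printDigit is supplied in project.RData.
# Reshape the 784 pixels into a 28 x 28 matrix and draw it with image();
# the exact orientation depends on the storage convention of the pixels.
printDigitSketch <- function(v, d = NA) {
  m <- matrix(v, nrow = 28, ncol = 28)      # column-wise reshape of the pixels
  image(t(m[28:1, ]), col = grey(seq(1, 0, length.out = 256)),
        axes = FALSE, main = if (is.na(d)) "" else paste("Digit:", d))
}
```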
Deliverables: one PDF report and one R file containing two functions.
(a) A PDF file of the report: not more than five A4-size pages excluding appendices. It is recommended to put tables, figures, and program listings in appendices.
(b) One R file containing two functions, Prepare and Classify as specified below (sample Prepare and Classify functions are given in Appendix C). No global variable can be used in the functions.
(i) Prepare(trainData,DigitV)
Input: (i) trainData: a digit data matrix; (ii) DigitV: the corresponding digit vector
Output: A list containing all necessary information to be used in the Classify function.
(ii) Classify(QueryData,OutPre)
Input: (i) QueryData: a digit data matrix to be classified; (ii) OutPre: the output of the Prepare function.
Output: A vector containing the estimated digits of the query data in QueryData.
Each estimated digit is one of the following digits (-1, 0, 1, 2, ..., 9), with the digit -1 meaning unknown and to be classified manually.
Restrictions on Methods:
1. Principal components analysis: (a) Free to perform any kind of transformation before principal components analysis; (b) Must use function prcomp for principal components analysis; (c) Free to determine the number of principal components chosen; (d) Can use prcomp a number of times.
2. Classifier: (a) You can only use the simplest form of k-nearest neighbor algorithm (kNN) to classify handwritten digits where k can be any positive integer (the classifier used in Section 3.8.2 is 1-nearest neighbor classifier; see Appendix A for a brief introduction to kNN); (b) The input of the kNN algorithm can be the principal components or any transformed form of the principal components; (c) You can use function knn in class package for k-nearest neighbor classification.
3. Cross-validation: (a) You are recommended to use cross-validation to assess performance of several candidate classifiers and choose the best one as your final method; (b) You can use the simplest form of cross-validation which is used in Section 3.8.2, or use k-fold cross-validation (see Appendix B for a brief introduction; see Appendix D for a sample program).
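To make the restrictions above concrete, here is a minimal sketch of a PCA + kNN pipeline satisfying them; the data are synthetic stand-ins (train, labels, query, and npc are illustrative names), and this is not the required Prepare/Classify interface:

```r
# Sketch only: project synthetic "digit" rows onto a few principal
# components, then classify query rows with class::knn.
library(class)

set.seed(1)
train  <- matrix(rnorm(100 * 20), nrow = 100)     # stand-in digit data matrix
labels <- factor(sample(0:9, 100, replace = TRUE))
query  <- matrix(rnorm(10 * 20), nrow = 10)

p   <- prcomp(train)            # restriction 1(b): must use prcomp
npc <- 5                        # restriction 1(c): number of PCs is free
trainPC <- p$x[, 1:npc]
# Centre the query rows with the training means, then project them.
queryPC <- scale(query, center = p$center, scale = FALSE) %*% p$rotation[, 1:npc]

est <- knn(trainPC, queryPC, labels, k = 3)       # restriction 2(c): class::knn
```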
Appendix A: k-NN algorithm:
Step 1: Select a positive integer k.
Step 2: Find k nearest neighbors of a query data.
Step 3: Find the categories of the k neighbors. Assign the query data to the category of the majority of its k neighbors.
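The three steps above can be sketched in base R for a single query vector as follows (simpleKNN and its arguments are illustrative names, not part of the project; ties in the vote are broken by taking the first category):

```r
# Minimal kNN sketch: classify one query vector q against the rows of train,
# whose categories are given in cl.
simpleKNN <- function(train, cl, q, k) {
  # Squared Euclidean distance from q to every training row.
  d  <- rowSums((train - matrix(q, nrow(train), ncol(train), byrow = TRUE))^2)
  nb <- order(d)[1:k]             # Step 2: indices of the k nearest neighbours
  tab <- table(cl[nb])            # Step 3: categories of the k neighbours
  names(tab)[which.max(tab)]      # majority category (first one on a tie)
}
```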
Appendix B: k-fold cross-validation:
Step 1: Divide the available dataset of size n randomly into k roughly equal groups, say A1, ..., Ak.
Step 2: For i = 1, ..., k, do {
Use Ai as the test data and combine the remaining k - 1 groups to form the training data. Use the training data to build a classifier. Apply the classifier to classify the test data. Compute ai, the number of correct classifications. }
Step 3: The estimated correct classification rate is (a1 + a2 + ... + ak)/n.
Appendix C: Sample Prepare and Classify functions
Prepare <- function(trainData, DigitV) {
  # If needed, enter library command(s) here.
  d <- prcomp(trainData)
  list(mu = d$center, u = d$rotation[, 1:30], y = d$x[, 1:30],
       Digit = DigitV, epsilon = 25e5)
}

Classify <- function(QueryData, OutPre) {
  # If needed, enter library command(s) here.
  m <- dim(QueryData)[1]
  r <- numeric(m)
  for (i in 1:m) {
    w <- t(OutPre$u) %*% (QueryData[i, ] - OutPre$mu)
    minD <- Inf
    for (j in 1:(dim(OutPre$y)[1])) {
      dist <- sum((w - OutPre$y[j, ])^2)
      if (dist < minD) { r[i] <- OutPre$Digit[j]; minD <- dist }
    }
    if (minD > OutPre$epsilon) r[i] <- -1
  }
  r
}

Appendix D: Sample k-fold cross-validation program

CValidate <- function(dataSet, TDigit, k) {
  # Perform k-fold cross-validation for the provided
  # Prepare and Classify functions.
  n <- dim(dataSet)[1]
  b <- sample(rep(1:k, length = n))
  TrueDigit <- EstDigit <- NULL
  for (i in 1:k) {
    train <- dataSet[b != i, ]  # training data
    test  <- dataSet[b == i, ]  # test data
    v <- Prepare(train, TDigit[b != i])
    r <- Classify(test, v)
    TrueDigit <- c(TrueDigit, TDigit[b == i])
    EstDigit  <- c(EstDigit, r)
  }
  print(table(`True digit` = TrueDigit, `Estimated digit` = EstDigit))
}

Assessment Scheme: The performance of the Prepare and Classify functions will be evaluated using 1000 test images. The grade is determined by the following four factors:

(1) Correct classification rate for 1000 query data (40%): Rate = [(number of correctly classified digits) + 0.5 x (number of unknown digits)]/1000. Fraction of mark obtained is max([(r - 0.9)/(MaxR - 0.9)] x 40%, 0), where r is the rate of the provided classifier and MaxR is the best rate in the whole class.

(2) Economy in storage (30%): Storage used is the size of the output of Prepare. Fraction of mark is max([(120000 - s)/(120000 - MinS)] x 30%, 0), where s is the storage used by the provided classifier and MinS is the minimum storage used in the whole class.

(3) Elegance of method (20%)

(4) Report writing (10%)
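As a worked example of the marking formula in factor (1), suppose a hypothetical submission classifies 940 digits correctly and returns 20 unknowns, and the best rate in the class is MaxR = 0.97 (all numbers here are made up for illustration):

```r
# Made-up numbers: 940 correct, 20 unknown, hypothetical best rate MaxR = 0.97.
correct <- 940
unknown <- 20
rate <- (correct + 0.5 * unknown) / 1000                 # = 0.95
markFraction <- max((rate - 0.9) / (0.97 - 0.9), 0) * 0.40
```

With these numbers the submission earns (0.05/0.07) x 40%, i.e. about 28.6% of the 40% available for this factor.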