In this homework, we will examine the task of scene recognition, starting with very simple methods (tiny images and nearest neighbor classification) and then moving on to more advanced methods (bags of quantized local features and linear classifiers learned by support vector machines).
Bag of words models are a popular technique for image classification inspired by models used in natural language processing. The model ignores or downplays word arrangement (spatial information in the image) and classifies based on a histogram of the frequency of visual words. The visual word vocabulary is established by clustering a large corpus of local features. See Szeliski chapter 14.4.1 for more details on category recognition with quantized features. In addition, 14.3.2 discusses vocabulary creation and 14.1 covers classification techniques.
For this homework you will be implementing a basic bag of words model. You will classify scenes into one of 16 categories by training and testing on the 16 scene database (introduced in Lazebnik et al. 2006 (http://www.di.ens.fr/willow/pdfs/cvpr06b.pdf), although built on top of previously published datasets). Lazebnik et al. 2006 is a great paper to read, although we will be implementing only the baseline method it discusses (equivalent to the zero-level pyramid) and not the more sophisticated spatial pyramid. For an excellent survey of pre-deep-learning feature encoding methods for bag of words models, see Chatfield et al. 2011 (http://www.robots.ox.ac.uk/~vgg/research/encoding_eval/).
You are required to implement 2 different image representations (tiny images and bags of SIFT features) and 2 different classification techniques (nearest neighbor and linear SVM). There are 3 problems plus a performance report in this homework, for a total of 100 points. 1 bonus question worth an extra 10 points is provided under problem 3, so the maximum you may earn is 100 + 10 = 110 points. Be sure to read the Submission Guidelines below; they are important.
Dataset
The starter code trains on 150 and tests on 50 images from each category (i.e. 2400 training examples total and 800 test cases total). In a real research paper, one would be expected to test performance on random splits of the data into training and test sets, but the starter code does not do this to ease debugging.
Save the dataset (https://drive.google.com/drive/folders/1NWC3TMsXSWN2TeoYMCjhf2N1b-WRDhM?usp=sharing) into your working folder in your Google Drive for this homework.
Under your root folder, there should be a folder named data (i.e. XXX/Surname_Givenname_SBUID/data) containing the images. Due to the size limit, do not include the data subfolder when submitting on Blackboard.
There should be only one .ipynb file under your root folder Surname_Givenname_SBUID.
Starter Code
To make your task a little easier, below we provide some starter code which randomly guesses the category of every test image and achieves about 6.25% accuracy (1 out of 16 guesses is correct).
Problem 1: Tiny Image Representation + Nearest Neighbor Classifier
{25 points} You will start by implementing the tiny image representation and the nearest neighbor classifier. They are easy to understand, easy to implement, and run very quickly for our experimental setup.
The tiny image feature is one of the simplest possible image representations. One simply resizes each image to a small, fixed resolution. You are required to resize the image to 16×16. It works slightly better if the tiny image is made to have zero mean and unit length (normalization). This is not a particularly good representation, because it discards all of the high-frequency image content and is not especially invariant to spatial or brightness shifts. We are using tiny images simply as a baseline.
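To make the normalization concrete, here is a minimal sketch of the tiny image feature (the function name and grayscale-input assumption are illustrative, not part of the starter code):

```python
import cv2
import numpy as np

def tiny_image_feature(img, size=16):
    """Resize a grayscale image to size x size, then normalize."""
    tiny = cv2.resize(img, (size, size)).astype(np.float32)
    vec = tiny.flatten()
    vec -= vec.mean()               # zero mean: removes brightness offset
    norm = np.linalg.norm(vec)
    if norm > 0:
        vec /= norm                 # unit length: removes contrast scale
    return vec                      # shape: (size * size,)
```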
The nearest neighbor classifier is equally simple to understand. When tasked with classifying a test feature into a particular category, one simply finds the nearest training example (L2 distance is a sufficient metric) and assigns the label of that nearest training example to the test example. The nearest neighbor classifier has many desirable features: it requires no training, it can learn arbitrarily complex decision boundaries, and it trivially supports multiclass problems. It is quite vulnerable to training noise, though, which can be alleviated by voting based on the K nearest neighbors (but you are not required to do so). Nearest neighbor classifiers also suffer as the feature dimensionality increases, because the classifier has no mechanism to learn which dimensions are irrelevant to the decision.
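A minimal sketch of 1-nearest-neighbor classification with L2 distance is below; the array names are assumptions, and in practice sklearn's KNeighborsClassifier does the same thing with less code:

```python
import numpy as np

def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    """train_feats: (N, D), train_labels: (N,), test_feats: (M, D)."""
    preds = []
    for f in test_feats:
        # L2 distance from this test feature to every training feature
        dists = np.linalg.norm(train_feats - f, axis=1)
        # Assign the label of the closest training example
        preds.append(train_labels[np.argmin(dists)])
    return np.array(preds)
```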
Report your classification accuracy on the test set and the time consumed.
Problem 2: Bag of SIFT Representation + Nearest Neighbor Classifier
{35 points} After you have implemented a baseline scene recognition pipeline, it is time to move on to a more sophisticated image representation: bags of quantized SIFT features. Before we can represent our training and testing images as bag of feature histograms, we first need to establish a vocabulary of visual words. We will form this vocabulary by sampling many local features from our training set (tens or hundreds of thousands) and then clustering them with k-means. The number of k-means clusters is the size of our vocabulary and the dimensionality of our features. For example, you might start by clustering many SIFT descriptors into k=50 clusters. This partitions the continuous, 128-dimensional SIFT feature space into 50 regions. For any new SIFT feature we observe, we can figure out which region it belongs to as long as we save the centroids of our original clusters. Those centroids are our visual word vocabulary. Because it can be slow to sample and cluster many local features, the starter code saves the cluster centroids and avoids recomputing them on future runs.
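A sketch of the vocabulary-building step, assuming the sampled descriptors have already been stacked into a single (N, 128) array (the names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(all_descriptors, vocab_size=50, seed=0):
    """Cluster sampled SIFT descriptors; the centroids are the vocabulary."""
    kmeans = KMeans(n_clusters=vocab_size, random_state=seed)
    kmeans.fit(all_descriptors)
    # Save these (e.g. with np.save) so they are not recomputed on later runs
    return kmeans.cluster_centers_   # shape: (vocab_size, 128)
```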
Now we are ready to represent our training and testing images as histograms of visual words. For each image we will densely sample many SIFT descriptors. Instead of storing hundreds of SIFT descriptors, we simply count how many SIFT descriptors fall into each cluster in our visual word vocabulary. This is done by finding the nearest neighbor k-means centroid for every SIFT feature. Thus, if we have a vocabulary of 50 visual words, and we detect 220 distinct SIFT features in an image, our bag of SIFT representation will be a histogram of 50 dimensions where each bin counts how many times a SIFT descriptor was assigned to that cluster. The total of all the bin-counts is 220. The histogram should be normalized so that image size does not dramatically change the bag of features magnitude.
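The encoding step might look like the following sketch, assuming `vocab` holds the saved centroids from the clustering step above and `descriptors` is an (M, 128) array from one image:

```python
import numpy as np
from scipy.spatial.distance import cdist

def bag_of_sift_histogram(descriptors, vocab):
    """Quantize descriptors against the vocabulary; return a normalized histogram."""
    # Assign each descriptor to its nearest centroid (visual word)
    assignments = np.argmin(cdist(descriptors, vocab), axis=1)
    hist = np.bincount(assignments, minlength=vocab.shape[0]).astype(np.float32)
    # L1-normalize so the feature magnitude does not scale with image size
    return hist / max(hist.sum(), 1.0)
```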
After you obtain the Bag of SIFT feature representation of the images, you have to train a KNN classifier in the Bag of SIFT feature space and report your test set accuracy and time consumption.
Note:
Instead of using SIFT to detect invariant keypoints, which is time-consuming, you are recommended to densely sample keypoints in a grid with a certain step size (sampling density) and scale.
There are many design decisions and free parameters for the bag of SIFT representation (number of clusters, sampling density, sampling scales, SIFT parameters, etc.) so accuracy might vary from 50% to 60%.
Indicate clearly the parameters you use, along with the prediction accuracy on the test set and the time consumption.
Hints:
Use KMeans in Sklearn (http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) to do clustering and find the nearest cluster centroid for each SIFT feature;
Use cv2.xfeatures2d.SIFT_create() to create a SIFT object;
Use sift.compute() to compute SIFT descriptors given densely sampled keypoints (cv2.KeyPoint (https://docs.opencv.org/3.0-beta/modules/core/doc/basic_structures.html?highlight=keypoint#keypoint)); a sketch of this dense sampling appears after these hints.
Be mindful of RAM usage. Try to make the code memory efficient; otherwise it could easily exceed the RAM limit in Colab, at which point your session will crash.
If you are running low on RAM, call gc.collect() (https://docs.python.org/3/library/gc.html) to have the garbage collector free memory held by unused objects.
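Below is a sketch of how the dense sampling and descriptor computation might fit together, assuming an OpenCV contrib build that provides cv2.xfeatures2d (recent OpenCV versions expose cv2.SIFT_create() instead); the step size and scale are placeholders to tune, not recommendations:

```python
import cv2

def dense_sift(gray_img, step=25, scale=10):
    """Compute SIFT descriptors on a regular grid of keypoints."""
    sift = cv2.xfeatures2d.SIFT_create()
    h, w = gray_img.shape[:2]
    # Build a grid of keypoints with the given step size and scale
    keypoints = [cv2.KeyPoint(float(x), float(y), scale)
                 for y in range(0, h, step)
                 for x in range(0, w, step)]
    # compute() returns (keypoints, descriptors); we only need the latter
    _, descriptors = sift.compute(gray_img, keypoints)
    return descriptors   # shape: (num_keypoints, 128)
```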
Problem 3.a: Bag of SIFT Representation + one-vs-all SVMs
{15 points} The last task is to train one-vs-all linear SVMs to operate in the bag of SIFT feature space. Linear classifiers are one of the simplest possible learning models. The feature space is partitioned by a learned hyperplane, and test cases are categorized based on which side of that hyperplane they fall on. Despite this model being far less expressive than the nearest neighbor classifier, it will often perform better.
You do not have to implement the support vector machine yourself. However, linear classifiers are inherently binary and we have a 16-way classification problem (the library handles this for you). To decide which of the 16 categories a test case belongs to, you will train 16 binary, one-vs-all SVMs. One-vs-all means that each classifier is trained to recognize forest vs. non-forest, kitchen vs. non-kitchen, etc. All 16 classifiers are evaluated on each test case, and the classifier which is most confidently positive wins. E.g. if the kitchen classifier returns a score of -0.2 (where 0 is on the decision boundary), the forest classifier returns a score of -0.3, and all of the other classifiers are even more negative, the test case would be classified as a kitchen even though none of the classifiers put the test case on the positive side of the decision boundary. When learning an SVM, you have a free parameter lambda which controls how strongly regularized the model is. Your accuracy will be very sensitive to lambda, so be sure to try many values.
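A sketch of this one-vs-all decision rule, assuming bag-of-SIFT features and labels in NumPy arrays; note that sklearn's C parameter is the inverse of the regularization strength lambda discussed above, so sweep it over several orders of magnitude:

```python
import numpy as np
from sklearn.svm import LinearSVC

def one_vs_all_svm_predict(X_train, y_train, X_test, C=1.0):
    """Train one binary SVM per class; the most confident classifier wins."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        clf = LinearSVC(C=C)
        clf.fit(X_train, y_train == c)          # binary: class c vs. the rest
        scores.append(clf.decision_function(X_test))
    # Most confidently positive score wins, even if all scores are negative
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```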
Indicate clearly the parameters you use, along with the prediction accuracy on the test set and the time consumption.
Bonus {10 points}: For this question, you need to generate class predictions for the images in the test2 folder using your best model. The prediction file (Surname_Givenname_SBUID_Pred.txt) should follow the exact format given in the sample.txt file. 10 points will be given to the students whose accuracy ranks in the top 3 in this homework.
Hints:
Use SVM in Sklearn (http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm) (recommended) or OpenCV (https://docs.opencv.org/3.0alpha/modules/ml/doc/support_vector_machines.html) to do training and prediction.
Problem 3.b
{5 points} Repeat the evaluation above for different sizes of the training set and draw a plot showing how training set size affects test performance. Do this for training set sizes of 800, 1200, 1600, 2000, 2200, and 2300 images. Randomly sample the images from the original training set and evaluate accuracy. Repeat this process 10 times for each training set size and report the average prediction accuracy. How does performance variability change with training set size? How does performance change? Give reasons for your observations. A sketch of this evaluation protocol appears below.
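In the sketch below, `train_fn` and `eval_fn` are hypothetical stand-ins for whatever training and scoring routines you already have; only the subsampling loop is the point:

```python
import numpy as np

def size_sweep(X_train, y_train, X_test, y_test, train_fn, eval_fn,
               sizes=(800, 1200, 1600, 2000, 2200, 2300), repeats=10):
    """Mean and std of test accuracy over random training subsets."""
    rng = np.random.default_rng(0)
    means, stds = [], []
    for n in sizes:
        accs = []
        for _ in range(repeats):
            # Sample n training examples without replacement
            idx = rng.choice(len(X_train), size=n, replace=False)
            model = train_fn(X_train[idx], y_train[idx])
            accs.append(eval_fn(model, X_test, y_test))
        means.append(np.mean(accs))
        stds.append(np.std(accs))
    return means, stds   # plot mean vs. size, e.g. with plt.errorbar
```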
Here, I shuffle the training data and select a subset to train the classifier, then evaluate on the test data. Out of 2400 training examples, the subset size starts at 800, which is 1/3 of the data; this undoubtedly underperforms the full training set. The graph shows that accuracy is almost linearly correlated with training set size, although the linear trend can break at the higher end. Looking at the standard deviation at each size, the standard deviation decreases as the training set size increases. This makes sense: a small, randomly sampled subset gives more variable accuracy on unseen data than a larger one.
Performance Report
{20 points} Please report the performance of the following combinations, in the given order, in terms of the time consumed and classification accuracy. Describe your algorithm, any decisions you made to write your algorithm in your particular way, and how different choices you made affect it. Compute and draw a (normalized) confusion matrix (https://en.wikipedia.org/wiki/Confusion_matrix), and discuss where the method performs best and worst for each combination; a sketch of computing and plotting a normalized confusion matrix follows the list below. Here is an example (http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py) of how to compute a confusion matrix.
1st: Tiny images representation and nearest neighbor classifier (accuracy of about 18-25%).
2nd: Bag of SIFT representation and nearest neighbor classifier (accuracy of about 40-50%).
3rd: Bag of SIFT representation and linear SVM classifier (accuracy of about 50-70%).
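A sketch of a row-normalized confusion matrix with sklearn and matplotlib (variable names are assumptions; each row sums to 1, i.e. per-class recall):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_normalized_confusion(y_true, y_pred, class_names):
    """Draw a confusion matrix where each true-class row is normalized."""
    cm = confusion_matrix(y_true, y_pred).astype(np.float64)
    cm /= cm.sum(axis=1, keepdims=True)   # normalize each true-class row
    plt.imshow(cm, interpolation='nearest', cmap='Blues')
    plt.xticks(range(len(class_names)), class_names, rotation=90)
    plt.yticks(range(len(class_names)), class_names)
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.colorbar()
    plt.tight_layout()
    plt.show()
```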
Algorithm | Data Processing Time (s) | Train/Test Time (s) | Max Accuracy
1st       | 0.1127                   | 0.5204              | 0.2275
2nd       | 705                      | 0.2811              | 0.48
3rd       | 705                      | 1.063               | 0.5025
1st: Tiny images representation and nearest neighbor classifier. To resize the images, I used the INTER_AREA interpolation method, which resamples using pixel area relation. In terms of information loss and appearance, I think this interpolation is better than the other methods, at the cost of processing time. I resized each image to (16, 16) and then trained a KNeighborsClassifier (n_neighbors = 1). For the case of n_neighbors > 1, I used distance as the weight function, so that closer neighbors have a greater influence than neighbors farther away. This performed according to expectation, with 0.2275 accuracy.
2nd: Bag of SIFT representation and nearest neighbor classifier. To compute descriptors for each image, I provide densely sampled keypoints (step size = 25, scale = 10). The descriptors are then quantized with KMeans (n_clusters = 64) to obtain a normalized histogram per image. After shuffling, the histograms are given as training data, along with the true labels, to KNeighborsClassifier (n_neighbors = 50, weights = 'distance'). The model achieves 0.48 accuracy, a significant improvement over problem 1 (0.2275 → 0.48).
3rd: Bag of SIFT representation and linear SVM classifier. SVMs! In image classification it is usually assumed that SVMs perform better than NN classifiers on some kinds of datasets, so I used the same data from problem 2 to test this assumption. Multiclass classification with linear SVMs is handled by sklearn.svm.SVC. The parameter I could tune was the regularization value (the lambda above, exposed as C in sklearn); failing that, I could have tried feature vectors of different sizes, which SVMs handle conveniently. Maximum accuracy with svm.SVC(C = 10, kernel = 'linear', decision_function_shape = 'ovr') was 0.50. I also experimented with the rbf kernel with gamma = 'scale', which performs slightly better than the linear SVM.