The goal of the assignment is to implement a system for image classification. In other words, this system should tell whether an object of a given class is present in an image. You will perform 5-class ({1: airplanes, 2: birds, 3: ships, 4: horses, 5: cars}) image classification based on the bag-of-words approach [1] using SIFT features. The STL-10 dataset [2] will be used for the task. For each class, the test sub-directories contain 800 images and the training sub-directories contain 500 images. Images are represented as RGB, 96x96 pixels.
Hint
In a real scenario, the public data you use often deviates from your task. You need to figure this out and re-arrange the labels as required, using stl10_input.py as a reference.
Download the dataset from http://ai.stanford.edu/~acoates/stl10/stl10_binary.tar.gz. There are five files: test_X.bin, test_y.bin, train_X.bin, train_y.bin and unlabeled_X.bin. For the project, you will only use the train and test partitions. Download the dataset and familiarize yourself with it by figuring out which images and labels you need for the aforementioned 5 classes. Note that you do not need the fold indices.
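As a rough sketch of the reading and re-labeling step (not the official stl10_input.py), the binary files can be loaded with NumPy and the labels remapped to the 5 classes used here. The STL-10 label ids assumed below (1 airplane, 2 bird, 3 car, 7 horse, 9 ship) should be checked against the dataset documentation and stl10_input.py.

import numpy as np

def read_stl10(images_path, labels_path):
    """Read STL-10 binary images (N x 96 x 96 x 3, uint8) and labels (1..10)."""
    with open(labels_path, 'rb') as f:
        labels = np.fromfile(f, dtype=np.uint8)
    with open(images_path, 'rb') as f:
        raw = np.fromfile(f, dtype=np.uint8)
    # Images are stored channel-first and column-major; reshape and transpose accordingly.
    images = raw.reshape(-1, 3, 96, 96).transpose(0, 3, 2, 1)
    return images, labels

# Assumed STL-10 label ids: 1 airplane, 2 bird, 3 car, 7 horse, 9 ship (verify!).
# Remap to the assignment's labels {1: airplanes, 2: birds, 3: ships, 4: horses, 5: cars}.
STL10_TO_TASK = {1: 1, 2: 2, 9: 3, 7: 4, 3: 5}

def select_five_classes(images, labels):
    """Keep only the five classes of interest and remap their labels."""
    mask = np.isin(labels, list(STL10_TO_TASK))
    remapped = np.array([STL10_TO_TASK[l] for l in labels[mask]])
    return images[mask], remapped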
1.1 Training Phase
Training must be conducted over the training set. Keep in mind that using more samples in training will likely result in better performance. However, if your computational resources are limited and/or your system is slow, it is OK to use fewer training samples to save time.
[1] http://www.robots.ox.ac.uk/~az/icvss08_az_bow.pdf
[2] https://cs.stanford.edu/~acoates/stl10/
1.2 Testing Phase
You have to test your system using the specified subset of test images. All 800 test images should be used at once for testing to observe the full performance. Again, exclude them from training for a fair comparison.
2 Bag-of-Words based Image Classification
A Bag-of-Words based image classification system consists of the following steps:
- Feature extraction and description
- Building a visual vocabulary
- Quantifying features using the visual dictionary (encoding)
- Representing images by frequencies of visual words
- Training the classifier
We will consider each step in detail.
2.1 Feature Extraction and Description
SIFT descriptors can be extracted from either (1) densely sampled regions or (2) key points. You can use the SIFT-related functions in OpenCV for feature extraction.
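A minimal sketch of both options using OpenCV (assuming a build where cv2.SIFT_create is available, e.g. opencv-python >= 4.4); the grid step and patch size for dense sampling are arbitrary choices here, not values prescribed by the assignment:

import cv2
import numpy as np

sift = cv2.SIFT_create()

def sift_from_keypoints(image_rgb):
    """Option (2): descriptors at detected key points."""
    gray = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2GRAY)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return descriptors  # shape (num_keypoints, 128), or None if nothing is found

def sift_dense(image_rgb, step=8, size=8.0):
    """Option (1): descriptors computed on a regular grid of 'keypoints'."""
    gray = cv2.cvtColor(image_rgb, cv2.COLOR_RGB2GRAY)
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(step, gray.shape[0] - step, step)
                 for x in range(step, gray.shape[1] - step, step)]
    keypoints, descriptors = sift.compute(gray, keypoints)
    return descriptors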
2.2 Building Visual Vocabulary
Here, we will obtain visual words by clustering feature descriptors, so that each cluster center is a visual word, as shown in Figure 1. Take a subset (at most half) of all training images (this subset should contain images from ALL categories), extract SIFT descriptors from all of these images, and run k-means clustering (you can use your favourite k-means implementation) on these SIFT descriptors to build the visual vocabulary. The remaining training images, which are not used for building the dictionary, will later be encoded with it and used to train the classifiers (see Section 2.5). If your computational resources are limited, you can also use fewer images, say 100 from each class (exclusive from the previous subset). The pre-defined number of clusters will be the size of your vocabulary. Set it to different sizes (500, 1000 and 2000).
Figure 1: An illustration of learning the visual dictionary. Note: (1) Code-words is another term for visual words. (2) The figure is from Josef Sivic; the SIFT feature space is used.
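One possible sketch of the clustering step, using scikit-learn's MiniBatchKMeans (any k-means implementation is acceptable per the text above); the vocabulary size of 500 is one of the required settings:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptor_list, vocab_size=500):
    """Stack SIFT descriptors from the dictionary subset and cluster them.

    descriptor_list: list of (n_i, 128) arrays, one per training image.
    Returns the fitted k-means model; its cluster centers are the visual words.
    """
    all_descriptors = np.vstack([d for d in descriptor_list if d is not None])
    kmeans = MiniBatchKMeans(n_clusters=vocab_size, random_state=0)
    kmeans.fit(all_descriptors)
    return kmeans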
2.3 Encoding Features Using Visual Vocabulary
Once we have a visual vocabulary, we can represent each image as a collection of visual words. For this purpose, we need to extract feature descriptors (SIFT) and then assign each descriptor to the closest visual word from the vocabulary.
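Assigning each descriptor to its closest visual word can be done with the fitted k-means model from the sketch above, or with an explicit nearest-neighbour search over the cluster centers, as in this small sketch:

import numpy as np
from scipy.spatial.distance import cdist

def encode_descriptors(descriptors, vocabulary):
    """Map each SIFT descriptor to the index of its closest visual word.

    descriptors: (n, 128) array; vocabulary: (vocab_size, 128) cluster centers.
    Returns an (n,) array of visual-word indices.
    """
    distances = cdist(descriptors, vocabulary)   # pairwise Euclidean distances
    return np.argmin(distances, axis=1)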
2.4 Representing Images by Frequencies of Visual Words
The next step is quantization. The idea is to represent each image by a histogram of its visual words; see Figure 2 for an overview. Check out matplotlib's hist function. Since different images can have different numbers of features, the histograms should be normalized.
Figure 2: Schematic representation of Bag-Of-Words system.
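A small sketch of the histogram representation (NumPy's bincount is used here instead of matplotlib's hist, which remains useful for plotting the result); L1 normalization is one reasonable choice:

import numpy as np

def bow_histogram(word_indices, vocab_size=500):
    """Build a normalized bag-of-words histogram for one image.

    word_indices: visual-word index of every descriptor in the image.
    """
    counts = np.bincount(word_indices, minlength=vocab_size).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts  # L1-normalize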
2.5 Classification
We will train one binary classifier per object class; here we take the Support Vector Machine (SVM) as an example, so we will have 5 binary classifiers. Take images from the training set of the related class (these should be the ones you did not use for dictionary calculation). Represent them with histograms of visual words as discussed in the previous section. Use at least 50 training images per class, or more, but remember to debug your code first! If you use the default setting, you should have 50 histograms of size 500. These will be your positive examples. Then, obtain histograms of visual words for images from the other classes, again about 50 images per class, as negative examples; this gives you 200 negative examples. Now you are ready to train a classifier. Repeat this for each class. To classify a new image, calculate its visual-word histogram as described in Section 2.4 and use the trained SVM classifiers to assign it to the most probable object class. (Note that for proper SVM scores you would need cross-validation to get a proper estimate of the SVM parameters; in this assignment, you do not have to experiment with this cross-validation step.)
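A sketch of the one-vs-rest training and scoring using scikit-learn's SVC (any SVM implementation would do; the linear kernel is an arbitrary default here, not a value prescribed by the assignment):

import numpy as np
from sklearn.svm import SVC

def train_binary_svms(histograms, labels, classes=(1, 2, 3, 4, 5)):
    """Train one binary SVM per class on bag-of-words histograms.

    histograms: (n_images, vocab_size) array; labels: (n_images,) class ids.
    Returns a dict mapping class id -> fitted classifier.
    """
    classifiers = {}
    for c in classes:
        binary_targets = (labels == c).astype(int)   # 1 = positive class, 0 = rest
        clf = SVC(kernel='linear')                   # simple default choice
        clf.fit(histograms, binary_targets)
        classifiers[c] = clf
    return classifiers

def classify(histogram, classifiers):
    """Assign a new image to the class whose SVM gives the highest score."""
    scores = {c: clf.decision_function(histogram.reshape(1, -1))[0]
              for c, clf in classifiers.items()}
    return max(scores, key=scores.get)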
2.6 Evaluation
To evaluate your system, you should take all the test images from all classes and rank them based on each binary classifier. In other words, you should classify each test image with each classifier and then sort the images by classification score. As a result, you will have five ranked lists of test images. Ideally, images with airplanes would be at the top of the list produced by your airplane classifier, images with cars at the top of the list produced by your car classifier, and so on.
In addition to the qualitative analysis, you should measure the performance of the system quantitatively with the Mean Average Precision over all classes. The Average Precision for a single class c is defined as

AP(c) = \frac{1}{m_c} \sum_{i=1}^{n} \frac{f_c(x_i)}{i},   (1)

where n is the number of images (n = 50 × 5 = 250), m_c is the number of images of class c (m_c = 50), x_i is the i-th image in the ranked list X = {x_1, x_2, ..., x_n}, and finally, f_c is a function which returns the number of images of class c in the first i images if x_i is of class c, and 0 otherwise. To illustrate, if we want to retrieve R and we get the sequence [R, R, T, R, T, T, R, T], then n = 8, m_c = 4, and AP = (1/4)(1/1 + 2/2 + 3/4 + 4/7) ≈ 0.83.
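A sketch of ranking the test images with each binary classifier and computing Average Precision exactly as in Eq. (1); the variable names are illustrative and reuse the classifiers from the previous sketch:

import numpy as np

def average_precision(ranked_labels, c):
    """Average Precision for class c, given labels sorted by classifier score (Eq. 1)."""
    ranked_labels = np.asarray(ranked_labels)
    m_c = np.sum(ranked_labels == c)
    hits = 0
    ap = 0.0
    for i, label in enumerate(ranked_labels, start=1):
        if label == c:
            hits += 1
            ap += hits / i      # f_c(x_i) / i, non-zero only when x_i is of class c
    return ap / m_c

def mean_average_precision(classifiers, test_histograms, test_labels):
    """Rank all test images per binary classifier and average the per-class APs."""
    test_labels = np.asarray(test_labels)
    aps = []
    for c, clf in classifiers.items():
        scores = clf.decision_function(test_histograms)
        order = np.argsort(-scores)                 # highest score first
        aps.append(average_precision(test_labels[order], c))
    return float(np.mean(aps))

On the example sequence [R, R, T, R, T, T, R, T], average_precision returns (1/4)(1/1 + 2/2 + 3/4 + 4/7) ≈ 0.83, matching Eq. (1).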