Figure 1: Histogram of oriented gradients. The HOG feature is extracted and visualized for
(a) the entire image and (b) a zoomed-in region. The orientation and magnitude of the red
lines represent the gradient components in a local cell.
In this assignment, you will implement a variant of HOG (Histogram of Oriented Gradients) in MATLAB, proposed by Dalal and Triggs [1] (winner of the 2015 Longuet-Higgins Prize). Until the rise of deep learning, it was the long-standing top representation for the object
detection task, combined with a deformable part model and an SVM classifier [2].
Given an input image, your algorithm will compute the HOG feature and visualize it as
shown in Figure 1 (the line directions are perpendicular to the gradient to show edge
alignment). The orientation and magnitude of the red lines represent the gradient
components in a local cell.
function [hog] = HOG(im)
Input: input gray-scale image with uint8 format.
Output: HOG descriptor.
You will compute the HOG descriptor of input image im. The pseudocode can be found below:
Algorithm 1 HOG
1: Convert the gray-scale image to double format.
2: Get differential images using GetDifferentialFilter and FilterImage
3: Compute the gradients using GetGradient
4: Build the histogram of oriented gradients for all cells using BuildHistogram
5: Build the descriptor of all blocks with normalization using GetBlockDescriptor
6: Return a long vector (hog) by concatenating all block descriptors.
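For reference, a minimal sketch of the top-level pipeline following Algorithm 1 is shown below; cell_size = 8 and block_size = 2 are the typical values mentioned later in this handout, not fixed requirements.

function [hog] = HOG(im)
    im = im2double(im);                                       % step 1: uint8 -> double
    [filter_x, filter_y] = GetDifferentialFilter();           % step 2
    im_dx = FilterImage(im, filter_x);
    im_dy = FilterImage(im, filter_y);
    [grad_mag, grad_angle] = GetGradient(im_dx, im_dy);       % step 3
    ori_histo = BuildHistogram(grad_mag, grad_angle, 8);      % step 4 (cell_size = 8)
    ori_histo_normalized = GetBlockDescriptor(ori_histo, 2);  % step 5 (block_size = 2)
    hog = ori_histo_normalized(:);                            % step 6: one long vector
end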
2.1 Image filtering
Figure 2: (a) Input image dimension. (b-c) Differential image along x and y directions.
function [filter_x, filter_y] = GetDifferentialFilter()
Input: none.
Output: filter_x and filter_y are 3×3 filters that differentiate along x and y directions, respectively.
You will compute the gradient by differentiating the image along x and
y directions. This code will output the differential filters.
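One possible choice is a Sobel-like pair, sketched below; any 3×3 filters that differentiate along x and y are acceptable, and the sign convention does not matter for the unsigned gradients used later.

function [filter_x, filter_y] = GetDifferentialFilter()
    filter_x = [-1 0 1; -2 0 2; -1 0 1];   % responds to horizontal intensity change
    filter_y = filter_x';                  % responds to vertical intensity change
end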
function [im_filtered] = FilterImage(im, filter)
Input: im is the gray-scale m × n image (Figure 2(a)) converted to double (refer to the
im2double built-in function); filter is a k × k filter matrix.
Output: im_filtered is the m × n filtered image. You may need to pad zeros on the
boundary of the input image to get a filtered image of the same size.
Description: Given an image and filter, you will compute the filtered image. Given the
two functions above, you can generate differential images by visualizing the magnitude
of the filter response as shown in Figure 2(b) and 2(c).
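A minimal correlation-style sketch with explicit zero padding is shown below; conv2(im, filter, 'same') is a compact alternative if built-ins are allowed.

function [im_filtered] = FilterImage(im, filter)
    [m, n] = size(im);
    k = size(filter, 1);
    pad = floor(k / 2);
    im_pad = zeros(m + 2*pad, n + 2*pad);      % zero padding on the boundary
    im_pad(pad+1:pad+m, pad+1:pad+n) = im;
    im_filtered = zeros(m, n);
    for i = 1:m
        for j = 1:n
            patch = im_pad(i:i+k-1, j:j+k-1);
            im_filtered(i, j) = sum(sum(patch .* filter));   % correlation with the filter
        end
    end
end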
(a) Magnitude (b) Angle (c) Gradient (d) Zoomed eye (e) Zoomed neck
Figure 3: Visualization of (a) the magnitude and (b) the orientation of the image gradients. (c-e)
Visualization of the gradients at every 3rd pixel (the magnitudes are re-scaled for illustrative
purposes).
function [grad_mag, grad_angle] = GetGradient(im_dx, im_dy)
Input: im_dx and im_dy are the x and y differential images (size: m × n).
Output: grad_mag and grad_angle are the magnitude and orientation of the gradient
images (size: m × n). Note that the range of the angle should be [0, π), i.e., the unsigned
angle (θ ≡ θ + π).
Description: Given the differential images, you will compute the magnitude and angle
of the gradient. Using the gradients, you can visualize and get a feel for the image: the magnitude of the gradient is proportional to the contrast (edge strength) of the local patch, and the orientation is perpendicular to the edge direction, as shown in Figure 3.
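A minimal sketch is shown below; mod wraps the signed atan2 angle into the unsigned range [0, π).

function [grad_mag, grad_angle] = GetGradient(im_dx, im_dy)
    grad_mag = sqrt(im_dx.^2 + im_dy.^2);
    grad_angle = atan2(im_dy, im_dx);      % signed angle in (-pi, pi]
    grad_angle = mod(grad_angle, pi);      % wrap to the unsigned range [0, pi)
end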
2.3 Orientation Binning
(a) ori_histo (b) Histogram per cell
Figure 4: (a) The histogram of oriented gradients is built by (b) binning the gradients
into the corresponding orientation bins.
function ori_histo = BuildHistogram(grad_mag, grad_angle, cell_size)
Input: grad_mag and grad_angle are the magnitude and orientation of the gradient
images (size: m × n); cell_size is the size of each cell, which is a positive integer.
Output: ori_histo is a 3D tensor with size M × N × 6 where M and N are the
number of cells along the y and x axes, respectively, i.e., M = ⌊m/cell_size⌋ and N =
⌊n/cell_size⌋ where ⌊·⌋ is the floor operation, as shown in Figure 4(a).
Description: Given the magnitude and orientation of the gradients per pixel, you can
build the histogram of oriented gradients for each cell.
ori_histo(i, j, k) = Σ_{(u,v) ∈ C_{i,j}} grad_mag(u, v) · 1[grad_angle(u, v) ∈ θ_k]    (1)

where C_{i,j} is the set of x and y coordinates within the (i, j) cell, θ_k is the angle
range of each bin, i.e., θ_1 = [165°, 180°) ∪ [0°, 15°), θ_2 = [15°, 45°), θ_3 = [45°, 75°),
θ_4 = [75°, 105°), θ_5 = [105°, 135°), and θ_6 = [135°, 165°), and 1[·] is the indicator
function. Therefore, ori_histo(i,j,:)
returns the histogram of the oriented gradients at the (i, j) cell as shown in Figure 4(b).
Using the ori_histo, you can visualize the HOG per cell, where the magnitude of each line
is proportional to the histogram value, as shown in Figure 1. A typical cell_size is 8.
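A minimal sketch of Eq. (1) is shown below; the bin assignment maps the unsigned angle so that bin 1 is centered at 0/180 degrees, matching the bin ranges above.

function ori_histo = BuildHistogram(grad_mag, grad_angle, cell_size)
    [m, n] = size(grad_mag);
    M = floor(m / cell_size);
    N = floor(n / cell_size);
    ori_histo = zeros(M, N, 6);
    for u = 1:M*cell_size            % u indexes rows (y) here
        for v = 1:N*cell_size        % v indexes columns (x)
            theta = grad_angle(u, v) * 180 / pi;       % degrees in [0, 180)
            k = floor(mod(theta + 15, 180) / 30) + 1;  % [165,15) -> bin 1, [15,45) -> bin 2, ...
            i = ceil(u / cell_size);
            j = ceil(v / cell_size);
            ori_histo(i, j, k) = ori_histo(i, j, k) + grad_mag(u, v);
        end
    end
end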
(a) Block descriptor: a 2×2 block of cells whose HOG is concatenated and normalized (b) Block overlap with stride 1
Figure 5: The HOG is normalized to account for illumination and contrast to form a descriptor
for a block. (a) The HOG within the (1,1) block is concatenated and normalized to form a long
vector of size 24. (b) This applies to the remaining blocks with overlap and stride 1 to form
the normalized HOG.
function ori_histo_normalized = GetBlockDescriptor(ori_histo, block_size)
Input: ori_histo is the histogram of oriented gradients without normalization. block_size
is the size of each block (e.g., the number of cells in each row/column), which is a positive integer.
Output: ori_histo_normalized is the normalized histogram (size: (M − (block_size − 1)) × (N − (block_size − 1)) × (6 × block_size^2)).
Description: To account for changes in illumination and contrast, the gradient strengths
must be locally normalized, which requires grouping the cells together into larger, spatially connected blocks (adjacent cells). Given the histogram of oriented gradients, you
apply L2 normalization as follows:
1. Build a descriptor of the first block by concatenating the HOG within the block.
You can use block_size=2, i.e., 2 × 2 block will contain 2 × 2 × 6 entries that
will be concatenated to form one long vector as shown in Figure 5(a).
2. Normalize the descriptor as follows:

ĥ_i = h_i / √(Σ_i h_i² + e²)    (2)

where h_i is the i-th element of the histogram, ĥ_i is the normalized histogram, and
e is a normalization constant that prevents division by zero (e.g., e = 0.001).
3. Assign the normalized histogram to ori_histo_normalized(1,1) (white dot location in Figure 5(a)).
4. Move to the next block ori_histo_normalized(1,2) with stride 1 and repeat
steps 1-3 above.
The resulting ori_histo_normalized will have the size of (M − 1) × (N − 1) × 24.
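A minimal sketch of the block normalization with stride 1 is shown below.

function ori_histo_normalized = GetBlockDescriptor(ori_histo, block_size)
    [M, N, ~] = size(ori_histo);
    e = 0.001;                                       % normalization constant of Eq. (2)
    Mb = M - (block_size - 1);
    Nb = N - (block_size - 1);
    ori_histo_normalized = zeros(Mb, Nb, 6 * block_size^2);
    for i = 1:Mb
        for j = 1:Nb
            block = ori_histo(i:i+block_size-1, j:j+block_size-1, :);
            h = block(:);                            % concatenate the HOG within the block
            h = h / sqrt(sum(h.^2) + e^2);           % L2 normalization, Eq. (2)
            ori_histo_normalized(i, j, :) = reshape(h, 1, 1, []);
        end
    end
end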
References
[1] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in
CVPR, 2005.
[2] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” TPAMI, 2010.
(a) Image (b) SIFT
Figure 1: Given an image (a), you will extract SIFT features using VLFeat.
One of the key skills to learn in computer vision is the ability to use other open-source
code, which allows you to avoid re-inventing the wheel. We will use VLFeat by A. Vedaldi
and B. Fulkerson (2008) for SIFT extraction on your images. Install VLFeat from the
following:
https://www.vlfeat.org/install-matlab.html
Run vl_demo_sift_basic to double-check that the installation is complete.
(Note) You will use this library only for SIFT feature extraction and its visualization.
All following visualizations and algorithms must be done by your code. Using VLFeat,
you can extract keypoints and associated descriptors as shown in Figure 1.
(SIFT visualization) Use VLFeat to visualize SIFT features with scale and orientation
as shown in Figure 1. You may want to follow the following tutorial:
https://www.vlfeat.org/overview/sift.html
(a) Template (b) Target (c) SIFT matches with ratio test
Figure 2: You will match points between the template and target image using SIFT
features.
The SIFT feature is composed of a scale, an orientation, and a 128-dimensional local feature descriptor
(integer), f ∈ Z^128. You will use the SIFT features to match between two images, I1
and I2. Using the two sets of descriptors from the template and target, find the matches
using nearest-neighbor search with the ratio test. You may use the knnsearch built-in function
in MATLAB.
function [x1, x2] = FindMatch(I1, I2)
Input: two input gray-scale images with uint8 format.
Output: x1 and x2 are n × 2 matrices that specify the correspondence.
Description: Each row of x1 and x2 contains the (x, y) coordinate of the point correspondence in I1 and I2, respectively, i.e., x1(i,:) ↔ x2(i,:).
(Note) You can only use VLFeat for the SIFT descriptor extraction. Matching with the
ratio test needs to be implemented by yourself.
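A minimal sketch of the matching with the ratio test is shown below, assuming vl_sift for extraction and knnsearch for the search; the 0.7 ratio threshold is an assumed value.

function [x1, x2] = FindMatch(I1, I2)
    [f1, d1] = vl_sift(single(I1));   % f: 4 x n keypoints, d: 128 x n descriptors
    [f2, d2] = vl_sift(single(I2));
    % For each descriptor in I1, find its two nearest neighbors in I2.
    [idx, dist] = knnsearch(double(d2'), double(d1'), 'K', 2);
    ratio = dist(:, 1) ./ dist(:, 2);
    keep = ratio < 0.7;               % ratio test
    x1 = f1(1:2, keep)';
    x2 = f2(1:2, idx(keep, 1))';
end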
4 Feature-based Image Alignment
Figure 3: You will compute an affine transform using SIFT matches filtered by
RANSAC. Blue: outliers; Orange: inliers; Red: the boundary of the transformed template.
(Note) From this point, you cannot use any function provided by VLFeat.
The noisy SIFT matches can be filtered by RANSAC with an affine transformation as
shown in Figure 3.
function [A] = AlignImageUsingFeature(x1, x2, ransac_thr, ransac_iter)
Input: x1 and x2 are the correspondence sets (n × 2 matrices). ransac_thr and
ransac_iter are the error threshold and the number of iterations for RANSAC.
Output: 3 × 3 affine transformation.
Description: The affine transform will transform x1 to x2, i.e., x2 = Ax1. You may
visualize the inliers and the boundary of the transformed template to validate your
implementation.
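A minimal RANSAC sketch is shown below; the 3-point minimal affine solver is a standard choice, and the inlier test uses the Euclidean error of the transformed points against ransac_thr.

function [A] = AlignImageUsingFeature(x1, x2, ransac_thr, ransac_iter)
    n = size(x1, 1);
    best_inliers = 0;
    A = eye(3);
    for iter = 1:ransac_iter
        s = randperm(n, 3);                          % 3 correspondences define an affine map
        A_s = EstimateAffine(x1(s, :), x2(s, :));
        x2_hat = [x1 ones(n, 1)] * A_s(1:2, :)';     % transform all points of x1
        err = sqrt(sum((x2_hat - x2).^2, 2));
        inliers = sum(err < ransac_thr);
        if inliers > best_inliers
            best_inliers = inliers;
            A = A_s;
        end
    end
end

function A = EstimateAffine(p, q)
    % Solve a linear system for the 6 affine parameters mapping p to q.
    n = size(p, 1);
    M = zeros(2 * n, 6); b = zeros(2 * n, 1);
    for i = 1:n
        M(2*i-1, :) = [p(i, 1) p(i, 2) 1 0 0 0];
        M(2*i,   :) = [0 0 0 p(i, 1) p(i, 2) 1];
        b(2*i-1) = q(i, 1); b(2*i) = q(i, 2);
    end
    a = M \ b;
    A = [a(1) a(2) a(3); a(4) a(5) a(6); 0 0 1];
end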
(a) Image (b) Warped image
(c) Template (d) Error map
Figure 4: You will use the affine transform to warp the target image to the template
using the inverse mapping. Using the warped image, the error map |Itpl − Iwrp| can be
computed to validate the correctness of the transformation where Itpl and Iwrp are the
template and warped images.
Given an affine transform A, you will write a code to warp an image I(x) → I(Ax).
function [I_warped] = WarpImage(I, A, output_size)
Input: I is an image to warp, A is the affine transformation from the original coordinate
to the warped coordinate, output_size=[h,w] is the size of the warped image where
w and h are the width and height of the warped image.
Output: I_warped is the warped image with the size of output_size.
Description: The inverse mapping method needs to be applied to make sure the
warped image does not contain empty pixels. You are allowed to use the interp2 built-in
function in MATLAB for bilinear interpolation.
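A minimal inverse-mapping sketch with interp2 is shown below; it follows the I(x) → I(Ax) convention of the warping description (each output pixel samples I at A times its coordinate), so pass inv(A) instead if your transform is given in the opposite direction.

function [I_warped] = WarpImage(I, A, output_size)
    h = output_size(1); w = output_size(2);
    [X, Y] = meshgrid(1:w, 1:h);                     % pixel grid of the output image
    p = A * [X(:)'; Y(:)'; ones(1, h * w)];          % map each output pixel into I
    src_x = reshape(p(1, :), h, w);
    src_y = reshape(p(2, :), h, w);
    I_warped = interp2(double(I), src_x, src_y, 'linear', 0);  % bilinear, 0 outside the image
end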
(Validation) Using the warped image, the error map |Itpl − Iwrp| can be computed to
validate the correctness of the transformation where Itpl and Iwrp are the template and
warped images.
(a) Template (b) Initialization (c) Aligned image
Figure 5: You will use the initial estimate of the affine transform to align (i.e., track)
next image. (a) Template image from the first frame image. (b) The second frame
image with the initialization of the affine transform. (c) The second frame image with
the optimized affine transform using the inverse compositional image alignment.
Given the initial estimate of the affine transform A from the feature based image alignment (Section 4) as shown in Figure 5(b), you will track the next frame image using the
inverse compositional method (Figure 5(c)). You will parametrize the affine transform
with 6 parameters p = (p1, p2, p3, p4, p5, p6), i.e.,

W(x; p) = [p1+1  p2  p3; p4  p5+1  p6; 0  0  1] [u; v; 1] = A(p)x    (1)

where W(x; p) is the warping function from the template patch to the target image,
x = [u v 1]^T is the coordinate of the point before warping, and A(p) is the affine transform
parametrized by p.
function [A_refined] = AlignImage(template, target, A)
Input: gray-scale template template and target image target; the initialization of
3×3 affine transform A, i.e., xtgt =Axtpl where xtgt and xtpl are points in the target and
template images, respectively.
Output: A_refined is the refined affine transform based on inverse compositional image alignment
Description: You will refine the affine transform using inverse compositional image
alignment, i.e., A→A_refined. The pseudo-code can be found in Algorithm 1.
Tip: You can validate your algorithm by visualizing the error map as shown in Figure 6(d) and 6(h). You can also plot the error over iterations; the error
must decrease as shown in Figure 6(i).
(a) Template (b) Initial warp (c) Overlay (d) Error map
(e) Template (f) Opt. warp (g) Overlay (h) Error map
(i) Error ||Itpl − Itgt|| over iterations
Figure 6: (a,e) Template images of the first frame. (b) Warped image based on the initialization of the affine parameters. (c) Template image is overlaid by the initialization.
(d) Error map of the initialization. (f) Optimized warped image using the inverse compositional image alignment. (g) Template image is overlaid by the optimized warped
image. (h) Error map of the optimization. (i) An error plot over iterations.
Algorithm 1 Inverse Compositional Image Alignment
1: Initialize p = p0 from the input A.
2: Compute the gradient of the template image, ∇Itpl.
3: Compute the Jacobian ∂W/∂p at (x; 0).
4: Compute the steepest descent images ∇Itpl ∂W/∂p.
5: Compute the 6 × 6 Hessian H = Σ_x [∇Itpl ∂W/∂p]^T [∇Itpl ∂W/∂p].
6: while ||∆p|| > ε do
7: Warp the target to the template domain Itgt(W(x; p)).
8: Compute the error image Ierr = Itgt(W(x; p)) − Itpl.
9: Compute F = Σ_x [∇Itpl ∂W/∂p]^T Ierr.
10: Compute ∆p = H^(−1) F.
11: Update W(x; p) ← W(x; p) ◦ W(x; ∆p) = W(W(x; ∆p); p).
12: end while
13: Return A_refined constructed from p.
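For reference, a minimal sketch of the pre-computation (steps 2-5) is shown below, assuming the 6-parameter warp of Eq. (1); at p = 0 the 2×6 Jacobian of W at pixel (u, v) is [u v 1 0 0 0; 0 0 0 u v 1].

[tpl_dx, tpl_dy] = gradient(double(template));   % step 2: template gradient
[h, w] = size(template);
SD = zeros(h * w, 6);                            % step 4: steepest descent images (one row per pixel)
row = 0;
for v = 1:h
    for u = 1:w
        row = row + 1;
        dWdp = [u v 1 0 0 0; 0 0 0 u v 1];       % step 3: Jacobian at (u, v), p = 0
        SD(row, :) = [tpl_dx(v, u) tpl_dy(v, u)] * dWdp;
    end
end
H = SD' * SD;                                    % step 5: 6 x 6 Hessian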
7 Putting Things Together: Multiframe Tracking
(a) Frame 1 (b) Frame 2
(c) Frame 3 (d) Frame 4
Figure 7: You will use the inverse compositional image alignment to track 4 frames of
images.
Given a template and a set of consecutive images, you will (1) initialize the affine
transform using the feature based alignment and then (2) track over frames using the
inverse compositional image alignment.
function [A_cell] = TrackMultiFrames(template, image_cell)
Input: template is the gray-scale template. image_cell is a cell structure that contains
a set of consecutive image frames, i.e., image_cell{i} is the i-th frame.
Output: A_cell is the set of affine transforms from the template to each frame,
i.e., A_cell{i} is the affine transform from the template to the i-th image.
Description: You will apply the inverse compositional image alignment sequentially
to track over frames as shown in Figure 7. Note that the template image needs to be
updated at every frame, i.e., template←WarpImage(I, inv(A), size(template)).
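A minimal sketch of the tracking loop is shown below; the RANSAC threshold and iteration count passed to AlignImageUsingFeature are assumed values.

function [A_cell] = TrackMultiFrames(template, image_cell)
    n = numel(image_cell);
    A_cell = cell(1, n);
    [x1, x2] = FindMatch(template, image_cell{1});
    A = AlignImageUsingFeature(x1, x2, 10, 1000);       % feature-based initialization
    for i = 1:n
        A = AlignImage(template, image_cell{i}, A);     % inverse compositional refinement
        A_cell{i} = A;
        template = WarpImage(image_cell{i}, inv(A), size(template));  % update the template
    end
end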
• Individual assignment
• 2 page summary write-up with resulting visualization (more than 2 page assignment will be automatically returned).
• Submission through Canvas.
• List of submission codes:
– GetTinyImage.m
– PredictKNN.m
– ClassifyKNN_Tiny.m
– BuildVisualDictionary.m
– ComputeBoW.m
– ClassifyKNN_BoW.m
– PredictSVM.m
– ClassifySVM_BoW.m
• DO NOT SUBMIT THE PROVIDED IMAGE DATA
• The function that does not comply with its specification will not be graded.
• You are allowed to use MATLAB built-in functions except for the ones in the
Computer Vision Toolbox. Please consult with TA if you are not sure about the
list of allowed functions.
Figure 1: You will design a visual recognition system to classify the scene categories.
The goal of this assignment is to build a set of visual recognition systems that classify
the scene categories. The scene classification dataset consists of 15 scene categories
including office, kitchen, and forest as shown in Figure 1 [1]. The system will compute
a set of image representations (tiny image and bag-of-word visual vocabulary) and
predict the category of each testing image using the classifiers (k-nearest neighbor and
SVM) built on the training data. A simple pseudo-code of the recognition system can
be found below:
Algorithm 1 Scene Recognition
1: Load training and testing images
2: Build image representation
3: Train a classifier using the representations of the training images
4: Classify the testing data.
5: Compute accuracy of testing data classification.
For the kNN classifier, steps 3 and 4 can be combined.
You can download the training and testing data from here:
https://www.cs.umn.edu/~hspark/csci5561/scene_classification_data.zip
The data folder includes two text files (train.txt and test.txt) and two folders
(train and test). Each row in a text file specifies an image and its label, i.e.,
(label) (image path). The text files can be used to load the images. Each folder
includes 15 classes (Kitchen, Store, Bedroom, LivingRoom, Office, Industrial, Suburb,
InsideCity, TallBuilding, Street, Highway, OpenCountry, Coast, Mountain, Forest) of
scene images.
Similar to HW #2, you will use VLFeat (https://www.vlfeat.org/install-matlab.
html). You are allowed to use the following two functions: vl_dsift and vl_svmtrain.
5 Tiny Image KNN Classification
(a) Image (b) Tiny Image
Figure 2: You will use tiny image representation to get an image feature.
function [feature] = GetTinyImage(I, output_size)
Input: I is a gray-scale image, output_size=[w,h] is the size of the tiny image.
Output: feature is the tiny image representation obtained by vectorizing the pixel intensities in
column-major order. The resulting size will be w×h.
Description: You will simply resize each image to a small, fixed resolution (e.g.,
16×16). You need to normalize the image to have zero mean and unit length. This
is not a particularly good representation, because it discards all of the high-frequency
image content and is not especially invariant to spatial or brightness shifts.
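A minimal sketch is shown below, assuming imresize is allowed (it is not part of the Computer Vision Toolbox).

function [feature] = GetTinyImage(I, output_size)
    tiny = imresize(double(I), [output_size(2) output_size(1)]);  % imresize takes [h, w]
    feature = tiny(:)';                        % column-major vectorization (row feature)
    feature = feature - mean(feature);         % zero mean
    feature = feature / norm(feature);         % unit length
end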
function [label_test_pred] = PredictKNN(feature_train, label_train, feature_test, k)
Input: feature_train is a ntr × d matrix where ntr is the number of training data
samples and d is the dimension of the image feature, e.g., 256 for the 16×16 tiny image representation. Each row is an image feature. label_train ∈ [1, 15] is a ntr vector that
specifies the label of the training data. feature_test is a nte × d matrix that contains
the testing features where nte is the number of testing data samples. k is the number
of neighbors for label prediction.
Output: label_test_pred is a nte vector that specifies the predicted label for the
testing data.
Description: You will use a k-nearest neighbor classifier to predict the label of the
testing data.
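A minimal sketch using knnsearch and a majority vote is shown below.

function [label_test_pred] = PredictKNN(feature_train, label_train, feature_test, k)
    idx = knnsearch(feature_train, feature_test, 'K', k);      % nte x k neighbor indices
    neighbor_labels = reshape(label_train(idx), size(idx));    % nte x k neighbor labels
    label_test_pred = mode(neighbor_labels, 2);                % majority vote per test sample
end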
Figure 3: Confusion matrix for Tiny+KNN (accuracy: 0.205333).
function [confusion, accuracy] = ClassifyKNN_Tiny
Output: confusion is a 15 × 15 confusion matrix and accuracy is the accuracy of the
testing data prediction.
Description: You will combine GetTinyImage and PredictKNN for scene classification. Your goal is to achieve an accuracy above 18%.
Figure 4: Each row represents a distinctive cluster from bag-of-word representation.
function [vocab] = BuildVisualDictionary(training_image_cell, dic_size)
Input: training_image_cell is a set of training images and dic_size is the size of
the dictionary (the number of visual words).
Output: vocab lists the quantized visual words whose size is dic_size×128.
Description: Given a set of training images, you will build a visual dictionary made
of quantized SIFT features. You may start with dic_size=50. You can use the following
built-in functions:
• vl_dsift from VLFeat.
• kmeans from MATLAB toolbox.
You may visualize the image patches to make sense of the clustering, as shown in Figure 4.
Algorithm 2 Visual Dictionary Building
1: For each image, compute dense SIFT over regular grid
2: Build a pool of SIFT features from all training images
3: Find cluster centers from the SIFT pool using the kmeans algorithm.
4: Return the cluster centers.
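A minimal sketch following Algorithm 2 is shown below, assuming vl_dsift and kmeans; the dense-SIFT step size and the kmeans iteration cap are assumed values.

function [vocab] = BuildVisualDictionary(training_image_cell, dic_size)
    sift_pool = [];
    for i = 1:numel(training_image_cell)
        I = single(training_image_cell{i});
        [~, d] = vl_dsift(I, 'Step', 8, 'Fast');       % 128 x n descriptors on a regular grid
        sift_pool = [sift_pool; double(d')];           % pool of SIFT features (one per row)
    end
    [~, vocab] = kmeans(sift_pool, dic_size, 'MaxIter', 500);  % dic_size x 128 cluster centers
end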
Figure 5: Confusion matrix for BoW+KNN (accuracy: 0.512667).
function [bow_feature] = ComputeBoW(feature, vocab)
Input: feature is a set of SIFT features for one image, and vocab is the visual dictionary.
Output: bow_feature is the bag-of-words feature vector whose size is dic_size.
Description: Given a set of SIFT features from an image, you will compute the bag-of-words feature. The BoW feature is constructed by counting the SIFT features that fall into
each cluster of the vocabulary. Nearest neighbor can be used to find the closest cluster
center. The histogram needs to be normalized such that the BoW feature has unit length.
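A minimal sketch is shown below, assuming feature stores one descriptor per row.

function [bow_feature] = ComputeBoW(feature, vocab)
    dic_size = size(vocab, 1);
    idx = knnsearch(vocab, double(feature));           % nearest visual word per descriptor
    bow_feature = histcounts(idx, 1:dic_size+1);       % word counts, 1 x dic_size
    bow_feature = bow_feature / norm(bow_feature);     % unit length
end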
function [confusion, accuracy] = ClassifyKNN_BoW
Output: confusion is a 15 × 15 confusion matrix and accuracy is the accuracy of the
testing data prediction.
Description: Given BoW features, you will combine BuildVisualDictionary, ComputeBoW,
and PredictKNN for scene classification. Your goal is to achieve an accuracy above 50%.
function [label_test_pred] = PredictSVM(feature_train, label_train, feature_test)
Input: feature_train is a ntr × d matrix where ntr is the number of training data
samples and d is the dimension of image feature. Each row is the image feature.
label_train∈ [1, 15] is a ntr vector that specifies the label of the training data.
feature_test is a nte × d matrix that contains the testing features where nte is the
number of testing data samples.
Output: label_test_pred is a nte vector that specifies the predicted label for the
testing data.
Description: You will use a SVM classifier to predict the label of the testing data. You
don’t have to implement the SVM classifier. Instead, you can use VLFeat vl_svmtrain.
Linear classifiers are inherently binary and we have a 15-way classification problem. To
decide which of the 15 categories a test case belongs to, you will train 15 binary, 1-vs-all
SVMs. 1-vs-all means that each classifier will be trained to recognize ‘forest’ vs ‘non-forest’, ‘kitchen’ vs ‘non-kitchen’, etc. All 15 classifiers will be evaluated on each test
case, and the classifier that is most confidently positive “wins”. For instance, if the
‘kitchen’ classifier returns a score of -0.2 (where 0 is on the decision boundary), the
‘forest’ classifier returns a score of -0.3, and all of the other classifiers are even more
negative, the test case would be classified as a kitchen even though none of the classifiers put the test case on the positive side of the decision boundary. When learning an
SVM, you have a free parameter ‘lambda’ which controls how strongly regularized the
model is. Your accuracy will be very sensitive to lambda, so be sure to test many values.
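A minimal 1-vs-all sketch with vl_svmtrain is shown below; lambda = 0.01 is only a starting value that should be tuned.

function [label_test_pred] = PredictSVM(feature_train, label_train, feature_test)
    lambda = 0.01;
    n_class = 15;
    nte = size(feature_test, 1);
    scores = zeros(nte, n_class);
    for c = 1:n_class
        y = 2 * double(label_train == c) - 1;                  % +1 / -1 labels for class c
        [w, b] = vl_svmtrain(feature_train', y', lambda);      % vl_svmtrain expects d x n data
        scores(:, c) = feature_test * w + b;                   % decision values
    end
    [~, label_test_pred] = max(scores, [], 2);                 % most confident classifier wins
end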
Figure 6: Confusion matrix for BoW+SVM (accuracy: 0.629333).
function [confusion, accuracy] = ClassifySVM_BoW
Output: confusion is a 15 × 15 confusion matrix and accuracy is the accuracy of the
testing data prediction.
Description: Given BoW features, you will combine BuildVisualDictionary, ComputeBoW,
and PredictSVM for scene classification. Your goal is to achieve an accuracy above 60%.
References
[1] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories,” in CVPR, 2006.
• Assignment due: Apr 19 (11:55pm)
• Individual assignment
• Up to 2 page summary write-up with resulting visualization (more than 2 page
assignment will be automatically returned.).
• Submission through Canvas.
• Skeletal codes can be downloaded from:
https://www-users.cs.umn.edu/~hspark/csci5561/HW4_code.zip. It contains
the following four codes:
– main_slp_linear.m
– main_slp.m
– main_mlp.m
– main_cnn.m
• List of submission codes:
– GetMiniBatch.m
– FC.m
– FC_backward.m
– Loss_euclidean.m
– TrainSLP_linear.m
– Loss_cross_entropy_softmax.m
– TrainSLP.m
– ReLu.m
– ReLu_backward.m
– TrainMLP.m
– Conv.m
– Conv_backward.m
– Pool2x2.m
– Pool2x2_backward.m
– Flattening.m
– Flattening_backward.m
– TrainCNN.m
• A list of MAT files that contain the following trained weights:
– slp_linear.mat: w, b
– slp.mat: w, b
– mlp.mat: w1, b1, w2, b2
– cnn.mat: w_conv, b_conv, w_fc, b_fc
• DO NOT SUBMIT THE PROVIDED IMAGE DATA
• The function that does not comply with its specification will not be graded.
• You are allowed to use MATLAB built-in functions except for the ones in the
Computer Vision Toolbox and Deep Learning Toolbox. Please consult with TA
if you are not sure about the list of allowed functions.
Figure 1: You will implement (1) a multi-layer perceptron (neural network) and (2) a
convolutional neural network to recognize hand-written digits using the MNIST dataset.
The goal of this assignment is to implement neural networks to recognize hand-written
digits in the MNIST data.
MNIST Data You will use the MNIST hand written digit dataset to perform the first
task (neural network). We reduce the image size (28 × 28 → 14 × 14) and subsample
the data. You can download the training and testing data from here:
https://www.cs.umn.edu/~hspark/csci5561/ReducedMNIST.zip
The zip file includes two MAT files (mnist_train.mat and mnist_test.mat).
Each file includes im_* and label_* variables:
• im_* is a matrix (196 × n) storing vectorized image data (196 = 14 × 14)
• label_* is an n × 1 vector storing the label of each image.
n is the number of images. You can visualize the i-th image, e.g.,
imshow(uint8(reshape(im_train(:,i), [14,14]))).
(a) Single linear perceptron (b) Training and testing loss over iterations (c) Confusion matrix (accuracy: 0.297905)
Figure 2: You will implement a single linear perceptron that produces accuracy near
30%. Random chance is 10% on testing data.
You will implement a single-layer linear perceptron (Figure 2(a)) with stochastic gradient descent method. We provide main_slp_linear where you will implement GetMiniBatch
and TrainSLP_linear.
function [mini_batch_x, mini_batch_y] = GetMiniBatch(im_train,
label_train, batch_size)
Input: im_train and label_train are a set of images and labels, and batch_size is
the size of the mini-batch for stochastic gradient descent.
Output: mini_batch_x and mini_batch_y are cells that contain a set of batches (images and labels, respectively). Each batch of images is a matrix of size 196×batch_size,
and each batch of labels is a matrix of size 10×batch_size (one-hot encoding). Note
that the number of images in the last batch may be smaller than batch_size.
Description: You may randomly permute the order of the images when building the
batches, and the whole set of mini_batch_* must span all training data.
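A minimal sketch is shown below; it assumes the labels are the digits 0-9, as in MNIST.

function [mini_batch_x, mini_batch_y] = GetMiniBatch(im_train, label_train, batch_size)
    n = size(im_train, 2);
    perm = randperm(n);                                  % random permutation of the data
    n_batch = ceil(n / batch_size);
    mini_batch_x = cell(1, n_batch);
    mini_batch_y = cell(1, n_batch);
    for k = 1:n_batch
        idx = perm((k-1)*batch_size + 1 : min(k*batch_size, n));
        mini_batch_x{k} = im_train(:, idx);              % 196 x batch_size images
        lbl = label_train(idx);
        onehot = zeros(10, numel(idx));
        onehot(sub2ind(size(onehot), lbl(:)' + 1, 1:numel(idx))) = 1;  % digits 0-9 -> rows 1-10
        mini_batch_y{k} = onehot;                        % 10 x batch_size one-hot labels
    end
end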
function y = FC(x, w, b)
Input: x ∈ R^m is the input to the fully connected layer, and w ∈ R^(n×m) and b ∈ R^n are
the weights and bias.
Output: y ∈ R^n is the output of the linear transform (fully connected layer).
Description: FC is a linear transform of x, i.e., y = wx + b.
function [dLdx dLdw dLdb] = FC_backward(dLdy, x, w, b, y)
Input: dLdy ∈ R^(1×n) is the loss derivative with respect to the output y.
Output: dLdx ∈ R^(1×m) is the loss derivative with respect to the input x, dLdw ∈ R^(1×(n×m))
is the loss derivative with respect to the weights, and dLdb ∈ R^(1×n) is the loss derivative
with respect to the bias.
Description: The partial derivatives with respect to the input, weights, and bias will be computed.
dLdx will be back-propagated, and dLdw and dLdb will be used to update the weights
and bias.
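Minimal sketches of FC and FC_backward consistent with the shapes above are shown below; the vectorized dLdw uses MATLAB's column-major reshape, so reshape(dLdw, size(w)) recovers the matrix gradient.

function y = FC(x, w, b)
    y = w * x + b;                      % linear transform
end

function [dLdx, dLdw, dLdb] = FC_backward(dLdy, x, w, b, y)
    dLdx = dLdy * w;                    % 1 x m
    dLdw = reshape(dLdy' * x', 1, []);  % n x m gradient, vectorized; reshape(dLdw, size(w)) recovers it
    dLdb = dLdy;                        % 1 x n
end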
function [L, dLdy] = Loss_euclidean(y_tilde, y)
Input: y_tilde ∈ R^m is the prediction, and y ∈ {0, 1}^m is the ground-truth label.
Output: L ∈ R is the loss, and dLdy is the loss derivative with respect to the prediction.
Description: Loss_euclidean measures the Euclidean distance L = ||y − ỹ||^2.
function [w, b] = TrainSLP_linear(mini_batch_x, mini_batch_y)
Input: mini_batch_x and mini_batch_y are cells where each cell is a batch of images
and labels.
Output: w ∈ R^(10×196) and b ∈ R^(10×1) are the trained weights and bias of a single-layer
perceptron.
Description: You will use FC, FC_backward, and Loss_euclidean to train a single-layer perceptron using a stochastic gradient descent method; a pseudo-code can
be found below. Through training, you are expected to see a reduction of the loss as shown
in Figure 2(b). As a result of training, the network should produce more than 25%
accuracy on the testing data (Figure 2(c)).
Algorithm 1 Stochastic Gradient Descent based Training
1: Set the learning rate γ
2: Set the decay rate λ ∈ (0, 1]
3: Initialize the weights with Gaussian noise, w ∼ N(0, 1)
4: k = 1
5: for iIter = 1 : nIters do
6: At every 1000th iteration, γ ← λγ
7: ∂L/∂w ← 0 and ∂L/∂b ← 0
8: for each image x_i in the k-th mini-batch do
9: Label prediction of x_i
10: Loss computation l
11: Gradient back-propagation of x_i: compute ∂l/∂w and ∂l/∂b using back-propagation
12: ∂L/∂w ← ∂L/∂w + ∂l/∂w and ∂L/∂b ← ∂L/∂b + ∂l/∂b
13: end for
14: k++ (set k = 1 if k is greater than the number of mini-batches)
15: Update the weights, w ← w − (γ/R) ∂L/∂w, and the bias, b ← b − (γ/R) ∂L/∂b
16: end for
(a) Single-layer perceptron with soft-max (b) Training and testing loss over iterations (c) Confusion matrix (accuracy: 0.898720)
Figure 3: You will implement a single perceptron that produces accuracy near 90% on
testing data.
You will implement a single-layer perceptron with soft-max cross-entropy loss using a stochastic gradient descent method. We provide main_slp where you will implement TrainSLP.
Unlike the single-layer linear perceptron, it has a soft-max layer that approximates a
max function by squashing the output to the [0, 1] range as shown in Figure 3(a).
function [L, dLdy] = Loss_cross_entropy_softmax(x, y)
Input: x ∈ R^m is the input to the soft-max, and y ∈ {0, 1}^m is the ground-truth label.
Output: L ∈ R is the loss, and dLdy is the loss derivative with respect to x.
Description: Loss_cross_entropy_softmax measures the cross-entropy between the two distributions, L = −Σ_i^m y_i log ỹ_i, where ỹ_i is the soft-max output that approximates the max
operation by squashing x to the [0, 1] range:

ỹ_i = e^(x_i) / Σ_i e^(x_i),

where x_i is the i-th element of x.
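A minimal sketch is shown below, written with the conventional negative sign of the cross-entropy and a max-subtraction for numerical stability.

function [L, dLdy] = Loss_cross_entropy_softmax(x, y)
    x = x - max(x);                       % subtract the max for numerical stability
    y_tilde = exp(x) / sum(exp(x));       % soft-max output
    L = -sum(y .* log(y_tilde + eps));    % cross-entropy loss
    dLdy = y_tilde - y;                   % derivative of the loss with respect to x
end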
function [w, b] = TrainSLP(mini_batch_x, mini_batch_y)
Output: w ∈ R^(10×196) and b ∈ R^(10×1) are the trained weights and bias of a single-layer
perceptron.
Description: You will use the following functions to train a single-layer perceptron using a stochastic gradient descent method: FC, FC_backward, Loss_cross_entropy_softmax.
Through training, you are expected to see a reduction of the loss as shown in Figure 3(b).
As a result of training, the network should produce more than 85% accuracy on the
testing data (Figure 3(c)).
5 Multi-layer Perceptron
(a) Multi-layer perceptron (b) Confusion matrix (accuracy: 0.914553)
Figure 4: You will implement a multi-layer perceptron that produces accuracy more
than 90% on testing data.
You will implement a multi-layer perceptron with a single hidden layer using a stochastic
gradient descent method. We provide main_mlp. The hidden layer is composed of 30
units as shown in Figure 4(a).
function [y] = ReLu(x)
Input: x is a general tensor, matrix, or vector.
Output: y is the output of the Rectified Linear Unit (ReLu) with the same size as the input.
Description: ReLu is an activation unit (y_i = max(0, x_i)). In some cases, it is possible
to use a Leaky ReLu instead (y_i = max(εx_i, x_i) where ε = 0.01).
function [dLdx] = ReLu_backward(dLdy, x, y)
Input: dLdy ∈ R^(1×z) is the loss derivative with respect to the output y ∈ R^z, where z
is the size of the input (which can be a tensor, matrix, or vector).
Output: dLdx ∈ R^(1×z) is the loss derivative with respect to the input x.
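Minimal sketches of ReLu and its backward pass are shown below; the gradient is passed through only where the input was positive.

function [y] = ReLu(x)
    y = max(x, 0);
end

function [dLdx] = ReLu_backward(dLdy, x, y)
    dLdx = dLdy .* (x(:)' > 0);    % zero out gradients where the input was non-positive
end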
function [w1, b1, w2, b2] = TrainMLP(mini_batch_x, mini_batch_y)
Output: w1 ∈ R^(30×196), b1 ∈ R^(30×1), w2 ∈ R^(10×30), and b2 ∈ R^(10×1) are the trained weights
and biases of the multi-layer perceptron.
Description: You will use the following functions to train a multi-layer perceptron
using a stochastic gradient descent method: FC, FC_backward, ReLu, ReLu_backward,
Loss_cross_entropy_softmax. As a result of training, the network should produce
more than 90% of accuracy on the testing data (Figure 4(b)).
(a) CNN: Input → Conv (3) → ReLu → Pool (2×2) → Flatten → FC → Soft-max (b) Confusion matrix (accuracy: 0.947251)
Figure 5: You will implement a convolutional neural network that produces accuracy
more than 92% on testing data.
You will implement a convolutional neural network (CNN) using a stochastic gradient
descent method. We provide main_cnn. As shown in Figure 5(a), the network is
composed of: a single-channel input (14×14×1) → Conv layer (3×3 convolution with
3-channel output and stride 1) → ReLu layer → Max-pooling layer (2 × 2 with stride
2) → Flattening layer (147 units) → FC layer (10 units) → Soft-max.
function [y] = Conv(x, w_conv, b_conv)
Input: x ∈ R^(H×W×C1) is the input to the convolutional operation, and w_conv ∈ R^(H×W×C1×C2)
and b_conv ∈ R^(C2) are the weights and bias of the convolutional operation.
Output: y ∈ R^(H×W×C2) is the output of the convolutional operation. Note that to get
the same size as the input, you may pad zeros at the boundary of the input image.
Description: This convolutional operation can be simplified using MATLAB built-in
function im2col.
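A minimal sketch of Conv using im2col is shown below; it assumes a single input channel (C1 = 1), which is the case for this network, and computes a correlation (the distinction from convolution is immaterial for learned filters).

function [y] = Conv(x, w_conv, b_conv)
    [H, W, ~] = size(x);
    C2 = size(w_conv, 4);
    x_pad = zeros(H + 2, W + 2);                  % zero padding to preserve the spatial size
    x_pad(2:H+1, 2:W+1) = x(:, :, 1);
    cols = im2col(x_pad, [3 3], 'sliding');       % 9 x (H*W) patches in column-major order
    y = zeros(H, W, C2);
    for c = 1:C2
        k = w_conv(:, :, 1, c);
        response = k(:)' * cols + b_conv(c);      % filter response at every pixel
        y(:, :, c) = reshape(response, H, W);
    end
end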
function [dLdw, dLdb] = Conv_backward(dLdy, x, w_conv, b_conv, y)
Input: dLdy is the loss derivative with respect to y.
Output: dLdw and dLdb are the loss derivatives with respect to the convolutional weights
and bias, respectively.
Description: This convolutional operation can be simplified using the MATLAB built-in
function im2col. Note that for the single convolutional layer, ∂L/∂x is not needed.
function [y] = Pool2x2(x)
Input: x ∈ R^(H×W×C) is a general tensor or matrix.
Output: y ∈ R^((H/2)×(W/2)×C) is the output of the 2 × 2 max-pooling operation with stride 2.
function [dLdx] = Pool2x2_backward(dLdy, x, y)
Input: dLdy is the loss derivative with respect to the output y.
Output: dLdx is the loss derivative with respect to the input x.
function [y] = Flattening(x)
Input: x ∈ RH×W×C is a tensor.
Output: y ∈ R^(HWC) is the vectorized tensor (column-major).
function [dLdx] = Flattening_backward(dLdy, x, y)
Input: dLdy is the loss derivative with respect to the output y.
Output: dLdx is the loss derivative with respect to the input x.
function [w_conv, b_conv, w_fc, b_fc] = TrainCNN(mini_batch_x, mini_batch_y)
Output: w_conv ∈ R^(3×3×1×3), b_conv ∈ R^3, w_fc ∈ R^(10×147), and b_fc ∈ R^10 are the
trained weights and biases of the CNN.
Description: You will use the following functions to train a convolutional neural
network using a stochastic gradient descent method: Conv, Conv_backward, Pool2x2,
Pool2x2_backward, Flattening, Flattening_backward, FC, FC_backward, ReLu, ReLu_backward,
Loss_cross_entropy_softmax. As a result of training, the network should produce
more than 92% of accuracy on the testing data (Figure 5(b)).
2 Overview
In this assignment, you will implement a stereo reconstruction algorithm given two view
images.
(a) Left image (b) Right image
Figure 1: In this assignment, you will implement a stereo reconstruction algorithm
given two images.
You can download the skeletal code and data (left.bmp and right.bmp) from here:
https://www-users.cs.umn.edu/~hspark/csci5561/HW5.zip
You will fill main_stereo.m that takes input images and intrinsic parameters K, and
produces a stereo disparity map.
(a) Matching from I1 to I2 (b) Matching from I2 to I1
(c) Matching from I1 to I2 after ratio test (d) Matching from I2 to I1 after ratio test
(e) Bidirectional matching between I1 and I2
Figure 2: You will match points between I1 and I2 using SIFT features.
You will use VLFeat SIFT to extract keypoints and match between two views using
k-nearest neighbor search. The matches will be filtered using the ratio test and bidirectional consistency check.
function [x1, x2] = FindMatch(I1, I2)
Input: two input gray-scale images with uint8 format.
Output: x1 and x2 are n × 2 matrices that specify the correspondence.
Description: Each row of x1 and x2 contains the (x, y) coordinate of the point correspondence in I1 and I2, respectively, i.e., x1(i,:) ↔ x2(i,:). This matching function
is similar to HW#2 except that the bidirectional consistency check is mandatory.
(Note) Except for SIFT extraction, you are not allowed to use VLFeat functions.
Figure 3: Given matches, you will compute a fundamental matrix to draw epipolar
lines.
function [F] = ComputeF(x1, x2)
Input: x1 and x2 are n × 2 matrices that specify the correspondence.
Output: F ∈ R^(3×3) is the fundamental matrix.
Description: F is robustly computed by the 8-point algorithm within RANSAC. Note
that the rank of the fundamental matrix needs to be 2 (SVD clean-up should be applied). You can verify the validity of the fundamental matrix by visualizing the epipolar lines
as shown in Figure 3.
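A minimal sketch of the 8-point algorithm inside RANSAC is shown below; the threshold and iteration count are assumed values, the inlier test uses the algebraic error, and Hartley normalization of the coordinates (omitted here for brevity) improves conditioning in practice.

function [F] = ComputeF(x1, x2)
    n = size(x1, 1);
    thr = 0.01; n_iter = 2000;                 % assumed RANSAC parameters
    best = 0; F = eye(3);
    for t = 1:n_iter
        s = randperm(n, 8);
        Fs = EightPoint(x1(s, :), x2(s, :));
        e = abs(sum([x2 ones(n,1)] .* ([x1 ones(n,1)] * Fs'), 2));   % |x2~' F x1~|
        inl = sum(e < thr);
        if inl > best, best = inl; F = Fs; end
    end
end

function F = EightPoint(p, q)
    A = zeros(size(p, 1), 9);
    for i = 1:size(p, 1)
        u = [p(i, :) 1]; v = [q(i, :) 1];
        A(i, :) = kron(v, u);                  % linear constraint v' F u = 0 (row-major F)
    end
    [~, ~, V] = svd(A);
    F = reshape(V(:, end), 3, 3)';
    [U, D, V2] = svd(F);
    D(3, 3) = 0;                               % SVD clean-up: enforce rank 2
    F = U * D * V2';
end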
(Note) Given the fundamental matrix, you can run the provided function:
[R1 C1 R2 C2 R3 C3 R4 C4] = ComputeCameraPose(F, K)
This function computes the four sets of camera poses given the fundamental matrix
where R1 C1 · · · R4 C4 are rotation and camera center (represented in the world coordinate system) and K is the intrinsic parameter. These four configurations can be
visualized in 3D as shown in Figure 4.
Figure 4: Four configurations of camera pose from a fundamental matrix.
5 Triangulation
Given camera pose and correspondences, you will triangulate to reconstruct 3D points.
function [X] = Triangulation(P1, P2, x1, x2)
Input: P1 and P2 are two camera projection matrices (∈ R^(3×4)). x1 and x2 are n × 2
matrices that specify the correspondence.
Output: X is n × 3 where each row specifies a 3D reconstructed point.
Description: Use the linear triangulation method, i.e., solve

[ [ũ]_× P1 ; [ṽ]_× P2 ] [X ; 1] = 0

where ũ = [u; 1] and ṽ = [v; 1] are a homogeneous point correspondence u ↔ v in the
two images and [·]_× is the skew-symmetric cross-product matrix.
(Note) Use plot3 to visualize them as shown in Figure 5.
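A minimal sketch of the linear triangulation is shown below; the 3D point is the null space of the stacked constraint matrix, obtained via SVD.

function [X] = Triangulation(P1, P2, x1, x2)
    n = size(x1, 1);
    X = zeros(n, 3);
    for i = 1:n
        u = [x1(i, :) 1]';  v = [x2(i, :) 1]';
        A = [skew(u) * P1; skew(v) * P2];   % [u]_x P1 X~ = 0 and [v]_x P2 X~ = 0
        [~, ~, V] = svd(A);
        Xh = V(:, end);
        X(i, :) = Xh(1:3)' / Xh(4);         % de-homogenize
    end
end

function S = skew(a)
    S = [0 -a(3) a(2); a(3) 0 -a(1); -a(2) a(1) 0];
end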
(a) nValid = 10 (b) nValid = 488 (c) nValid = 0 (d) nValid = 0
Figure 5: You can visualize the four camera pose configurations with the reconstructed point cloud.
6 Pose Disambiguation
Given four configurations of relative camera pose and reconstructed points, you will
find the best camera pose by verifying through 3D point triangulation.
function [R,C,X] = DisambiguatePose(R1,C1,X1,R2,C2,X2,R3,C3,X3,R4,C4,X4)
Input: R1, C1, X1 · · · R4, C4, X4 are four sets of camera rotation, center, and 3D reconstructed points.
Output: R, C, X are the best camera rotation, center, and 3D reconstructed points.
Description: A 3D point must lie in front of both cameras, which can be tested by

r_3^T (X − C) > 0    (1)

where r_3 is the 3rd row of the rotation matrix. In Figure 5, nValid is the number of
points that lie in front of both cameras. Configuration (b) produces the maximum
number of valid points and is therefore the best configuration.
(a) Rectified image 1 (b) Rectified image 2
Figure 6: Stereo rectification.
Given the disambiguated camera pose, you will implement dense stereo matching between two views based on dense SIFT of VLFeat.
function [H1, H2] = ComputeRectification(K, R, C)
Input: The relative camera pose (R and C) and intrinsic parameter K.
Output: H1 and H2 are homographies that rectify the left and right images such that
the epipoles are at infinity.
Description: Given the disambiguated camera pose, you can find the rectification
rotation matrix R_rect such that the x-axis of the images aligns with the baseline. Find
the rectification homography H = K R_rect R^T K^(−1), where R is the rotation matrix of
the camera. The rectified images are shown in Figure 6. This rectification sends the
epipoles to infinity, where the epipolar lines become horizontal.
(Note) You can use the provided image warping function im_w = WarpImage(im, H)
to check your result.
Figure 7: Visualization of stereo match.
function [disparity] = DenseMatch(im1, im2)
Input: two gray-scale rectified images with uint8 format.
Output: disparity map disparity ∈ RH×W where H and W are the image height and
width.
Description: Compute dense matches across all pixels. Given a pixel u in the
left image, sweep along its epipolar line, l_u, and find the disparity d that produces the
best match, i.e.,

d = argmin_i ||d^1_u − d^2_{u+(i,0)}||²,  ∀ i = 0, 1, · · · , N

where d^1_u is the dense SIFT descriptor at u in the left image and d^2_{u+(i,0)} is the SIFT
descriptor at u + (i, 0) (displaced by i pixels along the x-axis) in the right image. Visualize
the disparity map as shown in Figure 7.