Intro to Image Understanding (CSC420) Assignment 1
Due Date: October 4th, 2023, 10:59:00 pm Total: 160 marks
General Instructions:
- You are allowed to work directly with one other person to discuss the questions. However, you are still expected to write the solutions/code/report in your own words; i.e. no copying. If you choose to work with someone else, you must indicate this in your assignment submission. For example, on the first line of your report file (after your own name and information, and before starting your answer to Q1), you should have a sentence that says: “In solving the questions in this assignment, I worked together with my classmate [name & student number]. I confirm that I have written the solutions/code/report in my own words”.
- Your submission should be in the form of an electronic report (PDF), with the answers to the specific questions (each question separately), and a presentation and discussion of your results. For this, please submit a file named report.pdf to MarkUs directly.
- Submit the documented code that you have written to generate your results separately. Please store all of those files in a folder called assignment1, zip the folder, and then submit the file assignment1.zip to MarkUs. You should include a README.txt file (inside the folder) which details how to run the submitted code.
- Do not worry if you realize you made a mistake after submitting your zip file; you can submit multiple times on MarkUs until the deadline.
Part I: Theoretical Problems (80 marks)
[Question 1] Convolution (10 marks)
[1.a] (5 marks) Calculate and plot the correlation and the convolution of x[n] and h[n] specified below:
$$x[n] = \begin{cases} 2 & -2 \le n \le 4 \\ 0 & \text{otherwise} \end{cases} \qquad h[n] = \begin{cases} 1 & -3 \le n \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
[1.b] (5 marks) Calculate and plot the correlation and the convolution of x[n] and h[n] specified below:
$$x[n] = \begin{cases} 2 & -2 \le n \le 4 \\ 0 & \text{otherwise} \end{cases} \qquad h[n] = \begin{cases} 2 - |n| & -3 \le n \le 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$
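If you want to sanity-check your hand calculations for both parts, here is a quick numerical sketch using numpy. Note numpy's conventions: np.convolve flips the second signal, np.correlate does not; the support window used here is an assumption, chosen wide enough to contain both signals.

```python
import numpy as np

n = np.arange(-10, 11)  # support wide enough for both signals
x = np.where((n >= -2) & (n <= 4), 2, 0)               # x[n] in (1) and (2)
h1 = np.where((n >= -3) & (n <= 1), 1, 0)              # h[n] in (1)
h2 = np.where((n >= -3) & (n <= 1), 2 - np.abs(n), 0)  # h[n] in (2)

for h in (h1, h2):
    print("conv:", np.convolve(x, h))            # convolution (kernel flipped)
    print("corr:", np.correlate(x, h, "full"))   # cross-correlation (no flip)
```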
[Question 2] Polynomial Multiplication and Convolution (5 marks)
Vectors can be used to represent polynomials. For example, the 3rd-degree polynomial $a_3x^3 + a_2x^2 + a_1x + a_0$ can be represented by the vector $[a_3, a_2, a_1, a_0]$.
If u and v are vectors of polynomial coefficients, prove that convolving them is equivalent to multiplying the two polynomials they each represent.
Hint: You need to assume proper zero-padding to support the full-size convolution.
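As a concrete numerical illustration of the claim (not a proof), here is a sketch with two hypothetical coefficient vectors in numpy:

```python
import numpy as np

u = np.array([3, 0, 2, 1])  # 3x^3 + 0x^2 + 2x + 1
v = np.array([1, 4])        # x + 4

print(np.convolve(u, v))    # [ 3 12  2  9  4]
print(np.polymul(u, v))     # identical: 3x^4 + 12x^3 + 2x^2 + 9x + 4
```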
[Question 3] Laplacian Operator (10 marks)
The Laplace operator is a second-order differential operator in $n$-dimensional Euclidean space, defined as the divergence ($\nabla\cdot$) of the gradient ($\nabla f$). Thus, if $f$ is a twice-differentiable real-valued function, the Laplacian of $f$ is defined by:
$$\Delta f = \nabla^2 f = \nabla \cdot \nabla f = \sum_{i=1}^{n} \frac{\partial^2 f}{\partial x_i^2}$$

where the latter notations derive from formally writing $\nabla = \left(\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_n}\right)$.
Now, consider a 2D image $I(x, y)$ and its Laplacian, given by $\Delta I = I_{xx} + I_{yy}$. Here the second partial derivatives are taken with respect to the directions of the variables $x, y$ associated with the image grid, for convenience. Show that the Laplacian is rotation invariant. In other words, show that $\Delta I = I_{rr} + I_{r'r'}$, where $r$ and $r'$ are any two orthogonal directions.
Hint: Start by using polar coordinates to describe a chosen location (x, y). Then use the chain rule.
[Question 4] Computational Complexity (5 marks)
Assume that we have a convolution implementation J = conv2(F, I) that takes two images ($F_{k\times k}$ and $I_{n\times n}$) and returns the output in $O(k^2n^2)$ time. Given an image $I_{n\times n}$ and two filters $F_{k\times k}$ and $G_{k\times k}$, we want to compute $G * (F * I)$. Is it more efficient to call conv2(G, conv2(F, I)) or conv2(conv2(G, F), I)? Briefly justify your answer.
[Question 5] Image Pyramids (10 marks)
In Gaussian pyramids, the image at each level $I_k$ is constructed by blurring the image at the previous level $I_{k-1}$ and downsampling it by a factor of 2. A Laplacian pyramid, on the other hand, consists of the difference between the image at each level ($I_k$) and the upsampled version of the image at the next level of the Gaussian pyramid ($I_{k+1}$).
Given an image of size $2^n \times 2^n$ denoted by $I_0$, and its Laplacian pyramid representation denoted by $L_0, \ldots, L_{n-1}$, show how we can reconstruct the original image using the minimum information from the Gaussian pyramid. Specify the minimum information required from the Gaussian pyramid and a closed-form expression for reconstructing $I_0$.
Hint: The reconstruction follows a recursive process; What is the base case that contains the minimum information?
[Question 6] Back Propagation (10 marks)
Consider a neural network that represents the following function:
$$\hat{y} = \sigma\big(w_5\,\sigma(w_1 x_1 + w_2 x_2) + w_6\,\sigma(w_3 x_3 + w_4 x_4)\big)$$
where $x_i$ denotes the input variables, $\hat{y}$ is the output variable, and $\sigma$ is the logistic function:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Suppose the loss function used for training this neural network is the L2 loss, i.e. $L(y, \hat{y}) = (y - \hat{y})^2$. Assume that the network has its weights set as:

$$(w_1, w_2, w_3, w_4, w_5, w_6) = (-0.20, -0.3, 0.40, 0.50, -0.40, 0.3)$$
Hint: Express the output of the network as a function of its inputs and the weights of its layers.
[6.a] (5 marks) Draw the computational graph for this function. Define appropriate intermediate variables on the computational graph. (Break the logistic function into smaller components.)
[6.b] (5 marks) Given an input data point $(x_1, x_2, x_3, x_4) = (-1.1, 1.3, -1.5, 2.0)$ with a true label of 0.0, compute the partial derivative $\frac{\partial L}{\partial w_4}$ using the back-propagation algorithm.
Indicate the partial derivatives of your intermediate variables on the computational graph.
Round all your calculations to 4 decimal places.
Hint: For any vector (or scalar) $x$, we have $\frac{\partial}{\partial x}\|x\|_2^2 = 2x$. Also, you do not need to write any code for this question! You can do it by hand.
[Question 7] CNN FLOPs (10 marks)
In this problem, our goal is to estimate the computation overhead of CNNs by counting FLOPs (floating-point operations). Consider a convolutional layer C followed by a max pooling layer P. The input of layer C has 50 channels, each of which is of size 12 × 12. Layer C has 20 filters, each of which is of size 4 × 4. The convolution padding is 1 and the stride is 1. Layer P performs max pooling over each of C's output feature maps, with 3 × 3 local receptive fields, and stride 1.
Given scalar inputs $x_1, x_2, \ldots, x_n$, we assume:

- A scalar multiplication $x_i \cdot x_j$ accounts for one FLOP.
- A scalar addition $x_i + x_j$ accounts for one FLOP.
- A max operation $\max(x_1, x_2, \ldots, x_n)$ accounts for $n - 1$ FLOPs.
- All other operations do not account for FLOPs.

How many FLOPs do layers C and P conduct in total during one forward pass, with and without accounting for bias?
Hint: Find the size of the output tensor for layer C. Per output dimension, first calculate the number of multiplications then the additions (with or without bias). For layer P, follow the same procedure.
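For reference, the standard output-size relation (general background, not data specific to this question): a convolution over an $n \times n$ input with kernel size $k$, padding $p$, and stride $s$ produces an output of side length

$$o = \left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1.$$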
[Question 8] Trainable Parameters (10 marks)
The following CNN architecture is one of the most influential architectures presented in the 1990s. Count the total number of trainable parameters in this network. Note that the Gaussian connections in the output layer can be treated as a fully connected layer similar to F6.
[Question 9] Logistic Activation Function (10 marks)
For backpropagation in a node with a logistic activation function, show that the gradient can be computed using only the node's output; there is no need for the node's input.
Hint: Find the derivative of a neuron’s output with respect to its inputs.
Part II: Implementation Tasks (80 marks)
In this part, we train (or fine-tune) a few different neural network models to classify dog breeds. We also investigate their dataset bias and cross-dataset performance. All tasks should be implemented in Python with a deep learning package of your choice, e.g. PyTorch or TensorFlow.
We use two datasets in this assignment.
- The Stanford Dogs Dataset (SDD) contains over 20,000 images of 120 different dog breeds. The annotations available for this dataset include class labels (i.e. dog breed names) and bounding boxes. In this assignment, we'll only be using the class labels. Further, we will only use a small portion of the dataset (as described below) so you can train your models on Colab.
- Dog Breed Images (DBI) is a smaller dataset containing images of 10 different dog breeds.
To prepare the data for the implementation tasks, follow these steps:
- Download both datasets and unzip them. There are 7 dog breeds that appear in both datasets:
  - Bernese mountain dog
  - Border collie
  - Chihuahua
  - Golden retriever
  - Labrador retriever
  - Pug
  - Siberian husky
- Delete the folders associated with the remaining dog breeds in both datasets. You can also delete the folders associated with the bounding boxes in the SDD.
- For the 7 breeds that are present in both datasets, the names might be written slightly differently (e.g. Labrador Retriever vs. Labrador). Manually rename the folders so the names match (e.g. make them both labrador retriever).
- Rename the folders to indicate that they are subsets of the original datasets (to avoid potential confusion if you later want to use them for another project), e.g. SDDsubset and DBIsubset. Each of these should now contain 7 subfolders (e.g. border collie, pug, etc.), and the names should match.
- Zip the two folders (e.g. SDDsubset.zip and DBIsubset.zip) and upload them to your Google Drive (if you want to use Google Colab).
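Before training anything, it may help to verify the folder layout. A minimal sketch (the paths are placeholders for wherever you unzipped the subsets):

```python
from torchvision import datasets

# Placeholder paths; adjust to your local or Colab directory layout.
for root in ["SDDsubset", "DBIsubset"]:
    ds = datasets.ImageFolder(root)
    print(root, "->", len(ds), "images,", len(ds.classes), "classes:", ds.classes)
    # Both datasets should report the same 7 class names if renaming was done correctly.
```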
You can find sample code working with the SDD on the internet. If you want, you are welcome to look at these examples and use them as your starting code, or use code snippets from them. You will need to modify the code, as our questions ask you to do different tasks than the ones in these online examples, but using and copying code snippets from these resources is fine. If you choose to use one of these online examples as your starting code, please acknowledge them in your submission. We also suggest that, before you start modifying the starting code, you run it as is on your data (e.g. DBIsubset) to 1) make sure your dataset setup is correct, and 2) make sure you fully understand the starter code before you start modifying it.
Task I – Inspection (5 marks):
Look at the images in both datasets, and briefly explain whether you observe any systematic differences between images in one dataset vs. the other.
Task II – Simple CNN Training on the DBI (10 marks):
Construct a simple convolutional neural network (CNN) for classifying the images in the DBI. For example, you can construct a network as follows:
- convolutional layer – 16 filters of size 3 × 3
- batch normalization
- convolutional layer – 16 filters of size 3 × 3
- max pooling (2 × 2)
- convolutional layer – 8 filters of size 3 × 3
- batch normalization
- convolutional layer – 8 filters of size 3 × 3
- max pooling (2 × 2)
- dropout (e.g. 0.5)
- fully connected (32)
- dropout (0.5)
- softmax
If you want, you can change these specifications; but if you do so, please specify the changes in your submission. Use ReLU as your activation function, and cross-entropy as your cost function. Train the model with the optimizer of your choice, e.g. SGD, Adam, or RMSProp. Use random cropping, random horizontal flipping, random colour jitter, and random rotations for augmentation. Make sure to tune the parameters of your optimizer to get the best performance on the validation set. A minimal sketch of one possible implementation follows.
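This sketch is one possible reading of the suggested architecture, not a reference implementation; the input resolution (128 × 128), the convolution padding, and the number of classes are assumptions you are free to change:

```python
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Sketch of the suggested architecture; assumes 3-channel 128x128 inputs."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.BatchNorm2d(16),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.BatchNorm2d(8),
            nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(8 * 32 * 32, 32),  # two 2x pools: 128 -> 64 -> 32 per side
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(32, num_classes),  # raw logits; nn.CrossEntropyLoss applies softmax
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```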
Plot the training and test accuracy over the first 10 epochs. Note that accuracy is different from the loss function; accuracy is defined as the percentage of images classified correctly.
Train the same CNN model again, this time without dropout. Plot the training and test accuracy over the first 10 epochs, and compare them with the model trained with dropout. Report the impact of dropout on training and on generalization to the test set.
Task III – ResNet Training on the DBI (15 marks):
[III.a] (10 marks) ResNet models were proposed in the “Deep Residual Learning for Image Recognition” paper. These models have had great success in image recognition on benchmark datasets. In this task, we use the ResNet-18 model for the classification of the images in the DBI dataset. To do so, use the ResNet-18 model from PyTorch, modify the input/output layers to match your dataset, and train the model from scratch; i.e., do not use the pre-trained ResNet. Plot the training, validation, and testing accuracy, and compare those with the results of your CNN model.
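A minimal sketch of obtaining the untrained model and matching its output layer to the 7 classes (weights=None requests random initialization in recent torchvision versions):

```python
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)          # from scratch: no pre-trained weights
model.fc = nn.Linear(model.fc.in_features, 7)  # 7 shared dog breeds
```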
[III.b] (5 marks) Run the trained model on the entire SDD dataset and report the accuracy. Compare the accuracy obtained on the (test set of) DBI, vs. the accuracy obtained on the SDD. Which is higher? Why do you think that might be? Explain very briefly, in one or two sentences.
Task IV – Fine-tuning on the DBI (20 marks):
Similar to the previous task, use the following models from PyTorch (within torchvision): ResNet18, ResNet34, ResNeXt50, SwinTransformer (tiny), and a fifth model of your choosing from torchvision or timm. ResNet18, ResNet34, and ResNeXt50 are convolutional networks and Swin is a transformer-based architecture. For fine-tuning, you will need to replace the final layer so the output matches the number of classes in your dataset. Hint: The final layer might have a different name in each model.
This time, you are expected to use the pre-trained models and fine-tune the input/output layers on the DBI training data. Report the accuracy of these fine-tuned models on the DBI test set, and also on the entire SDD dataset.
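Since the classifier layer is named differently across these architectures, one hedged way to handle the replacement is sketched below (the weight enum names assume a recent torchvision):

```python
import torch.nn as nn
from torchvision import models

def replace_head(model, num_classes=7):
    # ResNet/ResNeXt expose the classifier as `fc`; Swin exposes it as `head`.
    if hasattr(model, "fc") and isinstance(model.fc, nn.Linear):
        model.fc = nn.Linear(model.fc.in_features, num_classes)
    elif hasattr(model, "head") and isinstance(model.head, nn.Linear):
        model.head = nn.Linear(model.head.in_features, num_classes)
    else:
        raise ValueError("Inspect print(model) to find the classifier layer.")
    return model

resnet34 = replace_head(models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1))
swin_t   = replace_head(models.swin_t(weights=models.Swin_T_Weights.IMAGENET1K_V1))
```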
Discuss the cross-dataset performance of these trained models. Which models generalized better to the new dataset? For example, are there cases in which two different models perform equally well on the test portion of the DBI but show significant performance differences when evaluated on the SDD? Are there models for which the performance gap between the SDD and the test portion of the DBI is very small?
Task V – Dataset detection (15 marks):
Train a model that – instead of classifying dog breeds – can distinguish whether a given image is more likely to belong to the SDD or the DBI. To do so, you first need to divide your data into training and test sets (and possibly a validation set if you need one for tuning the hyperparameters of your model). You need to either reorganize the datasets (to load the images using torchvision.datasets.ImageFolder) or write your own data loader. You can start from a pre-trained model (of your choice) and fine-tune it on the training portion of the dataset. Include your network model specification in the report, and make sure to include your justification for that choice. Report your model's accuracy on the test portion of the dataset.
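If you take the reorganization route, here is a sketch under an assumed (hypothetical) folder layout; the directory names are placeholders:

```python
from torchvision import datasets, transforms

# Hypothetical layout:
#   dataset_detection/train/DBI/*.jpg   dataset_detection/train/SDD/*.jpg
#   dataset_detection/test/DBI/*.jpg    dataset_detection/test/SDD/*.jpg
tf = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.ImageFolder("dataset_detection/train", transform=tf)
test_set  = datasets.ImageFolder("dataset_detection/test",  transform=tf)
print(train_set.class_to_idx)  # e.g. {'DBI': 0, 'SDD': 1}
```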
Task VI – How to improve performance on SDD? (10 marks):
If our goal were to have good performance on the SDD dataset, briefly discuss how to work towards this goal in each of the following cases (you don't need to implement these; just briefly discuss each case in 2-3 sentences):
- At training time, we have access to the entire DBI dataset, but none of the SDD dataset. All we know is a high-level description of the SDD and its differences from the DBI (similar to the answer you provided for Task I of this question).
- At training time, we have access to the entire DBI dataset and a small portion (e.g. 10%) of the SDD dataset.
- At training time, we have access to the entire DBI dataset and a small portion (e.g. 10%) of the SDD dataset, but without the SDD labels for this subset.
Task VII – Discussion (5 marks):
Briefly discuss how some of the issues examined in this exercise can have implications in real applications, e.g. as related to bias or performance. For example, consider the case where the available training datasets were collected in one setting (e.g. a university) and the goal is to deploy the trained models in another setting (e.g. a retirement home).
Summary of implementation tasks:
The train/val/test split is up to you. You can use the same split used in the sample code linked, i.e. train: 60%, validation: 10%, test: 30%; but you can use other (reasonable) splits too if you want. Just specify in your report what you did.
Task       | What we want to do        | Train | Validation | Test
-----------|---------------------------|-------|------------|-----
Task II    | dog breed classification  | DBI   | DBI        | DBI
Task III.a | dog breed classification  | DBI   | DBI        | DBI
Task III.b | dog breed classification  | DBI   | DBI        | SDD
Task IV    | dog breed classification  | DBI   | DBI        | both
Task V     | dataset classification    | both  | both       | both
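For reference, a hedged sketch of the suggested 60/10/30 split (assumes a torch dataset such as an ImageFolder, and fixes a seed for reproducibility):

```python
import torch
from torch.utils.data import random_split

def split_60_10_30(dataset, seed=0):
    n = len(dataset)
    n_train, n_val = int(0.6 * n), int(0.1 * n)
    n_test = n - n_train - n_val  # remainder (~30%) goes to test
    gen = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val, n_test], generator=gen)
```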