The convolution layer inside of a CNN is intrinsically an affine transformation: A vector is received
as input and is multiplied with a matrix to produce an output (to which a bias vector is usually
added before passing the result through a nonlinearity). This operation can be represented as
y = Ax, in which A describes the affine transformation.1. [2 points] We will first revisit the convolution layer as discussed in the class. Consider a
convolution layer with a 2×2 kernel W, operated on a single input channel X, represented as:
W =
w(0,0), w(0,1)
w(1,0), w(1,1)
, X =
x(0,0), x(0,1), x(0,2)
x(1,0), x(1,1), x(1,2)
x(2,0), x(2,1), x(2,2)
(1)Now let us work out a stride-3 convolution layer, with zero padding size of 1. Consider
‘flattening’ the input tensor X in row-major order as:
X =x(0,0), x(0,1), x(0,2), …, x(2,0), x(2,1), x(2,2)>
(2)
Write down the convolution as a matrix operation A that: Y = AX. Output Y is also
flattened in row-major order.2. [2 points] Recall that transposed convolution can help us upsample the feature size spatially.
Consider a transposed convolution layer with a 2×2 kernel W operated on a single input
channel X, represented as:
W =
w(0,0), w(0,1)
w(1,0), w(1,1)
, X =
x(0,0), x(0,1)
x(1,0), x(1,1)
(3)We ‘flattened’ the input tensor in row-major order as X = [x(0,0), x(0,1), x(1,0), x(1,1)].
Write down the affine transformation A corresponding to a transposed convolution layer with
kernel W, stride 2, no padding. Output Y is also flattened in row-major order.3. [4 points] Convolution layers in most CNNs consist of multiple input and output feature
maps. The collection of kernels form a 4D tensor (output channels o, input channels i, filter
rows k, filter columns k), represented in short as (o, i, k, k). For each output channel, each
input channel is convolved with a distinct 2D slice of the 4D kernel and the resulting set of
feature maps is summed element-wise to produce the corresponding output feature map.There is an interesting property that a convolutional layer with kernel size (o × r
2
, i, k, k) is
identical to a transposed convolution layer with kernel size (o, i, k × r, k × r). Here the word
‘identical’ means with the same input feature X, both operations will give the same output
Y with only a difference in the ordering of flattened elements of Y .Now let us prove the property in a restricted setting. Consider o = 1, r = 2, i = 1, k = 1.
Given the same input feature X as in equation 3, write down the affine transformation for
a convolutional layer with kernel size (4, 1, 1, 1), and show it is an operation identical to a
transposed convolution layer with kernel size (1, 1, 2, 2).1. [4 points] Implement AND and OR for pairs of binary inputs using a single linear threshold
neuron with weights w ∈ R
2
, bias b ∈ R, and x ∈ {0, 1}
2
:
f(x) = (
1 if wT x + b ≥ 0
0 if wT x + b < 0
(4)
That is, find wAND and bAND such that
x1 x2 fAND(x)
0 0 0
0 1 0
1 0 0
1 1 1
Also find wOR and bOR such that
x1 x2 fOR(x)
0 0 0
0 1 1
1 0 1
1 1 12. [4 points] Consider the XOR function
x1 x2 fXOR(x)
0 0 0
0 1 1
1 0 1
1 1 0
Show that XOR can NOT be represented using a linear model with the same form as (4).
[Hint: To see why, plot the examples from above in a plane and think about drawing a linear
boundary that separates them.]Consider a specific 2 hidden layer ReLU network with inputs x ∈ R, 1 dimensional outputs, and 2
neurons per hidden layer. This function is given by
h(x) = W(3) max{0, W(2) max{0, W(1)x + b
(1)} + b
(2)} + b
(3) (5)
with weights:
W(1) =
0.5
0.5
(6)
b
(1) =
0
1
(7)
W(2) =
1 1
1 1
(8)
b
(2) =
0
0
(9)
W(3) =1 1
(10)
b
(3) = 1 (11)An interesting property of networks with piecewise linear activations like the ReLU is that on the
whole they compute piecewise linear functions. For each of the following points give the weight
W ∈ R and bias b ∈ R (report the numerical values) which computes such that W x + b = h(x).Also compute the gradient dh
dx evaluated at the given point.
1. [2 points]
x = 1 (12)
2. [2 points]
x = −1 (13)
3. [2 points]
x = −0.5 (14)Now we’ll turn to a more recent result that highlights the Deep in Deep Learning. Depth (composing
more functions) results in a favorable combinatorial explosion in the “number of things that a neural
net can represent”. For example, to classify a cat it seems useful to first find parts of a cat: eyes, ears,
tail, fur, etc. The function which computes a probability of cat presence should be a function of these
components because this allows everything you learn about eyes to generalize to all instances of eyes
instead of just a single instance. Below you will detail one formalizable sense of this combinatorial
explosion for a particular class of piecewise linear networks.Figure 1
Consider y = σ(x) = |x| for scalar x ∈ X ⊆ R and y ∈ Y ⊆ R (Fig. 1). It has one linear region on
x < 0 and another on x > 0 and the activation identifies these regions, mapping both of them to
y > 0. More precisely, for each linear region of the input, σ(·) is a bijection. There is a mapping
to and from the output space and the corresponding input space. However, given an output y, it’s
impossible to tell which linear region of the input it came from, thus σ(·) identifies (maps on top of
each other) the two linear regions of its input. This is the crucial definition because when a function
identifies multiple regions of its domain that means any subsequent computation applies to all of
those regions. When these regions come from an input space like the space of images, functions
which identify many regions where different images might fall (e.g., slightly different images of a
cat) automatically transfer what they learn about a particular cat to cats in the other regions.
More formally, we will say that σ(·) identifies a set of M disjoint input regions R = {R1, . . . , RM}
(e.g., R = {(−1, 0),(0, 1)}) with Ri ⊆ X onto one output region O ⊆ Y (e.g., (0, 1)) if for all
Ri ∈ R there is a bijection from Ri to O.1. [4 points] Start by applying the above notion of identified regions to linear regions of one
layer of a particular neural net that uses absolute value functions as activations. Let x ∈ R
d
,
y ∈ R
d 2
, and pick weights W(1) ∈ R
d×d and bias b
(1) ∈ R
d as follows:
W
(1)
ij =
(
2 if i = j
0 if i 6= j
(15)
b
(1)
i = −1 (16)Then one layer of a neural net with absolute value activation functions is given by
f1(x) = |W(1)x + b| (17)
Note that this is an absolute value function applied piecewise and not a norm.
How many regions of the input are identified onto O = (0, 1)d by f1(·)? Prove it. 31Recall that a bijection from X to Y is a function µ : X → Y such that for all y ∈ Y there exists a unique x ∈ X
with µ(x) = y.
2Outputs are in some feature space, not a label space. Normally a linear classifier would be placed on top of what
we are here calling y.
3Absolute value activations are chosen to make the problem simpler, but a similar result holds for ReLU units.
Also, O could be the positive orthant (unbounded above).
5
2. [4 points] Next consider what happens when two of these functions are composed. Suppose
g identifies ng regions of (0, 1)d onto (0, 1)d and f identifies nf regions of (0, 1)d onto (0, 1)d
.
How many regions of its input does f ◦ g(·) identify onto (0, 1)d
?3. [6 points] Finally consider a series of L layers identical to the one in question 3.1.
h1 = |W1x + b1| (18)
h2 = |W2h1 + b2| (19)
.
.
. (20)
hL = |WLhL−1 + bL| (21)
Let x ∈ (0, 1)d and f(x) = hL. Note that each hi
is implicitly a function of x. Show that
f(x) identifies 2
Ld regions of its input.Now compare the number of identified regions for an L layer net to that of an L − 1 layer net. The
L layer net can separate its input space into 2
d more linear regions than the L − 1 layer net. On
the other hand, the number of parameters and the amount of computation time grows linearly in
the number of layers. In this very particular sense (which doesn’t always align well with practice)
deeper is better.To summarize this problem set, you’ve shown a number of results about the representation power
of different neural net architectures. First, neural nets (even single neurons) can represent logical
operations. Second, neural nets we use today compute piecewise linear functions of their input.
Third, the representation power of neural nets increases exponentially with the number of layers.
The point of the exercise was to convey intuition that removes some of the magic from neural nets
representations. Specifically, neural nets can decompose problems logically, and piecewise linear
functions can be surprisingly powerful.6 Coding: Uses of Gradients With Respect to Input [50 points for
CS7643, 40 points for CS4803]
The coding part of this assignment will have you implement 3 (CS4803) or 4 (CS7643) different ideas
which all rely on gradients of some neural net output with respect to the input image. To get started
go to https://www.cc.gatech.edu/classes/AY2020/cs7643_fall/CQQNDw8CAwYGAAQLBAUECw/hw2-q6/.
CS4803/7643:, Deep, Learning, Problem, solved
[SOLVED] Cs4803/7643: deep learning problem set 2
$25
File Name: Cs4803_7643__deep_learning_problem_set_2.zip
File Size: 376.8 KB
Reviews
There are no reviews yet.