
School of Computer Science

Third Year Undergraduate

06-32167

32167 LH Neural Computation

Main Summer Examinations 2022 [Answer all questions]

Neural Computation


Question 1

Let us consider solving regression problems with a neural network. In particular, we consider a neural network of the following structure: a two-dimensional input layer a^1 = (a^1_1, a^1_2), a hidden layer with two neurons, and a single output neuron a^3_1.

As illustrated in the lecture, we have the following relationship between variables in the neural network.

$$
z^2 = \begin{pmatrix} z^2_1 \\ z^2_2 \end{pmatrix}
    = \begin{pmatrix} \omega^2_{11} & \omega^2_{12} \\ \omega^2_{21} & \omega^2_{22} \end{pmatrix}
      \begin{pmatrix} a^1_1 \\ a^1_2 \end{pmatrix}
    + \begin{pmatrix} b^2_1 \\ b^2_2 \end{pmatrix},
\qquad
a^2 = \begin{pmatrix} a^2_1 \\ a^2_2 \end{pmatrix}
    = \begin{pmatrix} \sigma(z^2_1) \\ \sigma(z^2_2) \end{pmatrix},
$$

where σ is the activation function. For simplicity of computation, we always use σ(x) = x² in this neural network. In a similar way, there is also a relationship between z^3_1, a^3_1 and a^2.
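For concreteness, these layer relations can be written as a short NumPy sketch (an illustrative sketch only, assuming the 2-2-1 structure and σ(x) = x² described above; helper names such as `forward` are illustrative, not from the paper):

```python
import numpy as np

def sigma(x):
    # Activation used throughout this question: sigma(x) = x^2, applied element-wise.
    return x ** 2

def forward(a1, W2, b2, W3, b3):
    """Forward pass through the 2-2-1 network described above.

    a1: input vector a^1, shape (2,);  W2: shape (2, 2);  b2: shape (2,);
    W3: shape (1, 2);  b3: shape (1,).  Returns z^2, a^2, z^3, a^3.
    """
    z2 = W2 @ a1 + b2      # z^2 = W^2 a^1 + b^2
    a2 = sigma(z2)         # a^2 = sigma(z^2)
    z3 = W3 @ a2 + b3      # z^3 = W^3 a^2 + b^3
    a3 = sigma(z3)         # a^3 = sigma(z^3)
    return z2, a2, z3, a3
```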

  1. Compute the number of trainable parameters required to determine this neural network. Please explain your answer. [3 marks]

  2. Suppose

$$
W^2 = \begin{pmatrix} \omega^2_{11} & \omega^2_{12} \\ \omega^2_{21} & \omega^2_{22} \end{pmatrix}
    = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
\qquad
W^3 = (\omega^3_{11}, \omega^3_{12}) = (1, 1),
\qquad
b^2 = \begin{pmatrix} b^2_1 \\ b^2_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 1 \end{pmatrix},
\qquad
b^3_1 = 3.
$$

    Consider the training example

$$
x = \begin{pmatrix} a^1_1 \\ a^1_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \end{pmatrix},
\qquad y = 1.
$$

    Let us consider the square loss function C_{x,y}(W, b) = ½(a^3_1 − y)², where W = {W^2, W^3}, b = {b^2, b^3}. Use the forward propagation algorithm to compute a^2, a^3_1 and the loss C_{x,y}(W, b) for using the neural network to do prediction on the above example (x, y). Please write down your step-by-step calculations. [7 marks]

  3. Let us consider the neural network with the above W^2, W^3, b^2, b^3 and the above training example (x, y). Use the back propagation algorithm to compute the gradients. For simplicity, we only require you to compute the explicit values of

$$
\frac{\partial C_{x,y}(W, b)}{\partial \omega^3_{11}}, \quad
\frac{\partial C_{x,y}(W, b)}{\partial \omega^3_{12}}, \quad
\frac{\partial C_{x,y}(W, b)}{\partial z^3_1}, \quad
\frac{\partial C_{x,y}(W, b)}{\partial z^2_1}, \quad
\frac{\partial C_{x,y}(W, b)}{\partial z^2_2}.
$$

    Please write down your step-by-step calculations. [10 marks]
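The chain rule behind these gradients can also be sketched in NumPy (again purely illustrative: it assumes the same 2-2-1 network with σ(x) = x², so σ′(x) = 2x, and takes whatever W^2, b^2, W^3, b^3 and (x, y) are given above as arguments):

```python
import numpy as np

def sigma(x):
    return x ** 2              # activation sigma(x) = x^2

def sigma_prime(x):
    return 2 * x               # derivative of sigma

def gradients(a1, y, W2, b2, W3, b3):
    """Backpropagation for the 2-2-1 network with square loss C = 0.5 * (a^3_1 - y)^2.
    Returns dC/dz^3, dC/dW^3 and dC/dz^2."""
    # Forward pass (same relations as in the sketch above).
    z2 = W2 @ a1 + b2
    a2 = sigma(z2)
    z3 = W3 @ a2 + b3
    a3 = sigma(z3)
    # Backward pass (chain rule).
    dC_dz3 = (a3 - y) * sigma_prime(z3)           # dC/dz^3_1 = (a^3_1 - y) * sigma'(z^3_1)
    dC_dW3 = np.outer(dC_dz3, a2)                 # dC/dw^3_{1j} = dC/dz^3_1 * a^2_j
    dC_dz2 = (W3.T @ dC_dz3) * sigma_prime(z2)    # dC/dz^2_j = dC/dz^3_1 * w^3_{1j} * sigma'(z^2_j)
    return dC_dz3, dC_dW3, dC_dz2
```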

Question 2

Given the weights (w_1, w_2, w_3) and the biases (b_2, b_3), we have the following recurrent neural network (RNN), which takes in an input vector x_t and a hidden state vector h_{t−1} and returns an output vector y_t:

$$
y_t = g\big(w_3 f(w_1 x_t + w_2 h_{t-1} + b_2) + b_3\big), \tag{1}
$$

where g and f are activation functions. The following computational graph depicts such an RNN.

Figure 1: RNN Computational Graph

  1. Write down clearly which part of Equation (1) defines the current (updated) hidden state vector h_t shown in Figure 1. [3 marks]

  2. When t = 3 (starting from 1), please show how information is propagated through time by drawing an unfolded feedforward neural network that corresponds to the RNN in Figure 1. Please make sure that hidden states, inputs and outputs as well as network weights and biases are annotated on your network. [4 marks]

  3. Assume x_t, h_{t−1}, h_t and y_t are all scalars in Equation (1), and the activation functions are a linear unit and a binary threshold unit, respectively defined as:

$$
g(x) = x, \qquad
f(x) = \begin{cases} 0 & \text{if } x < 0, \\ 1 & \text{if } x \ge 0. \end{cases}
$$

    When t = 3 (starting from 1), please calculate the values of the outputs (y_1, y_2, y_3) given (w_1 = 1, w_2 = 3, w_3 = 5), (b_2 = 1, b_3 = 3), (x_1 = 5, x_2 = 3, x_3 = 1) and h_0 = 0. Please show your calculations in detail. [3 marks]

  4. Again let us assume x_t, h_{t−1}, h_t and y_t are all scalars with h_0 = 0 and the activation functions are the same as above. Compute (w_1, w_2, w_3) and (b_2, b_3) such that the network outputs 0 initially, but once it receives an input of 1, it outputs 1 for all subsequent time steps. For example, if the input is 00001000100, the output will be 00001111111. Please justify your answer.

Note: here we want a solution that satisfies (1) the hidden state h_t is zero until the input x_t becomes 1, at which point the hidden state changes to 1 forever, and (2) the output always predicts the same as the hidden state, i.e. y_t = h_t. [10 marks]
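A minimal Python sketch of how the recurrence in Equation (1) unrolls over time may help when checking the calculations in parts 3 and 4 (the helper names are illustrative; the parameters shown are the ones given in part 3):

```python
def f_threshold(x):
    # Binary threshold unit: 0 if x < 0, else 1.
    return 0.0 if x < 0 else 1.0

def g_linear(x):
    # Linear unit: g(x) = x.
    return x

def run_rnn(xs, w1, w2, w3, b2, b3, h0=0.0):
    """Unroll the scalar RNN y_t = g(w3 * f(w1 * x_t + w2 * h_{t-1} + b2) + b3)."""
    h, ys = h0, []
    for x in xs:
        h = f_threshold(w1 * x + w2 * h + b2)   # updated hidden state h_t
        ys.append(g_linear(w3 * h + b3))        # output y_t
    return ys

# Part 3 parameters: (w1, w2, w3) = (1, 3, 5), (b2, b3) = (1, 3), inputs (5, 3, 1), h0 = 0.
print(run_rnn([5, 3, 1], w1=1, w2=3, w3=5, b2=1, b3=3))   # -> [8.0, 8.0, 8.0]
```

The same helper can be used in part 4 to test a candidate (w_1, w_2, w_3, b_2, b_3): feed it a sequence such as 00001000100 and check that the output stays 0 until the first 1 arrives and is 1 from then on.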

Question 3

Note: Each item below can be answered with approximately 5 lines of text. This is only an informal indication, not a hard constraint. Length of answers will not influence marks.

  1. Consider the Variational Auto-Encoder (VAE), with an encoder f_φ(x) that predicts the mean µ_φ(x) and standard deviation σ_φ(x) of a multi-dimensional Gaussian that is the conditional p_φ(z|x), and a decoder g_θ(z). The VAE's loss for each d-dimensional input vector x is:

$$
L_{VAE} = \lambda_{rec} L_{rec}(x) + \lambda_{reg} L_{reg}(x), \tag{2}
$$

    where

$$
L_{rec}(x) = \frac{1}{d} \sum_{j=1}^{d} \big( x^{(j)} - g_\theta^{(j)}(\tilde z) \big)^2
\quad \text{for a sample } \tilde z \sim p_\phi(z \mid x)
$$

    is the reconstruction loss,

$$
L_{reg}(x) = \frac{1}{2} \sum_{j=1}^{v} \Big[ \big(\mu_\phi^{(j)}(x)\big)^2 + \big(\sigma_\phi^{(j)}(x)\big)^2 - 2 \log_e \sigma_\phi^{(j)}(x) - 1 \Big]
$$

    is the regularizer, z is a v-dimensional vector and log_e is the natural logarithm. λ_rec and λ_reg are non-trainable scalars for weighting L_rec and L_reg. h^{(j)} denotes the j-th element of vector h.

    1. If you train minimizing only the regularizer L_reg (i.e. λ_rec = 0 and λ_reg > 0), what values do you expect the encoder will tend to predict for the means µ_φ(x) and standard deviations σ_φ(x)? Explain why. [4 marks]

    2. Assume that z is 2-dimensional (i.e. v = 2). Assume that for an input data point x_1 the encoder outputs the vectors µ_φ(x_1) = (0.5, 0.1) and σ_φ(x_1) = (0.1, 0.3). Calculate the value of L_reg(x_1). Show the steps of the calculation. (Note: for simplicity, use log_e 0.1 ≈ −2.3 and log_e 0.3 ≈ −1.2.) [4 marks]

    3. Assume you are given an implementation of the above VAE with a bottleneck (i.e. v < d ). You are asked to train the VAE so that it will be as good as possible for the task of compressing data (via bottleneck) and uncompressing them with fidelity. Generation of fake data or other applications are not of interest. What values would you choose for λrec and λreg? For each, specify either equal to 0 or greater than 0. Explain why. [5 marks]
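As an illustration of how the regularizer above is evaluated (a small sketch under the L_reg definition given earlier; it can be used to check the hand calculation asked for in item 2):

```python
import math

def l_reg(mu, sigma):
    """Regularizer L_reg = 0.5 * sum_j [ mu_j^2 + sigma_j^2 - 2*ln(sigma_j) - 1 ]."""
    return 0.5 * sum(m ** 2 + s ** 2 - 2 * math.log(s) - 1 for m, s in zip(mu, sigma))

# Item 2's encoder outputs: mu = (0.5, 0.1), sigma = (0.1, 0.3).
# With the approximate logs given there (ln 0.1 ~ -2.3, ln 0.3 ~ -1.2) the hand calculation
# gives about 0.5 * (3.86 + 1.50) = 2.68; math.log returns the exact value, which is close.
print(l_reg([0.5, 0.1], [0.1, 0.3]))
```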

  2. Consider a Generative Adversarial Network (GAN) that consists of a Generator G that takes an input noise vector z and outputs G(z), and a Discriminator D that, given an input x, outputs D(x). We assume that D(x) = 1 means that D predicts with certainty that the input x is a real data point, and D(x) = 0 means D predicts with certainty that x is a fake, generated sample. Figure 2 shows two loss functions that could be used for training G. Which of the two loss functions in Figure 2 is more appropriate for training G in practice? Explain why, based on the gradients for the lowest and highest values of D(G(z)) and how they would influence training. [7 marks]

Figure 2: Two loss functions that could be used (minimized) for training the Generator G of a GAN, shown as a function of the Discriminator D's predicted probability that a generated sample G(z) is real. On the x-axis, D(G(z)) = 1 means D predicts with certainty that G(z) is real, whereas D(G(z)) = 0 means D predicts with certainty that G(z) is fake.
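As a point of reference for the question above, the two generator losses usually compared in this setting are L_G = log(1 − D(G(z))) (the "saturating" loss) and L_G = −log D(G(z)) (the "non-saturating" loss); whether these are exactly the curves in Figure 2 is an assumption. A tiny sketch of their gradients with respect to D(G(z)), purely for illustration:

```python
def grad_saturating(d):
    # d/dd [ log(1 - d) ] = -1 / (1 - d): magnitude ~1 when d is near 0, steep only as d -> 1.
    return -1.0 / (1.0 - d)

def grad_non_saturating(d):
    # d/dd [ -log(d) ] = -1 / d: steep when d is near 0, magnitude ~1 as d -> 1.
    return -1.0 / d

for d in (0.01, 0.5, 0.99):
    print(f"D(G(z)) = {d}: saturating {grad_saturating(d):.1f}, non-saturating {grad_non_saturating(d):.1f}")
```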
