Backpropagation
1. Why do we need to rescale the input data before training an MLP?
MLPs use metric information (distances) to determine the error. Therefore, dimensions whose scale is far larger than the others will dominate the error, making the smaller-scale features effectively irrelevant.
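A minimal sketch of such rescaling (standardisation to zero mean and unit variance per feature), assuming a NumPy feature matrix X with one sample per row; the function name and the small epsilon are illustrative:

import numpy as np

def standardise(X, eps=1e-8):
    """Rescale each feature column to zero mean and unit variance."""
    mu = X.mean(axis=0)               # per-feature mean
    sigma = X.std(axis=0)             # per-feature standard deviation
    return (X - mu) / (sigma + eps)   # eps guards against constant features

# Example: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
X_scaled = standardise(X)  # both columns now contribute comparably to distances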
2. How does the error propagate from the output layer to the hidden layer?
Through the deltas: the output-layer delta $\delta_j = (z_j - t_j)\, z_j (1 - z_j)$ is computed first, and is then passed back to the hidden neurons through the connecting weights (see question 5).
3. Why do we compute the gradient of the error?
In order to perform gradient descent: the anti-gradient (the direction opposite to the gradient) is the direction of steepest descent.
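A small illustration of following the anti-gradient, using a hypothetical one-dimensional example $f(x) = x^2$ and a made-up learning rate:

def f_grad(x):
    return 2.0 * x           # gradient of f(x) = x^2

eta = 0.1                    # learning rate (illustrative value)
x = 5.0
for _ in range(10):
    x = x - eta * f_grad(x)  # step along the anti-gradient
print(x)                     # approaches the minimiser x = 0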
4. Derive the update rule of the output neurons according to backpropagation.
We aim to minimise the squared error $E = \frac{1}{2}(y - t)^2$, where $y$ is the output of the MLP and $t$ is the desired class. The output $y = \sigma(a)$ is a sigmoid applied to the dot product between the weights of the output neuron and its inputs (including the bias input): $a = \sum_i w_i z_i$. We derive the gradient of the error with respect to each weight by applying the chain rule:
$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial a} \frac{\partial a}{\partial w_i} = (y - t)\, y (1 - y)\, z_i$,
where we used the derivative of the sigmoid, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$. Substituting the gradient into the general update rule for gradient descent, $x_{t+1} = x_t - \eta \nabla f(x_t)$, we obtain the update rule for the output weights:
$w_i^{t+1} = w_i^{t} - \eta\, (y - t)\, y (1 - y)\, z_i$.
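A minimal sketch of this update for a single output neuron, assuming NumPy, a sigmoid activation, an input vector z (including the bias term), a weight vector w, a target t, and an illustrative learning rate eta:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def output_weight_update(w, z, t, eta=0.1):
    """One gradient-descent step on E = 1/2 (y - t)^2 for an output neuron."""
    a = np.dot(w, z)                  # weighted sum of the inputs
    y = sigmoid(a)                    # neuron output
    delta = (y - t) * y * (1.0 - y)   # dE/da, using sigma'(a) = y(1 - y)
    grad = delta * z                  # dE/dw_i = delta * z_i
    return w - eta * grad             # w^{t+1} = w^t - eta (y - t) y (1 - y) z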
5. Derive the update rule of the hidden neurons according to backpropagation.
For a hidden neuron $j$, we need to compute its delta:
$\delta_j = \frac{\partial E}{\partial a_j} = \sum_k \frac{\partial E}{\partial a_k} \frac{\partial a_k}{\partial a_j} = \sum_k \delta_k w_{jk}\, z_j (1 - z_j)$,
where the $a_k$ are the activations of the neurons that follow neuron $j$ in the network, $a_k = \sum_j w_{jk}\, \sigma(a_j)$, so $\frac{\partial a_k}{\partial a_j} = w_{jk}\, \sigma(a_j)(1 - \sigma(a_j))$, and $\sigma(a_j) = z_j$. Then, we can chain this with the derivative of $a_j$ with respect to any of the weights connected to its input, $w_{ij}$: $\frac{\partial a_j}{\partial w_{ij}} = z_i$. So the gradient of the error with respect to $w_{ij}$ is
$\frac{\partial E}{\partial w_{ij}} = z_i \sum_k \delta_k w_{jk}\, z_j (1 - z_j)$,
which leads to the update rule:
$w_{ij}^{(t+1)} = w_{ij}^{(t)} - \eta\, z_i \sum_k \delta_k w_{jk}\, z_j (1 - z_j)$.
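A sketch of the corresponding hidden-layer step, assuming the deltas delta_k of the following neurons and the weights w_jk leaving neuron j are already available; the argument names and the learning rate are illustrative:

import numpy as np

def hidden_weight_update(w_ij, z_i, z_j, w_jk, delta_k, eta=0.1):
    """One gradient-descent step for the weights feeding hidden neuron j.

    w_ij:    weights into neuron j (one per input i)
    z_i:     inputs to neuron j (one per input i, including the bias)
    z_j:     output of neuron j, z_j = sigma(a_j)
    w_jk:    weights from neuron j to each following neuron k
    delta_k: deltas of the following neurons k
    """
    delta_j = np.dot(delta_k, w_jk) * z_j * (1.0 - z_j)  # backpropagated delta
    grad = delta_j * z_i                                  # dE/dw_ij = delta_j * z_i
    return w_ij - eta * grad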