MATH49111/69111
Project 2a: Neural Networks
1 Introduction
The idea that computers may be able to think or exhibit intelligence dates back to the dawn of the computing era. Alan Turing argued in his seminal 1950 paper Computing Machinery and Intelligence that digital computers should be capable of thinking, and introduced the famous ‘Turing test’ to define what was meant by ‘thinking’ in this context: the computer should be able to emulate a human mind. He also suggested that a computer exhibiting human intelligence could not be programmed in the usual way (where the programmer has ‘a clear mental picture of the state of the machine at each moment in the computation’, as he put it) but would instead learn like a human, drawing inferences from examples. Turing had previously identified that learning of this sort could occur in what he called an unorganised machine: a computational model of a brain consisting of many simple logical units connected together in a network to form a large and complex system.
Although a computer emulation of a complete human mind is still some way off, Turing’s idea that a network could be taught and learn was prescient. Models of this sort are now known as Artificial Neural Networks (ANNs), and the process by which they learn from examples is known as supervised learning, part of the large field of machine learning. In the last decade or so, ANNs have developed to become the best algorithms available for a range of ‘hard’ problems, including image recognition and classification, machine translation, text-to-speech, and playing of complex games such as Go.
In this project we will develop a simple neural network and train it to classify data into one of two classes, a so-called binary classification problem. (This problem is discussed further by Higham and Higham (2019) whose notation we use in this project.) Suppose we have a set of points in R2 , as shown in Fig. 1, acquired from some real-world measurements, with each point being of one of two classes, either a blue cross or a red circle:
Figure 1: A set of data points in R2 , each of which is of one of two types .
This type of data could arise in many applications, for example the success or failure of an industrial chemical process as a function of ambient temperature and humidity, or whether a customer has missed a loan payment as a function of their income and credit score.
Given a new point x ∈ R2 , we’d like to be able to predict from the existing data which of the two classes this new point is likely to be in. Since our example data shows a clear pattern (red circles tend to lie at larger values of x1 than blue crosses), we could sketch a curve by eye (as in Fig. 2) that divides the plane into two, with the red circles on one side and the blue crosses on the other.
Figure 2: The plane is divided by a curve, with points on each side classified differently. Given a new point on the plane, we can now estimate which class it is in by seeing which side of the line it lies.
We could then classify any new point x ∈ R2 by seeing on which side of the curve it lies. Our neural network will perform this classification for us by (1) learning from existing data where the boundary between the two classes of data point lies, and (2) allowing us to evaluate the likely class of a new data point by using the boundary found in step (1). Specifically, the network defines a function F (x), where the curve dividing our two types of point is the zero contour F (x) = 0. Regions where the network outputs a positive value F (x) > 0 correspond to one class, and regions where it is negative to the other class.
Although this two-dimensional problem is relatively simple, the neural network developed extends naturally to much more difficult classification problems in higher dimensions (where x has hundreds or thousands of components), with more than two classes of points, and where the boundaries between different regions are not as clear as in our example above. One example is character recognition, as used by banks and postal services to automatically read hand-written cheques and postcodes. In this case, a segmentation algorithm is applied to split the scanned text into lots of small images each containing a single digit. Suppose each such small image has 25 × 25 = 625 greyscale pixels; the data in it can be represented by a vector x ∈ R625 . A neural network of exactly the sort we develop can be used to classify a handwritten character encoded in the vector x into one of 36 classes (one for each of the characters 0–9 and A–Z) with very good accuracy.
2 Neural Networks: Theory
2.1 Overall setup
An artificial neural network consists of a set of neurons. A neuron is a simple scalar functional unit with a single input and a single output, modelled by an nonlinear activation function σ : R → R. In the feed-forward networks we will study, these neurons are arranged in a series of layers, with the outputs of neurons in one layer connected to the inputs of neurons in the next, as shown in Fig. 3. The networks in this project will also be fully connected, meaning that nodes in two adjacent layers and the connections between them form a complete bipartite graph (every neuron in one layer is connected to every neuron in the next).
Input layer Layer 2 Layer 3 Output layer
Figure 3: Fully-connected network topology with two input neurons, two hidden layers of four neurons each, and one output neuron.
The inputs to the neural network are specified at the first layer; in our classification problem, this consists of the two components of the input vector x = (x1 , x2 ). The output of the network is the output of the final layer of neurons; in our example this is just a single neuron, outputting a real number that we hope is close to either +1 or −1, depending on the class of the point. In between are one or more hidden layers. The network as a whole defines a nonlinear function, in this case R2 → R (two inputs, one output).
As noted above, there are two stages to using a neural network.
1. The network is trained by specifying both the input to the network and the desired (target) output. Through an iterative procedure (“learning”), parameters relating to the connections between neurons (known as the weights and biases, defined below) are modified so that the network reproduces the target output for a wide range of input data. The overall structure of the network (the number of layers, and the number of neurons in each layer) is not changed in the training.
2. The weights and biases are then fixed, and the output of the network is evaluated for arbitrary new input data.
We will consider initially the second of these steps, assuming that the network has been pre-trained and that the weights and biases are known.
2.2 Evaluating the neural net output
Since each neuron is a simple scalar function (R → R), the multiple inputs into a neuron from neurons in the previous layer must be combined into a single scalar input. To describe how this is done we first establish some notation. Let L represent the total number of layers in the network and nl represent the number of neurons at layer l, for l = 1, 2, . . . , L. The input to the jth neuron at layer l is denoted by zj[l] and the output from the jth neuron at layer l by aj[l] . The statement that each neuron is modelled by an activation function σ can then be written algebraically as
for l = 2, 3, . . . , L, j = 1, . . . , nl , (1)
where the subscript. on the activation function implies that different layers may use different activation functions (but all the neurons in a layer use the same one). The statement that the inputs to the neural network are specified at the first layer can be written as
= xj , j = 1, . . . , n1 .
Note that the input to the network xj determines the outputs aj([1]) of the neurons in the
first layer rather than their inputs, so (1) is not evaluated for the first layer l = 1. The output of the network is the output of the final layer of neurons, aj[L] . Consequently, the network has n1 (scalar) inputs and nL (scalar) outputs. The notation can be simplified by defining the neuron inputs and outputs as vectors z[l] , a[l] ∈ Rnl , with components zj[l] and aj[l] respectively. Then, defining the activation function to act componentwise, (1) becomes
a[l] = σl (z[l]) , for l = 2, 3, . . . , L. (3)
The activation function is typically a nonlinear monotonically increasing function, but may have one of a number of different functional forms. One common choice is the tanh function,
σ(z) = tanh(z). (4)
While (3) and (4) together define the behaviour of the individual neurons, the more sig- nificant part of a neural network is the set of connections between neurons, which determine how the input zj[l] (of a neuron j in layer l) is determined from the outputs of the neurons in the previous layer. Specifically, a neuron input zj[l] is a biased linear combination of the outputs of the neurons in layer l − 1,
for l = 2, 3, . . . , L, j = 1, . . . , nl , (5)
where wj[lk] are the weights and bj[l] are the biases at layer l. These weights and biases are adjusted while the network is learning from the training data, but thereafter remain fixed. Defining b [l] ∈ Rnl as the vector of biases and W [l] ∈ Rnl × Rnl − 1 as the matrix of weights at layer l, with components bj[l] and wj[lk] respectively, we can write (5) as
z[l] = W [l]a[l−1] + b[l] , for l = 2, 3, . . . , L. (6)
Combining this with (3) and (2) we find
a[1] = x, (7)
a[l] = σl (z[l]) = σl (W [l]a[l−1] + b[l]) for l = 2, 3, . . . , L. (8)
Equations (7) and (8) together describe the feed-forward algorithm for evaluating the output of a neural network given an input vector x (and known weights W [l] and biases b[l]). First (7) is used to define a[1] , then (8) is used for l = 2, 3, . . . , L to obtain the neuron outputs a[2] , a[3] , . . . , a[L] in turn. The last of these, a[L], is the output of the neural network.
2.3 Training the neural network
2.3.1 Gradient descent
The feed-forward algorithm outlined in the previous section allows the output of the neural network to be evaluated from its inputs, provided that the weights and biases at each layer of the network are known. These parameters are determined by training the neural network against measurements or other data for which both the input and target output of the network are known.
This training data consists of a set of N inputs to the network, x{i}, with N corresponding target outputs, denoted y{i} (i = 1, . . . , N). We can formalise the idea of how closely our network produces the target outputs y{i} (when given the corresponding inputs x{i}) by defining a total cost function,
where ‘total cost’ refers to this cost being the sum of costs over all the N training data, and
the squared L2 norm Ⅱ · Ⅱ2(2) of a vector is the sum of its components squared,
ⅡxⅡ2(2) = x1(2) + x2(2) + · · · + xn(2) . (10)
In the cost function (9), for each training datum i the norm is taken of the difference between the target network output y{i} and the actual output of the neural network a[L](x{i}), evaluated from the corresponding network input x{i} by using the feed-forward algorithm. If the network reproduces exactly the target output for every piece of input data, the cost function is equal to its minimum possible value, zero.
For a given set of training data (x{i} , y{i}) (i = 1, …, N) , the cost C is a function of the parameters of the neural network, namely all the weights and biases for all of the layers. We could write this dependence explicitly by using (8) recursively to express a[L](x{i}) in terms of a[1] = x{i} and the weights and biases of layers 2 to L,
The network sketched above with nl = (2, 4, 4, 1) has in total 9 biases (one for each hidden and output neuron) and 28 weights (one for each connection between a pair of neurons), making the cost C dependent on 37 parameters in total. For ease of notation we denote by p ∈ RM the vector containing the weights and biases at each layer that parameterise the network, where M is the number of such parameters (in this case M = 37). We can then write C = C(p) without needing to refer to the weights and biases explicitly.
The process of training the network is equivalent to minimising the difference between the desired (target) and actual neural network outputs for the training data, or in other words, choosing p (and thus the weights and biases at each layer) so as to minimise the cost C(p), which is a nonlinear function of p.
We will solve this problem using an iterative method based on the classical technique of steepest descent. Suppose we have a current vector of parameters p and initial cost C(p), and wish to generate a new vector of parameters p + δp such that the new cost C(p + δp) is minimised. What value should we choose for δp? For small δp we can use a Taylor series to expand the new cost
where pr is the rth component of the parameter vector p, and terms of order (δp)2 and higher have been neglected. Equivalently, we can write this in vector form,
C(p + δp) ≈ C(p) + (∇C(p)) · δp, (13)
where the gradient operator ∇ acts over each of the components of p (i.e. over each weight and bias in the network),
To minimise the new weight C(p + δp) we wish to make the final term of (13) as negative as possible. The Cauchy-Schwartz inequality states that
| (∇C(p)) · δp| ≤ ∥∇C(p)∥2 ∥δp∥2 , (15)
with equality only when ∇C(p) lies in the same direction as (is a multiple of) δp. This suggests that to minimise C(p + δp) we should choose δp in the direction of −∇C(p), and so we take
δp = −η∇C(p), (16)
where the positive constant η is the learning rate. The training of the neural network therefore starts by choosing some initial parameters p for the network (we will choose these randomly), and repeatedly performing the iteration
p ← p + δp = p − η∇C(p). (17)
We would like this iteration to approach the global minimum of C(p) within a reasonable number of steps. Unfortunately, minimisation of a nonlinear function in high dimensions (recall that even in our simple network p has 37 components) is a fundamentally difficult problem. For sufficiently small η, the Taylor series approximation (13) will be valid and each iteration of (17) will decrease the cost function. However, for this to be true in general, η has be arbitrarily small. In practice, setting η to a very small value means that that p changes only very slightly at each iteration, and a very large number of steps is needed to converge to a minimum. On the other hand, setting η too large means that the approximation (13) is rarely valid and the choice of step (16) will not decrease the cost function as desired. Choosing an appropriate learning rate η is a balance between these two considerations.
Even if the iteration converges, it may converge to a local minimum of C, where ▽C = 0, but where C is nonetheless much larger than the global minimum cost minp (C(p)). Finding the global minimum of an arbitrary nonlinear function in a high-dimensional space is near impossible. We therefore abandon the goal of finding the global minimum of C(p), and instead simply look for a value of p that has a cost C(p) less than some small threshold, τ , say.
2.3.2 Stochastic gradient descent
Each step of the steepest descent iteration (17) requires an evaluation of ▽C(p) which contains the derivatives of the cost function with respect to each of the weights and biases in the network. Recalling the definition of the cost function (9), we write
▽C (18)
where the contribution to the cost from each of the training data points i is
ⅡyⅡ (19)
Evaluation of the gradient of the cost (18) can be computationally expensive, since the number of points N in the training data set may be large (many tens of thousands), and the number of elements in ▽C (the number of parameters in the weights and biases) may be several million in a large neural network. This means that it takes a long time to calculate each iteration of the steepest descent method (17).
A faster alternative is to instead calculate the increment δp at each iteration from the gradient of cost associated with a single randomly chosen training point,
p ← p + δp = p − η▽Cx{i} (p), (20)
a technique called stochastic gradient descent, or simply stochastic gradient. In stochastic gradient the reduction in the total cost function at (9) is likely to be smaller than in the steepest descent method (17) – in fact many steps are likely to increase the total cost slightly – but since many more iterations can be performed in a given time, the convergence is often quicker.
There are several ways of choosing the training data point i to be used at each step, but we will use the simplest: at each step randomly select (with replacement) a training data point i independently of previous selections.
2.3.3 Evaluating ▽Cx{i} (p): Back-propagation
To use the stochastic gradient algorithm (20) we must evaluate ▽Cx{i} (p), the derivative of the cost function associated with a single training data point with respect to each of the
network weights and biases,
for l = 2, . . . , L, (21)
for l = 2, . . . , L. (22)
The dependence of the cost Cx{i} on the weights and biases seems rather complicated, so an obvious temptation is to evaluate the derivatives by finite-differencing. This is possible, easy to implement and highly recommended for validation/debugging purposes (see below). However the approach is also extremely expensive and not viable for practical computations with big neural networks.
It turns out that the recursive nature of the network helps to simplify the analytical (and efficient) calculation of the required derivatives through an algorithm called back- propagation. Details can be found in the paper by Higham and Higham (2019). The key quantities required are known (for somewhat obscure reasons) as the errors, defined as
for 1 ≤ j ≤ nl
and 2 ≤ l ≤ L. (23)
They represents the derivative of the cost for our chosen training point i with respect to the input z to the jth neuron at layer l.
Higham and Higham (2019) show that the errors can be computed by moving backwards through the network, starting with the output layer, l = L, for which
δ [L] = σL(′)(z[L]) ◦ (a[L] — y{i}) , (24)
where the operator ◦ is the componentwise (Hadamard) product, defined by
u ◦ v = (u1 v1 , u2 v2, . . . , un vn ) for u, v ∈ Rn. (25)
The Hadamard product can be avoided by writing equation (24) using a product with a diagonal matrix,
Having established the errors for the output layer (l = L), we then move backwards through the network using the recurrence
δ [l] = σl(′) (z[l]) ◦ (W[l+1])T δ [l+1] for l = L — 1 . . . , 2, (27)
which can again be written as
δ [l] = (I σl(′) (z1[l]) σl(′) (z2[l]) . . . for l = L — 1, . . . , 2.
σl(′) (zn[ll] ) ,
Note how this produces the errors in layer l from those already computed in layer l + 1.
The real punchline is that the derivatives (21) and (22) can be expressed in terms of the
the errors as
for |
l = 2, . . . , L, (29) |
|
for l = 2, . . . , L. (30) |
This is not only very neat (much neater than one may have expected, given the complicated structure of the network) but also much more efficient than the evaluation of the derivatives by finite-differencing.
Reviews
There are no reviews yet.