[SOLVED] R C algorithm deep learning math scala database graph statistic network CHAPTER 23 Linear Regression

$25

File Name: R_C_algorithm_deep_learning_math_scala_database_graph_statistic_network_CHAPTER_23_Linear_Regression.zip
File Size: 942 KB

5/5 - (1 vote)

CHAPTER 23 Linear Regression
Given a set of attributes or variables X1,X2, ,Xd, called the predictor, explanatory, or independent variables, and given a realvalued attribute of interest Y, called the response or dependent variable, the aim of regression is to predict the response variable based on the independent variables. That is, the goal is to learn a regression function f , such that
Yf X1 , X2 ,, Xd f X
where XX1 , X2 ,, Xd T is the multivariate random variable comprising the predictor attributes, andis a random error term that is assumed to be independent of X. In other words, Y is comprised of two components, one dependent on the observed predictor attributes, and the other, coming from the error term, independent of the predictor attributes. The error term encapsulates inherent uncertainty in Y, as well as, possibly the effect of unobserved, hidden or latent variables.
In this chapter we discuss linear regression, where the regression function f is assumed to be a linear function of the parameters of the model. We also discuss regularized linear regression models considering both L2 ridge regression and L1 Lasso regularization. Finally, we use the kernel trick to perform kernel ridge regression that can handle nonlinear models.
23.1 LINEAR REGRESSION MODEL
In linear regression the function f is assumed to be linear in its parameters, that is
d i1
Here, the parameteris the true unknown bias term, the parameter i is the true unknownregressioncoefficientorweightforattributeXi,and1,2,,dT is the true ddimensional weight vector. Observe that f specifies a hyperplane in Rd1, whereis the the weight vector that is normal or orthogonal to the hyperplane, and
fX1X1 2X2 dXd
iXi TX 23.1
589

590 Linear Regression
is the intercept or offset term see Section 6.1. We can see that f is completely specified by the d1 parameters comprisingand i , for i1,, d .
The true bias and regression coefficients are unknown. Therefore, we have to estimate them from the training dataset D comprising n points xiRd in a ddimensional space, and the corresponding response values yiR, for i1,2, ,n. Let b denote the estimated value for the true bias , and let wi denote the estimated value for the true regression coefficient i , for i1, 2,, d . Let ww1 , w2 ,, wd T denote the vector of estimated weights. Given the estimated bias and weight values, we can predict the response for any given input or test point xx1,x2, ,xdT, as follows
y bw1x1 wdxd bwTx 23.2
Due to the random error term, the predicted value y will not in general match the observed response y for the given input x. This is true even for the training data. The difference between the observed and predicted response, called the residual error, is given as
oyyybw T x2 3 . 3
Note that the residual error o is different from the random statistical error , which measures the difference between the observed and the unknown true response. The residual error o is an estimator of the random error term .
A common approach to predicting the bias and regression coefficients is to use the method of least squares. That is, given the training data D with points xi and response values yi for i1, ,n, we seek values b and w, so as to minimize the sum of squared residual errors SSE
nn2n 2
SSE oi2yi yiyi bwTxi 23.4
i1 i1 i1
In the following sections, we will estimate the unknown parameters, by first considering the case of a single predictor variable, and then looking at the general case of multiple predictors.
23.2 BIVARIATE REGRESSION
Let us first consider the case where the input data D comprises a single predictor attribute, Xx1,x2, ,xnT, along with the response variable, Yy1,y2, ,ynT. Since f is linear, we have
yi fxibwxi 23.5
Thus, we seek the straight line fx with slope w and intercept b that best fits the data. The residual error, which is the difference between the predicted value also

Bivariate Regression 591
called fitted value and the observed value of the response variable, is given as
o iy iy i2 3 . 6
Note that oidenotes the vertical distance between the fitted and observed response. The best fitting line minimizes the sum of squared errors
n n n
minSSE oi2yi yi2yi bwxi2 23.7
b,w
i1 i1 i1
To solve this objective, we differentiate it with respect to b and set the result to 0,
to obtain
n
bSSE2 yi bwxi0
i1
n n n
b yiw xi i1 i1 i1
1 n
bn
i1
1 n
y iwn bY wX
x i
Therefore, we have
i1
23.8 where Y is the sample mean for the response and X is the sample mean for the
predictor attribute. Similarly, differentiating with respect to w, we obtainn
wSSE2 xiyi bwxi0 i1
n n n
x iy ib x iw x i20
i1 i1 i1 substituting b from Eq. 23.8, we have
n n n n
x iy i Y x iw X x iw x i20
i1 i1 i1 i1
2 3 . 9
n nw xi2 n2Xxi yi
nX Y
23.10
i1 i1 ni1 xi yinX Y
w n x2n2 i1 i X
The regression coefficient w can also be written as
ni1xi Xyi Y XY covX,Y w ni1xiX2X2varX
23.11

592 Linear Regression where X2 is the variance of X and XY is the covariance between X and Y. Noting that
the correlation between X and Y is given as XYXY , we can also write w as X Y
w Y XY X
23.12
Observe that the fitted line must pass through the mean value of Y and X; plugging in the optimal value of b from Eq. 23.8 into the regression equation Eq. 23.5, we have
yi bwxi Y wX wxi Y wxi X
That is, when xiX, we have yiY. Thus, the point X,Y lies on the regression
line.
Example23.1BivariateRegression. Figure23.1showsthescatterplotbetweenthe two attributes petal length X; the predictor variable and petal width Y; the response variable in the Iris dataset. There are a total of n150 data points. The mean values for these two variables are
X150
1 1 5 0
Y150
1 1 5 0
5 6 3 . 8
xi150 3.7587
1 7 9 . 8
yi150 1.1987
i1 i1
2.5 2.0 1.5 1.0 0.5

0 0.5

0 1.0
2.0
3.0
X: petal length
6.0
7.0

o35

x35

o9
x9

Figure 23.1. Scatterplot: petal length X versus petal width Y. Solid circle black shows the mean point; residual error is shown for two sample points: x9 and x35.
4.0 5.0
Y: petal width

Bivariate Regression 593
The variance and covariance is given as
1 1 5 0
X21 50
1 1 5 0
xiX23.0924
yi Y2 0.5785
Y21 50
1 1 5 0
i1
i1 i1
xi Xyi Y1.2877
Assuming a linear relationship between the response and predictor variables, we use
XY150
Eq. 23.8 and Eq. 23.10 to obtain the slope and intercept terms as follows
wXY1.28770.4164 X2 3.0924
bY wX 1.19870.41643.75870.3665 Thus, the fitted regression line is
y0.36650.4164x
Figure 23.1 plots the bestfitting or regression line; we can observe that the mean point X,Y3.759,1.199 lies on the line. The figure also shows the residual errors, o9 and o35, for the points x9 and x35, respectively.
Finally, we can compute the SSE value see Eq. 23.4 as follows:
150 150
SSEoi2 yi yi2 6.343
i1 i1
23.2.1 GeometryofBivariateRegression
We now turn to the attributecentric view, which provides important geometric insight for bivariate regression. Recall that we are given n equations in two unknowns, namely yibwxi, for i1, ,n. Let Xx1,x2, ,xnT be the ndimensional vector denoting the training data sample, Yy1 , y2 ,, yn T the sample vector for the response variable, and Yy1,y2, ,ynT the vector of predicted values, then we can expressthenequations,yi bwxi fori1,2,,n,asasinglevectorequation:
Yb1wX
2 3 . 1 3
where 1Rn is the ndimensional vector of all ones. This equation indicates that the predicted vector Y is a linear combination of 1 and X, i.e., it must lie in the column space spanned by 1 and X, given as span1,X. On the other hand, the response vector Y will not usually lie in the same column space. In fact, the residual error vector

594
Linear Regression
x1 Y
oYY
1
xn
ndimensional space spanned by the n data points. The plane illustrates the subspace spanned by 1 and X.
Y
Figure 23.2. Geometry of bivariate regression: nonorthogonal basis. All the vectors conceptually lie in the
X
x2
X
X
1

X 1
Figure 23.3. Orthogonal decomposition of X into X and X1.
oo1,o2,,onT capturesthedeviationbetweentheresponseandpredictedvectors
oYY
The geometry of the problem, shown in Figure 23.2, makes it clear that the optimal Y that minimizes the error is the orthogonal projection of Y onto the subspace spanned by 1 and X. The residual error vector o is thus orthogonal to the subspace spanned by 1 and X, and its squared length or magnitude equals the SSE value see Eq. 23.4, since
n n o2 YY2yi yi2
i1 i1
oi2 SSE
At this point it is worth noting that even though 1 and X are linearly independent and form a basis for the column space, they need not be orthogonal see Figure 23.2. We can create an orthogonal basis by decomposing X into a component along 1 and a component orthogonal to 1 as shown in Figure 23.3. Recall that the scalar projection

XX 1

Bivariate Regression
595
x1
Y
oYY
1
Y xn
X
x2
Figure 23.4. Geometry of bivariate regression: orthogonal basis. XXX1 is the centered attribute vector. The plane illustrates the subspace spanned by the orthogonal vectors 1 andX.
of a vector b onto vector a see Eq. 1.12 is given as projabbTa
aTa
and the orthogonal projection of b on a see Eq. 1.11 is given as
projababTaa aTa
where XXX1 is the centered attribute vector, obtained by subtracting the mean X from all points.
The two vectors 1 and X form an orthogonal basis for the subspace. We can thus obtain the predicted vector Y by projecting Y onto 1 andX, and summing up these two components, as shown in Figure 23.4. That is,
Yproj Y1proj YXYT11YTXX 1YTXX 23.14 1 X 1T1 XTX Y XTX
On the other hand, from Eq. 23.13, we know that
Yb1wXb1w X 1X bwX1wX 23.15
Since both Eq. 23.14 and Eq. 23.15 are expressions for Y, we can equate them to obtain
bw or b w w YTX YXYX XTX
Now, consider the projection of X onto 1; we have
p r o j 1X 1 X T 1 1ni1 x i 1 X1
Thus, we can rewrite X as
1T1 n
XX 1XX 1X 1X
Y

w

596 Linear Regression where the bias term bYwX matches Eq. 23.8, and the weight w also matches
Eq. 23.10, since
Y T X Y T X Y TX X1ni1 y i x i n X Y
wXTXX2XX12ni1xi2n2X
Example 23.2 Geometry of Regression. Let us consider the regression of petal length X on petal width Y for the Iris dataset, with n150. First, we center X by subtracting the mean X3.759. Next, we compute the scalar projections of Y onto 1 andX, to obtain
Yproj1YYT1179.81.1987 1T1 150
wprojXYYTX 193.16 0.4164 T
Thus, the bias term b is given as
bY wX 1.19870.41643.75870.3665
These values for b and w match those in Example 23.1. Finally, we can compute the SSE value see Eq. 23.4 as the squared length of the residual error vector
SSEo2 YY2 YYTYY6.343
X X 463.86
23.3 MULTIPLE REGRESSION
We now consider the more general case called multiple regression1 where we have multiple predictor attributes X1,X2, ,Xd and a single response attribute Y. The training data sample DRnd comprises n points xixi1,xi2, ,xid T in a ddimensional space, along with the corresponding observed response value yi. The vector Yy1 , y2 ,, yn T denotes the observed response vector. The predicted response value for input xi is given as
yi bw1xi1 w2xi2 wdxid bwTxi 23.16
whereww1,w2,,wdT istheweightvectorcomprisingtheregressioncoefficients or weights wj along each attribute Xj . Eq. 23.16 defines a hyperplane in Rd1 with bias term b and normal vector w.
Instead of dealing with the bias b separately from the weights wi for each attribute, we can introduce a new constant valued attribute X0 whose value is always fixed at 1, so that each input point xixi1,xi2, ,xid TRd is mapped to an augmented
1 We follow the usual terminology and reserve the term multivariate regression for the case when there are multipleresponseattributesY1,Y2,,Yq andmultiplepredictorattributesX1,X2,,Xd.

Multiple Regression 597
point x ixi0,xi1,xi2, ,xid TRd1, where xi01. Likewise, the weight vector ww1,w2, ,wdT is mapped to an augmented weight vector w w0,w1,w2, ,wdT, where w0b. The predicted response value for an augmented d1 dimensional point x i can be written as
yi w0xi0 w1xi1 w2xi2 wdxid w Tx i 23.17
Since there are n points, in fact we have n such equations, one per point, and there are d1 unknowns, namely the elements of the augmented weight vector w. We can compactly write all these n equations as a single matrix equation, given as
YDw23.18 nd1
where DR is the augmented data matrix, which includes the constant attribute X0 inadditiontothepredictorattributesX1,X2,,Xd,andYy1,y2,,ynT isthe vector of predicted responses.
The multiple regression task can now be stated as finding the best fitting hyperplane defined by the weight vector wthat minimizes the sum of squared errors
2 3 . 1 9
n wi1
minSSE
oi2 o2 YY2
YYTYYYTY2YT YYT YY T Y2 Y TDw Dw TDw
Y T Y2 wTDT Y wTDT D w
where we substituted YDwfrom Eq. 23.18, and we used the fact that YT Dw Dw T YwTDT Y.
To solve the objective, we differentiate the expression in Eq. 23.19 with respect to wand set the result to 0, to obtain
SSE2D TY2D TD w 0w DT D w DT Y
Therefore, the optimal weight vector is given as
wDT D1 DT Y
Substituting the optimal value of winto Eq. 23.18, we get YD wD D TD 1D TYHY
2 3 . 2 0
where HDDTD1DT is called the hat matrix, since it puts the hat on Y! Notice also that DTDis the uncentered scatter matrix for the training data.

598
Linear Regression
X1
Y
150.0 876.50 563.80DTD 876.5 5223.85 3484.25
563.8 3484.25 2583.00 We also compute DTY, given as
0.793 0.176 0.064 DTD10.176 0.041 0.017

Figure23.5. Multipleregression:sepallengthX1andpetallengthX2withresponseattributepetal width Y. The vertical bars show the residual error for the points. Points in white are above the plane, whereas points in gray are below the plane.
Example 23.3 Multiple Regression. Figure 23.5 shows the multiple regression of sepal length X1and petal length X2on the response attribute petal width Y for the Iris dataset with n150 points. We first add an extra attribute X01150, which is a vector of all ones in R150. The augmented dataset D R1503 comprises n150 points along three attributes X0, X1, and X2.
Next,wecomputetheuncentered33scattermatrixD TDanditsinverse
179.80DTY1127.65
868.97 The augmented weight vector wis then given as
w2 0.45 The bias term is therefore bw00.014, and the fitted model is
Y0.0140.082X10.45X2
Figure 23.5 shows the fitted hyperplane. It also shows the residual error for each point. The white colored points have positive residuals i.e., oi0 or y iyi , whereas
w0
0.014 ww 1DT D1 DT Y 0 . 0 8 2

X2

0.064 0.017 0.009

Multiple Regression 599
the gray points have negative residual values i.e., oi0 or yiy. The SSE value for the model is 6.18.
23.3.1 Geometry of Multiple Regression
Let Dbe the augmented data matrix comprising the d independent attributes Xi , along
with the new constant attribute X01Rn, given as
D X0 X1 X2Xd

Let w w0 , w1 ,, wd TRd 1 be the augmented weight vector that incorporates
the bias term bw0. Recall that the predicted response vector is given as
d i0
This equation makes it clear that the predicted vector must lie in the column space of the augmented data matrix D, denoted colD, i.e., it must be a linear combination of the attribute vectors Xi, i0, ,d.
To minimize the error in prediction, Y must be the orthogonal projection of Y onto the subspace colD. The residual error vector oYY is thus orthogonal to the subspace colD, which means that it is orthogonal to each attribute vector Xi . That is,
X Ti o0
X Ti YY 0
X Ti YX Ti Y
X TiDwX Ti Y
w0 XTi X0 w1XTi X1 wd XTi Xd XTi Y
We thus have d1 equations, called the normal equations, in d1 unknowns, namely the regression coefficients or weights wi including the bias term w0. The solution to these simultaneous equations yields the weight vector w w0 , w1 ,, wd T . The d1 normal equations are
Yb1w 1X 1w 2X 2w dX d
w iX iDw
w0XT0X0w1XT0X1wd XT0Xd XT0Y w0XT1X0w1XT1X1wd XT1Xd XT1Y
.. w0XTdX0w1XTdX1wd XTdXd XTdY
23.21

600 Linear Regression which can be written compactly in matrix notation to solve for was follows
X T0 X 0 X T1 X 0. . .
X Td X 0
X T0 X 1 X T1 X 1 . . .
X Td X 1

X T0 X d
X T1 X d T
. . .w D Y
X Td X d
DT D w DT Y
w DTD1DTY
More insight can be obtained by noting that the attribute vectors comprising the
This matches the expression in Eq. 23.20.
column space of D are not necessarily orthogonal, even if we assume they are linearly

independent. To obtain the projected vector Y , we first need to construct an orthogonal
basis for colD.
Let U0,U1, ,Ud denote the set of orthogonal basis vectors for colD. We can
construct these vectors in a stepwise manner via GramSchmidt orthogonalization, as follows
where
U0X0
U1X1p10U0
U2 X2 p20 U0 p21 U1
..
Ud Xd pd0 U0 pd1 U1 pd,d1 Ud1
p j ip r o j UX j X jT U i i Ui2
23.22
denotes the scalar projection of attribute Xj onto the basis vector Ui. Essentially, to obtain Uj , we subtract from vector Xj its scalar projections along all previous basis vectors U0,U1, ,Uj1.
Rearranging the equations above, so that Xj is on the left hand side, we get
X0U0
X1p10U0U1
X2 p20 U0 p21 U1 U2
..
Xd pd0 U0 pd1 U1 pd,d1 Ud1 Ud
The GramSchmidt method thus results in the socalled QRfactorization2 of the data matrix, namely D QR, where by construction Q is a nd1 matrix withi
2 In QRfactorization, the matrix Q is orthogonal, with orthonormal columns, i.e., with orthogonal columns that are normalized to be unit length. However, we keep the basis vectors unnormalized for ease of presentation.

Multiple Regression 601 orthogonal columns
Q U0 U1 U2Ud
and R is the d1d1 uppertriangular matrix
1 p 1 0 p 2 0p d 00 1 p21pd1
R0 0 1pd2. . . .
0 0 0 1 p d , d100001
So that, in the column view the QRfactorization of the augmented data matrix is given
as:
0 1 p21pd1
1 p 1 0 p 2 0p d 0X X X XU U U U0 01pd2
0 1 2 d 0 1 2 d . . . . 0 0 0 1 pd,d1
DQ 00001

R
SincethenewbasisvectorsU0,U1,,Ud formanorthogonalbasisforthecolumn space of D, we can obtain the predicted response vector as a sum of the projections of Y along each new basis vector, given as
Yp r o j U 0Y U 0p r o j U 1Y U 1p r o j U dY U d2 3 . 2 3
Bias Term The geometric approach makes it easy to derive an expression for the bias term b. Note that each of the predictor attributes can be centered by removing its projection along the vector 1. Define Xi to be the centered attribute vector
XiXiXi1
All the centered vectors Xi lie in the n1 dimensional space, comprising the orthogonal complement of the span of 1.
From the expression of Y, we have
Yb1w 1X 1w 2X 2w dX d
b1w1X1 X1 1 wdXd Xd 1
bw1X1 wd Xd1w1X1wd Xd
23.24
On the other hand, since 1 is orthogonal to all Xi , we can obtain another expression for Y in terms of the projection of Y onto the subspace spanned by the vectors

602 Linear Regression
1,X1, ,Xd. Let the new orthogonal basis for these centered attribute vectors be U0,U1,,Ud,whereU0 1.Thus,Y canalsobewrittenas
d i1
d i1
YprojU0YU0
In particular, equating the scalar projections along 1 in Eq. 23.24 and Eq. 23.25, we
projUiYUi proj1Y1 proj1YY bw1X1 wd Xd, whichimplies
projUiYUi 23.25
get:
bY w1 X1 wd Xd Y
where we use the fact that the scalar projection of any attribute vector onto 1 yields
the mean value of that attribute. For example,
YT1 1n
proj1Y 1T1n yi Y i1
23.3.2 Multiple Regression Algorithm
The pseudocode for multiple regression is shown in Algorithm 23.1. The algorithm starts by the QRdecomposition of D QR, where Q is a matrix with orthogonal columns that make up an orthogonal basis, and R is an upper triangular matrix, which can be obtained via GramSchmidt orthogonalization. Note that, by construction, the matrix QTQ is given as
U02 00 QTQ 0 U120
0 0 0 0 0 Ud2
d i1
wi Xi 23.26
Algorithm 23.1: Multiple Regression Algorithm MULTIPLEREGRESSION D,Y:
1 D1 Daugmenteddatawith X0 1Rn
2 Q,RQRfactorizationDQU0 U1Ud
1 00 U0 2
0 10
3 1 U1 2 reciprocal squared norms
0 0 0 0 01
1 T Ud2
4 Rw Q Ysolve for w by backsubstitution
5 YQ1QTY

Multiple Regression 603
We can observe that the matrix QTQ, denoted , is a diagonal matrix that contains the squared norms of the new orthogonal basis vectors U0 , U1 ,, Ud .
Recall that the solution to multiple regression is given via the normal equations Eq. 23.21, which can be compactly written as DT ww DT Y Eq. 23.22; plugging in the QRdecomposition, we get
DT D w DT Y QRTQRw QRTY RTQTQRw RTQTY RTRw RTQTYR w Q T Y R w 1 Q T Y
Note that 1 is a diagonal matrix that records the reciprocal of the squared norms of the new basis vectors U0 , U1 ,, Ud . Furthermore, since R is uppertriangular, it is straightforward to solve for wby backsubstitution. Note also that we can obtain the predicted vector Y as follows
YDw QRR11QTYQ1QTY
It is interesting to note that 1QTY gives the vector of scalar projections of Y onto
each of the orthogonal basis vectors
projU0Y
1QTYprojU1Y.
projUd Y
Therefore, Q1QTY, yields the projection formula in Eq. 23.23
projU0Y
YQprojU1Yproj YU proj YU proj YU
.U0 0 U1 1 Ud d projUdY
Example 23.4 Multiple Regression: QRFactorization and Geometric Approach.
Consider the multiple regression of sepal length X1 and petal length X2 on the response attribute petal width Y for the Iris dataset with n150 points, as shown in Figure 23.5. The augmented dataset D R1503 comprises n150 points along three attributes X0, X1, and X2, where X01. The GramSchmidt orthogonalization results in the following QRfactorization:
1 5.843 3.759 X0 X1 X2U0 U1 U20 1 1.858
001
DQ R

604 Linear Regression Note that QR1503 and therefore we do not show the matrix. The matrix , which
records the squared norms of the basis vectors, and its inverse matrix, is given as
150 0 0 0.00667 0 00 102.17 010 0.00979 0
0 0 111.35
We can use backsubstitution to solve for w, as follows
R w 1 Q T Y
1 5.843 3.759 w01.1987 0 1 1.858 w1 0.7538 0 0 1 w2 0.4499
In backsubstitution, we start with w2, which is easy to compute from the equation above; it is simply
Next, w1 is given as:
w20.4499
w11.858w20.7538
w10.75380.83580.082 Finally, w0 can be computed as
w0 5.843w1 3.759w2 1.1987 w0 1.19870.47861.69110.0139
Thus, the multiple regression model is given as
Y 0 . 0 1 4X 00 . 0 8 2X 10 . 4 5X 2
2 3 . 2 7
which matches the model in Example 23.3.
It is also instructive to construct the new basis vectors U0,U1, ,Ud in terms of
theoriginalattributesX0,X1,,Xd.SinceDQR,wehaveQD R1.Theinverse of R is also uppertriangular, and is given as
1 5.843 7.095R10 1 1.858
001 Therefore, we can write Q in terms of the original attributes as
1 5.843 7.095U0 U1 U2X0 X1 X20 1 1.858
001
Q DR1

0 0 0.00898

Multiple Regression 605
which results in
U0X0
U1 5.843X0X1
U2 7.095X01.858X1X2
The scalar projection of the response vector Y onto each of the new basis vectors yields:
projU0Y1.199 projU1Y0.754 projU2Y0.45 Finally, the fitted response vector is given as:
YprojU0YU0projU1YU1projU2YU2
1.199X00.7545.843X0X10.457.095X01.858X1X2
1.1994.4063.193X00.7540.836X10.45X20.014X00.082X10.45X2
which matches Eq. 23.27.
23.3.3 Multiple Regression: Stochastic Gradient Descent
Instead of using the QRfactorization approach to exactly solve the multiple regression problem, we can also employ the simpler stochastic gradient algorithm. Consider the SSE objective given in Eq. 23.19 multiplied by 12:
minSSE 1YTY2w TD TYw TD TD ww2
The gradient of the SSE objective is given as
wSSED TYD TD w w
23.28
Using gradient descent, starting from an initial weight vector estimate w0, we can iteratively update was follows
w t1w tww tD TYD w t
where wt is the estimate of the weight vector at step t.
In stochastic gradient descent SGD, we update the weight vector by considering
only one random point at each time. Restricting Eq. 23.28 to a single point x k in the training data D, the gradient at the point x k is given as
wx kx kykx kx Tk w ykx Tk wx k

606
Linear Regression
Algorithm 23.2: Multiple Regression: Stochastic Gradient Descent MULTIPLE REGRESSION: SGD D,Y,,o:
1 2 3 4 5 6 7
8 9
D 1 Daugment data
t0stepiteration counter
wtrandom vector in Rd1initial weight vector repeat
foreach k1,2, ,n in random order do
w xk y kxTk wt xk c o m p u t e g r a d i e n t a t xk wt1 wt w x kupdateestimatefor w
tt1 untilwtwt1o
Therefore, the stochastic gradient update rule is given as wt1wt w xk
wt y kxTk wt xk
Algorithm 23.2 shows the pseudocode for the stochastic gradient descent algorithm for multiple regression. After augmenting the data matrix, in each iteration it updates the weight vector by considering the gradient at each point in random order. The method stops when the weight vector converges based on the tolerance o.
Example 23.5 Multiple Regression: SGD. We continue Example 23.4 for multiple regression of sepal length X1and petal length X2on the response attribute petal width Y for the Iris dataset with n150 points.
Using the exact approach the multiple regression model was given as Y 0 . 0 1 4X 00 . 0 8 2X 10 . 4 5X 2
Using stochastic gradient descent we obtain the following model with 0.001 and o0.0001:
Y 0 . 0 3 1X 00 . 0 7 8X 10 . 4 5X 2
The results from the SGD approach as essentially the same as the exact method, with a slight difference in the bias term. The SSE value for the exact method is 6.179, whereas for SGD it is 6.181.
23.4 RIDGE REGRESSION
We have seen that for linear regression, the predicted response vector Y lies in the span of the column vectors comprising the augmented data matrix D. However, often the data is noisy and uncertain, and therefore instead of fitting the model to the data

Ridge Regression 607
exactly, it may be better to fit a more robust model. One way to achieve this is via r e g u l a r i z a t i o n , w h e r e w e c o n s t r a i n t h e s o l u t i o n v e c t o r wt o h a v e a s m a l l n o r m . In o t h e r words, instead of trying to simply minimize the squared residual error YY 2 , we add a regularization term involving the squared norm of the weight vector w2 as follows:
min Jw YY2w 2 YD w 2w 2 23.29 w
Here 0 is a regularization constant that controls the tradeoff between minimizing the squared norm of the weight vector and the squared error. Recall that w2di1wi2 is the L2norm of w . For this reason ridge regression is also called L2 regularized regression. When 0, there is no regularization, but asincreases there
is more emphasis on minimizing the regression coefficients.
The solve the new regularized objective we differentiate Eq. 23.29 with respect
23.30 23.31
23.32
to wand set the result to 0 to obtain
Jw YD w 2w 20
w w
YTY2w TD TYw TD TD ww Tw 0
wT T
2D Y2D Dw2w0
D TDIwD TY Therefore, the optimal solution is
wD TDI1D TY
where IRd 1d 1 is the identity matrix. The matrix DT D I is always invertible or nonsingular for 0 even if DT Dis not invertible or singular. This is because ifi isaneigenvalueofD TD ,theni isaneigenvalueofD TDI.SinceD TDis positive semidefinite it has nonnegative eigenvalues. Thus, even if an eigenvalue of D TDiszero,e.g.,i0,thecorrespondingeigenvalueofD TD Iisi0.
Regularized regression is also called ridge regression since we add a ridge along the main diagonal of the DTDmatrix, i.e., the optimal solution depends on the regularized matrix DTD I. Another advantage of regularization is that if we choose a small positivewe are always guaranteed a solution.
Example 23.6 Ridge Regression. Figure 23.6 shows the scatterplot between the two attributes petal length X; the predictor variable and petal width Y; the response variable in the Iris dataset. There are a total of n150 data points. The uncentered scatter matrix is given as
DTD 150.0 563.8563.8 2583.0

608
Linear Regression
2.5 2.0 1.5 1.0 0.5
0 0.5

0 10
100

0 1.0
Figure 23.6. Scatterplot: petal length X versus petal width Y. Ridge regression lines for
0, 10, 100.
2.0
3.0
X: petal length
6.0
7.0
4.0 5.0
Using Eq. 23.32 we obtain different lines of best fit for different values of the regularization constant :
0 :Y0.3670.416X, w20.367, 0.416T20.308, SSE6.34 10 :Y0.2440.388X, w20.244, 0.388T20.210, SSE6.75 100 :Y0.0210.328X, w20.021, 0.328T20.108, SSE9.97
Figure 23.6 shows these regularized regression lines. We can see that asincreases there is more emphasis on minimizing the squared norm of w. However, since w2 is more constrained asincreases, the fit of the model decreases, as seen from the increase in SSE values.
Unpenalized Bias Term Often in L2 regularized regression we do not want to penalize the bias term w0, since it simply provides the intercept information, and there is no real justification to minimize it. To avoid penalizing the bias term, consider the new regularized objective with ww1,w2, ,wd T, and without w0, given as
min JwYw0 1Dw2 w2 23.33
w
d 2 dYw 01w iX i w i2
i1 i1
Y: petal width

Ridge Regression
609
Recall from Eq. 23.26 that the bias w0b is given as
d
w0 bYwi Xi Y Tw
i1
where X1 , X2 ,, Xd T is the multivariate mean of
Substituting w0 into the new L2 objective in Eq. 23.33, we get min JwYw0 1Dw2 w2
unaugmented D.
w
Therefore, we have
YY Tw1Dw2 w2 YY 1D1Tw2 w2
min JwYDw2w2 w
whereYYY1isthecenteredresponsevector,andDD1T isthecentered data matrix. In other words, we can exclude w0 from the L2 regularization objective by simply centering the response vector and the unaugmented data matrix.
23.34
Example 23.7 Ridge Regression: Unpenalized Bias. We continue from Exam ple 23.6. When we do not penalize w0, we obtain the following lines of best fit for different values of the regularization constant :
0:Y0.3650.416X 10:Y0.3330.408X 1 0 0 : Y 0 . 0 8 90 . 3 4 3X
w02 w12 0.307 w02 w12 0.277 w 02w 120 . 1 2 5
SSE6.34 SSE6.38 S S E8 . 8 7
From Example 23.6, we observe that for 10, when we penalize w0, we obtain the following model:
10:Y0.2440.388X w02 w12 0.210 SSE6.75 As expected, we obtain a higher bias term when we do not penalize w0.
23.4.1 Ridge Regression: Stochastic Gradient Descent
Instead of inverting the matrix DT D I as called for in the exact ridge regression solution in Eq. 23.32, we can employ the stochastic gradient descent algorithm. Consider the gradient of the regularized objective given in Eq. 23.30, multiplied by 12 for convenience; we get:
wJw D TYD TD w w23.35w

610
Linear Regression
Algorithm 23.3: Ridge Regression: Stochastic Gradient Descent RIDGE REGRESSION: SGD D,Y,,o:
1 2 3 4 5 6 7
8 9
D 1 Daugment data
t0stepiteration counter
wtrandom vector in Rd1initial weight vector repeat
foreach k1,2, ,n in random order do w x kykx Tkw tx k w computegradientatx k
n
wt1 wt w x kupdateestimatefor w
tt1 untilwtwt1o
Using batch gradient descent, we can iteratively compute was follows w t1w tw1w tD TYD w t
In stochastic gradient descent SGD, we update the weight vector by considering only one random point at each time. Restricting Eq. 23.35 to a single point x k, the gradient at the point x k is given as
wx kx kyk x kx Tk w wyk x Tk w x kw23.36 nn
Here, we scale the regularization constantby dividing it by n, the number of points in the training data, since the original ridge valueis for all the n points, whereas we are now considering only one point at a time. Therefore, the stochastic gradient update rule is given as
wt1wt w xk1wt y kxTk wt xk n
Algorithm 23.3 shows the pseudocode for the stochastic gradient descent algorithm for ridge regression. After augmenting the data matrix, in each iteration it updates the weight vector by considering the gradient at each point in random order. The method stops when the weight vector converges based on the tolerance o. The code is for the penalized bias case. It is easy to adapt it for unpenalized bias by centering the unaugmented data matrix and the response variable.
Example 23.8 Ridge Regression: SGD. We apply ridge regression on the Iris dataset n150, using petal length X as the independent attribute, and petal width Y as the response variable. Using SGD with 0.001 and o0.0001 we obtain different lines of best fit for different values of the regularization constant , which essentially match the results from the exact method in Example 23.6:
0 : Y 0 . 3 6 60 . 4 1 3X S S E6 . 3 7 10:Y0.2440.387X SSE6.76
1 0 0 : Y 0 . 0 2 20 . 3 2 7X S S E1 0 . 0 4

Kernel Regression 611
23.5 KERNEL REGRESSION
We now consider how to generalize linear regression to the nonlinear case, i.e., finding a nonlinear fit to the data to minimize the squared error, along with regularization. For this we will adopt the kernel trick, i.e., we will show that all relevant operations can be carried out via the kernel matrix in input space.
Letcorrespond to a mapping from the input space to the feature space, so that each point xi in feature space is a mapping for the input point xi. To avoid explicitly dealing with the bias term, we add the fixed value 1 as the first element ofxito obtain the augmented transformed point xi T1xi T . Let D denote the augmented dataset in feature space, comprising the transformed pointsxi for i1, 2,, n. The augmented kernel function in feature space is given as
K xi,xj xiT xj1xiTxj1Kxi,xj
where Kxi , xjis the standard, unaugmented kernel function.
Let Y denote the observed response vector. Following Eq. 23.18, we model the
predicted response as
YD w 2 3 . 3 7
where wis the augmented weight vector in feature space. The first element of wdenotes the bias in feature space.
For regularized regression, we have to solve the following objective in feature space:
min Jw YY2w 2YD w 2w 2 23.38 w
where 0 is the regularization constant.
Taking the derivative of Jw with respect to wand setting it to the zero vector, we get
Jw YD w 2w 20w w
YTY2w TD TYw TD TD ww Tw 0wT T
2DY2DDw2w0 w DT Y DT Dw
w DT 1YD w

n
w DT cc i x i
2 3 . 3 9wherecc1,c2,,cnT1YD w .Eq.23.39indicatesthattheweightvectorw
is a linear combination of the feature points, with c specifying the mixture coefficients for the points.
i1

612
Linear Regression
Rearranging the terms in the expression for c, we have c1YD w

cYD w
Now, plugging in the form of wfrom Eq. 23.39 we get cYDDT c
D DTIcY
c D DTI 1 Y
The optimal solution is therefore given as
cKI1Y
23.40
23.41 where IRnn is the nn identity matrix, and D DT is the augmented kernel matrix
K, since
D D TxiT xji,j1,2,,n K xi,xji,j1,2,,n K
Putting it all together, we can substitute Eq. 23.41 and Eq. 23.39 into Eq. 23.37 to obtain the expression for the predicted response
YD w D DT c
T1DD KI Y
1 K KI Y
23.42
where KK I1 is the kernel hat matrix. Notice that using 0 ensures that the inverse always exists, which is another advantage of using kernel ridge regression, in addition to the regularization.
Algorithm 23.4 shows the pseudocode for kernel regression. The main step is to compute the augmented kernel matrix K Rnn, and the vector of mixture coefficients cRn. The predicted response on the training data is then given as YKc. As for prediction for a new test point z, we use Eq. 23.39 to predict the response, which is given as:
nzT c i x i

y zT wzT DT c
i1
ci K z,xicTK z
n

i1
cizT xi
n i1

Kernel Regression 613
Algorithm 23.4: Kernel Regression Algorithm KERNELREGRESSION D,Y,K,:
1 KKxi,xj i,j1,,nstandard kernel matrix
2 K K1 augmented kernel matrix
3 cK I1 Ycompute mixture coefficients
4 YKc
TESTING z,D,K,c:
5 K z 1Kz,xixiD
6 ycTK z
That is, we compute the vector Kz comprising the augmented kernel values of z with respect to all of the data points in D, and take its dot product with the mixture coefficient vector c to obtain the predicted response.
Example23.9. ConsiderthenonlinearIrisdatasetshowninFigure23.7,obtainedvia a nonlinear transformation applied to the centered Iris data. In particular, the sepal length A1and sepal width attributes A2were transformed as follows:
XA2
Y0.2A21A20.1A1A2
We treat Y as the response variable and X is the independent attribute. The points show a clear quadratic nonlinear relationship between the two variables.
1.5
1.0
0.5
0
0.5
1.5 1.0 0.5
0 0.5 1.0 1.5 X

Figure 23.7. Kernel regression on nonlinear Iris dataset.
Y

614 Linear Regression We find the lines of best fit using both a linear and an inhomogeneous quadratic
kernel, with regularization constant 0.1. The linear kernel yields the following fit
Y0 . 1 6 8X
On the other hand, using the quadratic inhomogeneous kernel over X comprising constant 1, linear X, and quadratic terms X2, yields the fit
Y0.0860.026X0.922X2
The linear in gray and quadratic in black fit are both shown in Figure 23.7. The SSE error, YY2, is 13.82 for the linear kernel and 4.33 for the quadratic kernel. It is clear that the quadratic kernel as expected gives a much better fit to the data.
Example 23.10 Kernel Ridge Regression. Consider the Iris principal components dataset shown in Figure 23.8. Here X1 and X2 denote the first two principal components. The response variable Y is binary, with value 1 corresponding to Irisvirginica points on the top right, with Y value 1 and 0 corresponding to Irissetosa and Irisversicolor other two groups of points, with Y value 0.
X1

X1

Y

X2
a Linear Kernel
Y

X2
b Inhomogeneous Quadratic Kernel

Figure 23.8. Kernel ridge regression: linear and inhomogeneous quadratic kernels.

L1 Regression: Lasso 615
Figure 23.8a shows the fitted regression plane using a linear kernel with ridge value 0.01:
Y0.3330.167X10.074X2
Figure 23.8b shows the fitted model when we use an inhomogeneous quadratic
kernel with 0.01:
Y 0 . 0 30 . 1 6 7X 10 . 1 8 6X 20 . 0 9 2X 210 . 1X 1X 20 . 0 2 9X 2 2
The SSE error for the linear model is 15.47, whereas for the quadratic kernel it is 8.44, indicating a better fit for the training data.
23.6 L1 REGRESSION: LASSO
The Lasso, which stands for least absolute selection and shrinkage operator, is a regularization method that aims to sparsify the regression weights. Instead of using the L2 or Euclidean norm for weight regularization as in ridge regression see Eq. 23.34, the Lasso formulation uses the L1 norm for regularization
23.43
1 2
min JwYDw w1
w2
where 0 is the regularization constant and
d w1wi
i1
Note that, we have added the factor 1 for convenience; it does not change the objective.
2
Furthermore, we assume that the data comprising the d independent attributes X1, X2,
. . . , Xd , and the response attribute Y, have all been centered. That is, we assume that DD1T
YYY1
where1Rn isthevectorofallones,X1,X2,,XdT isthemultivariatemean for the data, and Y is the mean response value. One benefit of centering is that we do not have to explicitly deal with the bias term bw0 , which is important since we usually do not want to penalize b. Once the regression coefficients have been estimated, we can obtain the bias term via Eq. 23.26, as follows:
d j1
The main advantage of using the L1 norm is that it leads to sparsity in the solution vector. That is, whereas ridge regression reduces the value of the regression coefficients
bw0Y
wjXj

616
Linear Regression
3 2 1 0
1 2
Figure 23.9. Absolute value function in black and two of its subgradients in gray.
wi , they may remain small but still nonzero. On the other hand, L1 regression can drive the coefficients to zero, resulting in a more interpretable model, especially when there are many predictor attributes.
T h e L a s s o o b j e c t i v e c o m p r i s e s t w o p a r t s , t h e s q u a r e d e r r o r t e r mYD w2 w h i c h is convex and differentiable, and the L1 penalty term w1 di1wi, which is convex but unfortunately nondifferentiable at wi0. This means that we cannot simply compute the gradient and set it to zero, as we did in the case of ridge regression. However, these kinds of problems can be solved via the generalized approach of subgradients.
23.6.1 Subgradients and Subdifferential Consider the absolute value function f : RR
fww
which is plotted in Figure 23.9 black line.
When w0, we can see that its derivative is fw1, and when w0, its
derivative is f w1. On the other hand, there is a discontinuity at w0 where the derivative does not exist.
However, we use the concept of a subgradient that generalizes the notion of a derivative. For the absolute value function, the slope m of any line that passes through w0 that remains below or touches the graph of f is called a subgradient of f at w0. Figure 23.9 shows two such subgradients, namely the slopes m0.5 and m0.25. The corresponding lines are shown in gray. The set of all the subgradients at w is called the subdifferential, denoted as w. Thus, the subdifferential at w0 is given as
w1,1
since only lines with slope between 1 and 1 remain below or partially coincide with the absolute value graph.
m0.25
m0.5
3 2 1 0 1 2 3 w
w

L1 Regression: Lasso 617 Considering all the cases, the subdifferential for the absolute value function is
given as:
1 iff w0
w1 iff w0 23.44
1,1 iff w0
We can see that when the derivative exists, the subdifferential is unique and corresponds to the derivative or gradient, and when the derivative does not exist the subdifferential corresponds to a set of subgradients.
23.6.2 Bivariate L1 Regression
We first consider bivariate L1 regression, where we have a single independent attribute X and a response attribute Y both centered. The bivariate regression model is given
as
y iwx i
The Lasso objective from Eq. 23.43 can then be written as
1 n
min Jw yi wxi2 w
23.45
w 2 i1
We can compute the subdifferential of this objective as follows:
1 n
Jw 22yi wxixiw
n
i1
x iy iw
n i1
x i2w
i1
XTYwX2 w
We can solve for w by setting the subdifferential to zero; we get
Jw0
wX2w X TY wwXTY
23.46
where 1X20 is a scaling constant.
Corresponding to the three cases for the subdifferential of the absolute value
function in Eq. 23.44, we have three cases to consider: CaseIw0andw1: Inthiscaseweget
wX TY
S i n c e w0 , t h i s i m p l i e s X T Y o rX T Y.

618 Linear Regression CaseIIw0andw1: Inthiscasewehave
wX TY
S i n c e w0 , t h i s i m p l i e s X T Yo rX T Y.
CaseIIIw0andw1,1: Inthiscase,weget
w X TY , X TY
H o w e v e r , s i n c e w0 , t h i s i m p l i e s X TY0 a n d X TY0 . I n o t h e r w o r d s , X T Y a n d X T Y. T h e r e f o r e ,X T Y.
Let 0 be some fixed value. Define the softthreshold function S : RR as follows:
S zsignzmax0, z23.47 Then the above three cases can be written compactly as:
wS X TY 2 3 . 4 8with , where w is the optimal solution to the bivariate L1 regression problem
in Eq. 23.45.
23.6.3 Multiple L1 Regression
Consider the L1 regression objective from Eq. 23.43 1d 2
min Jw Y wiXi w1 w 2i1
1d dd d 2 YTY2 wiXiTY wiwjXiTXjwi
i1 i1 j1 i1
We generalize the bivariate solution to the multiple L1 formulation by optimizing for each wk individually, via the approach of cyclical coordinate descent. We rewrite the L1 objective by focusing only on the wk terms, and ignoring all terms not involving wk , which are assumed to be fixed:
1 d
min JwkwkXkTY wk2Xk2wk wjXkTXjwk 23.49
wk 2 jk

L1 Regression: Lasso 619 Setting the subdifferential of Jwkto zero, we get
Jwk0
d
w k X k2w k X kT Yw k X k2w k X kT Y
j1 wkXk2wkwkXkT2XkT YDw
We can interpret the above equation as specifying an iterative solution for wk . In essence, we let the wk on the left hand side be the new estimate for wk, whereas we treat the wk on the right hand side as the previous estimate. More concretely, let w t represent the weight vector at step t, with wkt denoting the estimate for wk at time t. The new estimate for wk at step t1 is then given as
t1 1 t1 t 1 T t wk Xk2 wk wkXk2 Xk YDw
wt1 wt1wt XTYDwt k kkk
j k d
w jX kT X j
w jX kT X jw k X kT X k
23.50 where 1Xk20 is just a scaling constant. Based on the three cases for wt1
and the subdifferential wt1, following a similar approach as in the bivariate case k
Eq. 23.48, the new estimate for wk can be written compactly as t1 t T t
The pseudocode for L1 regression is shown in Algorithm 23.5. The algorithm starts with a random estimate for w at step t0, and then cycles through each dimension to estimate wk until convergence. Interestingly, the term XkTYDwt is in fact the gradient at wk of the squared error term in the Lasso objective, and thus the update equation is the same as gradient descent with step size , followed by the softthreshold operator. Note also that sinceis just a positive scaling constant, we make it a parameter of the algorithm, denoting the step size for gradient descent.
k
wk S wkXk YDw 23.51
Example 23.11 L1 Regression. We apply L1 regression to the full Iris dataset with n150 points, and four independent attributes, namely sepalwidth X1, sepallength X2, petalwidth X3, and petallength X4. The Iris type attribute comprises the response variable Y. There are three Iris types, namely Irissetosa, Irisversicolor, and Irisvirginica, which are coded as 0, 1 and 2, respectively.

620
Linear Regression
Algorithm 23.5: L1 Regression Algorithm: Lasso L1REGRESSION D,Y,,,o:
1 2
3 4 5 6 7 8 9 10
11 12 13
meanDcompute mean DD1Tcenterthedata
YYY 1 center the response
t0stepiteration counter
w trandom vector in Rdinitial weight vector repeat
foreach k1,2, ,d do
w ktX kTYD w tc o m p u t e g r a d i e n t a t w k
wt1 wt wtupdateestimatefor w kkk k
wt1S wt1apply softthreshold function k k
tt1
untilwtwt1o
bY wtTcomputethebiasterm
The L1 regression estimates for different values ofwith 0.0001 are shown below:
0: Y0.1920.109X10.045X20.226X30.612X4
1: Y0.0770.076X10.015X20.253X30.516X4
5: Y0.5530.0X10.0X20.359X30.170X4 SSE8.82
10: Y0.5750.0X10.0X20.419X30.0X4
The L1 norm values for the weight vectors excluding the bias term are 0.992, 0.86, 0.529, and 0.419, respectively. It is interesting to note the sparsity inducing effect of Lasso, as observed for 5 and 10, which drives some of the regression coefficients to zero.
We can contrast the coefficients for L2 ridge and L1 Lasso regression by comparing models with the same level of squared error. For example, for 5, the L1 model has SSE8.82. We adjust the ridge value in L2 regression, with 35 resulting in a similar SSE value. The two models are given as follows:
L1 : Y0.5530.0X10.0X20.359X30.170X4 w1 0.529 L2 : Y0.3940.019X10.051X20.316X30.212X4 w1 0.598
where we exclude the bias term when computing the L1 norm for the weights. We can observe that for L2 regression the coefficients for X1 and X2 are small, and therefore less important, but they are not zero. On the other hand, for L1 regression, the coefficients for attributes X1 and X2 are exactly zero, leaving only X3 and X4; Lasso can thus act as an automatic feature selection approach.
SSE6.96 SSE7.09
SSE10.15

Further Reading 621
23.7 FURTHER READING
For a geometrical approach to multivariate statistics see Wickens 2014, and Saville and Wood 2012. For a description of the class of generalized linear models, of which linear regression is a special case, see Agresti 2015. An excellent overview of Lasso and sparsity based methods is given in Hastie, Tibshirani, and Wainwright 2015. For a description of cyclical coordinate descent for L1 regression, and also other approaches for sparse statistical models see Hastie, Tibshirani, and Wainwright 2015.
Agresti, A. 2015. Foundations of Linear and Generalized Linear Models. Hoboken, NJ: John WileySons.
Hastie, T., Tibshirani, R., and Wainwright, M. 2015. Statistical Learning with Sparsity: The Lasso and Generalizations. Boca Raton, FL: CRC press.
Saville, D. J. and Wood, G. R. 2012. Statistical Methods: The Geometric Approach. New York: Springer ScienceBusiness Media.
Wickens, T. D. 2014. The Geometry of Multivariate Statistics. New York: Psychology Press, TaylorFrancis Group.
23.8 EXERCISES
Q1. Consider the data in Table 23.1, with Y as the response variable and X as the independent variable. Answer the following questions:
a Compute the predicted response vector Y for least square regression using the
geometric approach.
b Basedonthegeometricapproachextractthevalueofthebiasandslope,andgive
the equation of the best fitting regression line. Table 23.1. Data for Q1
X
Y
5 0 2 1 2
2 1 1 1 0
Q2. Given data in Table 23.2, let 0.5 be the regularization constant. Compute the equation for ridge regression of Y on X, where both the bias and slope are
penalized. Use the fact that the inverse of the matrix A a bis given as
1 1db cd AdetA c a ,withdetAadbc.

622 Linear Regression Table 23.2. Data for Q2
X
Y
1 2 4 6
1 3 4 3
Q3. ShowthatEq.23.11holds,thatis
ni1 x iy i n X Yni1x i X y i Y
w ni1xi2n2Xni1xi X2
Q4. Derive an expression for the bias term b and the weights ww1,w2, ,wdT in
multiple regression using Eq. 23.16, without adding the augmented column.
Q5. Show analytically i.e., without using the geometry that the bias term in multiple regression, as shown in Eq. 23.26, is given as
w0Yw1X1 w2X2 wdXd
Q6. ShowthatYTo0.
Q7. Provethato2Y2Y2.
Q8. Show that if i is an eigenvalue of DTD corresponding to the eigenvector ui, then
i is an eigenvalue of DT DI for the same eigenvector ui .
Q9. Show that 1QTY gives the vector of scalar projections of Y onto each of the
orthogonal basis vectors in multiple regression projU0Y
1QTYprojU1Y.
projUd Y
Q10. FormulateasolutiontotheridgeregressionproblemviaQRfactorization.
Q11. Derive the solution for the weight vector ww1,w2, ,wdT and bias bw0 in ridge regression without subtracting the means from Y and the independent attributes X1 , X2 ,, Xd , and without adding the augmented column.
Q12. Showthatthesolutionforridgeregressionandkernelridgeregressionareexactlythe same when we use a linear kernel.
Q13. ShowthatwSXTYEq.23.48isthesolutionforbivariateL1regression.
Q14. Derive the three cases for the subdifferential in Eq. 23.50, and show that they
correspond to the softthreshold update in Eq. 23.51.
Q15. Show that that the gradient at wk of the SSE term in the L1 formulation is given as
wk1YDw2XkTYDw wk 2
where Y is the response vector, and Xk is the kth predictor vector.

CHAPTER 24 LogisticRegression
GivenasetofpredictorattributesorindependentvariablesX1,X2,,Xd,andgivena categorical response or dependent variable Y, the aim of logistic regression is to predict the probability of the response variable values based on the independent variables. Logistic regression is in fact a classification technique, that given a point xjRd predicts P ci xjfor each class ci in the domain of Y the set of possible classes or values for the response variable. In this chapter, we first consider the binary class problem, where the response variable takes on one of two classes 0 and 1, or positive and negative, and so on. Next, we consider the multiclass case, where there are K classes for the response variable.
24.1 BINARY LOGISTIC REGRESSION
In logistic regression, we are given a set of d predictor or independent variables X1,X2, ,Xd, and a binary or Bernoulli response variable Y that takes on only two values, namely, 0 and 1. Thus, we are given a training dataset D comprising n points xiRd and the corresponding observed values yi0, 1. As done in Chapter 23, we augment the data matrix D by adding a new attribute X0 that is always fixed at the value 1foreachpoint,sothatx i 1,1,x2,,xdT Rd1 denotestheaugmentedpoint,and the multivariate random vector X, comprising all the independent attributes is given as X X0,X1, ,XdT. The augmented training dataset is given as Dcomprising the n augmentedpointsx i alongwiththeclasslabelsyi fori1,2,,n.
Since there are only two outcomes for the response variable Y, its probability mass function for X xis given as:
PY1Xx xPY0Xx 1x
where xis the unknown true parameter value, denoting the probability of Y1
given X x . Further, note that the expected value of Y given X xis
EYXx 1PY1Xx 0PY0Xx PY1X xx
623

624 Logistic Regression
Therefore, in logistic regression, instead of directly predicting the response value, the goal is to learn the probability, P Y1X x, which is also the expected value of Y g i v e n X x.
Since P Y1X x is a probability, it is not appropriate to directly use the linear regression model
fx 0 x0 1 x1 2 x2 d xdTx
w h e r e0 ,1 ,,dTR d1 i s t h e t r u e a u g m e n t e d w e i g h t v e c t o r , w i t h0 t h e true unknown bias term, and i the true unknown regression coefficient or weight for attributeXi.ThereasonwecannotsimplyusePY1Xx fx isduetothefact that f x can be arbitrarily large or arbitrarily small, whereas for logistic regression, we require that the output represents a probability value, and thus we need a model that results in an output that lies in the interval 0,1. The name logistic regression comes from the logistic function also called the sigmoid function that meets this requirement. It is defined as follows:
z 1expz 24.1 1expz 1expz
The logistic function squashes the output to be between 0 and 1 for any scalar input z see Figure 24.1. The outputz can therefore be interpreted as a probability.
Example 24.1 Logistic Function. Figure 24.1 shows the plot for the logistic func tion for z ranging fromto . In particular consider what happens when z is ,and 0; we have
1
1exp
1
1exp
1 1 1
1 0
1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
0
0
z
Figure 24.1. Logistic function.
z

Binary Logistic Regression 625
0 11 0.5 1exp0 2
As desired,z lies in the range 0, 1, and z0 is the threshold value in the sense that for z0 we have z0.5, and for z0, we have z0.5. Thus, interpretingz as a probability, the larger the z value, the higher the probability.
Another interesting property of the logistic function is that
1z1 expz1expzexpz1 z 24.2 1expz 1expz 1expz
Using the logistic function, we define the logistic regression model as follows: PY1Xx x fxTxexp Tx24.3
1e x pT x
Thus, the probability that the response is Y1 is the output of the logistic function for
the input T x. On the other hand, the probability for Y0 is given as PY0Xx 1PY1XxTx1
1e x pT x w h e r e w e u s e d E q .2 4 . 2, t h a t i s , 1zzf o r zT x.
Combining these two cases the full logistic regression model is given as PYXxTx YTx 1Y
24.4
since Y is a Bernoulli random variable that takes on either the value 1 or 0. We can observe that PY X xTxwhen Y1 and PY X xTxwhen Y0, as desired.
LogOdds Ratio
Define the odds ratio for the occurrence of Y1 as follows: oddsY1XxPY1XxTx
PY0Xx Tx expTx 1expTx
1e x pT xe x pT x
The logarithm of the odds ratio, called the logodds ratio, is therefore given as: lnoddsY1Xx ln PY1Xxlnexp TxTx
2 4 . 5
24.6
1PY1X x 0 x0 1 x1 d xd

626 Logistic Regression The logodds ratio function is also called the logit function, defined as
logitzlnz1z
It is the inverse of the logistic function. We can see that lnoddsY1Xx logitPY1Xx
The logistic regression model is therefore based on the assumption that the logodds ratio for Y1 given X xis a linear function or a weighted sum of the independent attributes. In particular, let us consider the effect of attribute Xi by fixing the values for all other attributes in Eq. 24.6; we get
lnoddsY1Xx i xi C
oddsY1Xx expi xi Cexpi xiexpCexpi xi
where C is a constant comprising the fixed attributes. The regression coefficient i can therefore be interpreted as the change in the logodds ratio for Y1 for a unit change in Xi , or equivalently the odds ratio for Y1 increases exponentially per unit change in Xi.
24.1.1 Maximum Likelihood Estimation
Let Dbe the augmented training dataset comprising the n augmented points x i along with their labels yi. Let w w0,w1, ,wdT be the augmented weight vector for estimating . Note that w0b denotes the estimated bias term, and wi the estimated weight for attribute Xi. We will use the maximum likelihood approach to learn the weight vector w. Likelihood is defined as the probability of the observed data given the estimated parameters w. We assume that the binary response variables yi are all independent. Therefore, the likelihood of the observed responses is given as
n
nT T
ln Lw yi ln wx i 1yiln wx i 24.7
i1
The negative of the loglikelihood can also be considered as an error function, the
crossentropy error function, given as follows:
n 1 1
Ew ln Lw yi ln w Tx i 1yiln 1w Tx i 24.8 i1
The task of maximizing the loglikelihood is therefore equivalent to minimizing the crossentropy error.
w Tx iyi w Tx i1yi
Instead of trying to maximize the likelihood, we can maximize the logarithm of the
Lw PYw
i1
Pyix i
likelihood, called loglikelihood, to convert the product into a summation as follows:
n i1

Binary Logistic Regression 627
Typically, to obtain the optimal weight vector w , we would differentiate the loglikelihood function with respect to w, set the result to 0, and then solve for w. However, for the loglikelihood formulation in Eq. 24.7 there is no closed form solution to compute the weight vector w. Instead, we use an iterative gradient ascent method to compute the optimal value, since Eq. 24.7 is a concave function, and has a unique global optimal.
The gradient ascent method relies on the gradient of the loglikelihood function, which can be obtained by taking its partial derivative with respect to w, as follows:
n
wwln Lw wyi lnzi1yilnzi 24.9
i1
where ziwT xi . We use the chain rule to obtain the derivative of ln ziwith respect
to w. We note the following facts:
lnzi 1
zi zi
lnziln1zi 1
zi zi 1zi zi zi1zizizi
zi
z i wT xixi
w wAs per the chain rule, we have
lnzilnziziziw z i z iw
1 zizix i zix i zi
Likewise, using the chain rule, we have
lnzilnzizizi
24.10
w z i z iw
1 zi1zix i zix i
1zi Substituting the above into Eq. 24.9 we get
n i1
n
yizizi x i zix i
i1 n
sincezizi1
i1
yi zix i 1yizix i
w
yi zix i, i1
nT
y iwxi xi
2 4 . 1 1

628
Logistic Regression
Algorithm 24.1: Logistic Regression: Stochastic Gradient Ascent LOGISTICREGRESSIONSGA D,,o:
1 2
3 4 5 6 7 8
9 10 11
foreachxiDdox Ti 1 xTimaptoRd1 t0stepiteration counter
w00,,0TRd1initial weight vector repeat
wwt makeacopyof wt foreach x iDin random order do
w ,x iyi w Tx ix i computegradientat x i www ,x iupdateestimatefor w
wt1wu p d a t e wt1
tt1 untilw tw t1o
The gradient ascent method starts at some initial estimate for w, denoted w0. At each step t, the method moves in the direction of steepest ascent, which is given by the gradient vector. Thus, given the current estimate wt , we can obtain the next estimate as follows:
wt1wtwt 2 4 . 1 2
Here, 0 is a userspecified parameter called the learning rate. It should not be too large, otherwise the estimates will vary wildly from one iteration to the next, and it should not be too small, otherwise it will take a long time to converge. At the optimal value of w, the gradient will be zero, i.e., w0, as desired.
Stochastic Gradient Ascent
The gradient ascent method computes the gradient by considering all the data points, and it is therefore called batch gradient ascent. For large datasets, it is typically much faster to compute the gradient by considering only one randomly chosen point at a time. The weight vector is updated after each such partial gradient step, giving rise to stochastic gradient ascent SGA for computing the optimal weight vector w.
The pseudocode for SGA based logistic regression is shown in Algorithm 24.1. Given a randomly chosen point xi , the pointspecific gradient see Eq. 24.11 is given as
w ,x iyi w Tx ix i 24.13
Unlike batch gradient ascent that updates wby considering all the points, in stochastic gradient ascent the weight vector is updated after observing each point, and the updated values are used immediately in the next update. Computing the full gradient in the batch approach can be very expensive. In contrast, computing the partial gradient at each point is very fast, and due to the stochastic updates to w, typically SGA is much faster than the batch approach for very large datasets.

Binary Logistic Regression
629
X
1

X1

b Linear Regression
Y

X2
a Logistic Regression
Y

X2
Figure 24.2. Logistic versus linear regression: Iris principal components data. Misclassified point are shown
in dark gray color. Circles denote Irisvirginica and triangles denote the other two Iris types.
Once the model has been trained, we can predict the response for any new
augmented test point zas follows:
y1 if wTz 0.5 24.14
0 ifw Tz 0.5
Example 24.2 Logistic Regression. Figure 24.2a shows the output of logistic regression modeling on the Iris principal components data, where the independent attributes X1 and X2 represent the first two principal components, and the binary response variable Y represents the type of Iris flower; Y1 corresponds to Irisvirginica, whereas Y0 corresponds to the two other Iris types, namely Irissetosa and Irisversicolor.
The fitted logistic model is given as
ww0,w1,w2T 6.79,5.07,3.29T
PY1x w Tx1 1exp6.795.071 3.292
Figure 24.2a plots P Y1x for various values of x.

630 Logistic Regression
Given x , if PY1x 0.5, then we predict y1, otherwise we predict y0. Figure 24.2a shows that five points shown in dark gray are misclassified. For example, for x 1, 0.52, 1.19T we have:
PY1x w Tx 0.240.44 PY0x 1PY1x 0.54
Thus, the predicted response for xis y0, whereas the true class is y1.
Figure 24.2 also contrasts logistic versus linear regression. The plane of best fit in
linear regression is shown in Figure 24.2b, with the weight vector: w0.333,0.167,0.074T
y fx 0.3330.16710.0742
Since the response vector Y is binary, we predict the response class as y1 if f x0.5, and y0 otherwise. The linear regression model results in 17 points being misclassified dark gray points, as shown in Figure 24.2b.
Since there are n150 points in total, this results in a training set or insample accuracy of 88.7 for linear regression. On the other hand, logistic regression misclassifies only 5 points, for an insample accuracy of 96.7, which is a much better fit, as is also apparent in Figure 24.2.
24.2 MULTICLASS LOGISTIC REGRESSION
We now generalize logistic regression to the case when the response variable Y can take on K distinct nominal categorical values called classes, i.e., Yc1,c2, ,cK. We model Y as a Kdimensional multivariate Bernoulli random variable see Section 3.1.2. Since Y can assume only one of the K values, we use the onehot encoding approach to map each categorical value ci to the Kdimensional binary vector
i1 Ki
T
ei 0,,0,1,0,,0
whose ith element eii1, and all other elements eij0, so that Kj1 eij1. Henceforth, we assume that the categorical response variable Y is a multivariate BernoullivariableYe1,e2,,eK,withYj referringtothejthcomponentofY.
The probability mass function for Y given X xis
PYeiX x ix , for i1,2,,K
where i x is the unknown probability of observing class ci given X x. Thus, there are K unknown parameters, which must satisfy the following constraint:
K i1
ix
K i1
PYeiXx 1

Multiclass Logistic Regression 631 Given that only one element of Y is 1, the probability mass function of Y can be
written compactly as
PYXx K jx Yj 24.15 j1
NotethatifYei,onlyYi 1andtherestoftheelementsYj 0forji.
In multiclass logistic regression, we select one of the values, say cK, as a reference or base class, and consider the logodds ratio of the other classes with respect to cK; we assume that each of these logodd ratios are linear in X, but with a different augmented weight vector i , for class ci . That is, the logodds ratio of class ci with respect to class
cK is assumed to satisfy
lnoddsYeiXx lnPYeiXx lnixTi x
where i0i is the true bias value for class ci .
We can rewrite the above set of equations as follows:
ixe x pTi xKx
ixe x pTi x Kx G i v e n t h a tKj1jx 1 , w e h a v e
K jx 1
j1
e x pjT x KxKx1
2 4 . 1 6
j K
Kx
1
1 jK exp jTx
PYe KX xKx i0 x0 i1 x1 id xd
Substituting the above into Eq. 24.16, we have
T e x pTi x
ix exp i x Kx 1jKexp jTx
Finally, setting K0, we have expTK x1, and thus we can write the full model for
multiclass logistic regression as follows:
ix exp Ti x, foralli1,2,,K 24.17 Kj1 e x p j T x

632 Logistic Regression
This function is also called the softmax function. When K2, this formulation yields exactly the same model as in binary logistic regression.
It is also interesting to note that the choice of the reference class is not important, since we can derive the logodds ratio for any two classes ci and cj as follows:
lnix ln ix Kx jxKxjx
l n ix l n Kx
Kx l nix
Kx
jx jx
Kx
l n
That is, the logodds ratio between any two classes can be computed from the
difference of the corresponding weight vectors.
24.2.1 Maximum Likelihood Estimation
Let Dbe the augmented dataset comprising n points xi and their labels yi . We assume that yi is a onehot encoded multivariate Bernoulli response vector, so that yij denotesthejthelementofyi.Forexample,ifyi ea,thenyij 1forja,andyij 0 for all ja. We assume that all the yis are independent. Let wiRd1 denote the estimated augmented weight vector for class ci , with wi0bi denoting the bias term.
To find the K sets of regression weight vectors wi, for i1,2, ,K, we use the gradient ascent approach to maximize the loglikelihood function. The likelihood of the data is given as
n nK y LW PYWPyiX x i jx i ij
i1 i1 j1
where W w1,w2, ,wK is the set of K weight vectors. The loglikelihood is then
given as:
n K
ln LW
i1 j1
yij lnjx i
n Kexpw jTx i
yij ln K expw Tx i 24.18
i1 j1 a1 a
Ti x jT xijT x
Note that the negative of the loglikelihood function can be regarded as an error function, commonly known as crossentropy error.
We note the following facts:
l n jxi1
jx i ax i1ax ix i ifjawa a xi j xi xi if ja
jxi jxi

Multiclass Logistic Regression 633 Let us consider the gradient of the loglikelihood function with respect to the weight
v e c t o r wa :
wa
l nLWwa
n Kl n jx i jx i
yij jx iw a
n yia1 ax i1ax ix i yij1
i1 j1
i1 ax i ja j x i
ax ijx ix i Kj 1 yij1, since only one element of yi can
n yia yia ax iyij ax ix i i1 ja
nK
yia yijax i x i
i1 j1
n yia ax ix i
i1
The last step follows from the fact that be 1.

For stochastic gradient ascent, we update the weight vectors by considering only onepointatatime.Thegradientoftheloglikelihoodfunctionwithrespecttow j ata given point x i is given as
w j,x iyij jx ix i 24.19 which results in the following update rule for the j th weight vector:
wt1 wt wt, x i 24.20 jjj
where w jt denotes the estimate of w j at step t, andis the learning rate. The pseudocode for the stochastic gradient ascent method for multiclass logistic regression is shown in Algorithm 24.2. Notice that the weight vector for the base class cK is never updated; it remains wK0, as required.
Once the model has been trained, we can predict the class for any new augmented test point zas follows:
y argmaxiz argmaxexpw Ti z 24.21 c c K e x pwT z
That is, we evalute the softmax function, and then predict the class of zas the one with the highest probability.
i ij1j

634
Logistic Regression
Algorithm 24.2: Multiclass Logistic Regression Algorithm LOGISTICREGRESSIONMULTICLASS D,,o:
1 2 3
4 5 6
7 8 9
10 11
12
13 14
15 16
17 18
foreachxTi,yiDdo
xTi 1 x Tim a p t o R d1
yiej if yicjmap yi to Kdimensional Bernoulli vector
t0stepiteration counter foreach j1,2, ,K do
w jt0,,0TRd1initial weight vector
repeat
foreach j1, 2,, K1 do
w j w jtmakeacopyofw jt
X1

foreach x iDin random order do foreach j1, 2,, K1 do
e x pwjT xi
jxiKa1 e x pwTa xi
w j,x iyij jx ix i computegradientat w j w j w j w j,x iupdateestimatefor w j
foreach j1, 2,, K1 do
wt1wu p d a t e wt1
jjj tt1
until K1w tw t1o j1 j j
x 1

Y
3x
2x

X2
Figure 24.3. Multiclass logistic regression: Iris principal components data. Misclassified point are shown in dark gray color. All the points actually lie in the X1,X2 plane, but c1 and c2 are shown displaced along Y with respect to the base class c3 purely for illustration purposes.
Example 24.3. Consider the Iris dataset, with n150 points in a 2D space spanned by the first two principal components, as shown in Figure 24.3. Here, the response variable takes on three values: Yc1 corresponds to Irissetosa shown as

Further Reading 635
squares, Yc2 corresponds to Irisversicolor as circles and Yc3 corresponds to Irisvirginica as triangles. Thus, we map Yc1 to e11, 0, 0T , Yc2 to e20,1,0T and Yc3 to e30,0,1T.
The multiclass logistic model uses Yc3 Irisvirginica; triangles as the reference or base class. The fitted model is given as:
w 1 3.52,3.62,2.61T
w 2 6.95,5.18,3.40T w3 0 , 0 , 0T
Figure 24.3 plots the decision surfaces corresponding to the softmax functions:
1x
2x
3x
e x pwT1 x
1expwT1 x expwT2 x
e x pwT2 x
1expwT1 x expwT2 x1
1expwT1 x expwT2 x
The surfaces indicate regions where one class dominates over the others. It is important to note that the points for c1 and c2 are shown displaced along Y to emphasize the contrast with c3, which is the reference class.
Overall, the training set accuracy for the multiclass logistic classifier is 96.7, since it misclassifies only five points shown in dark gray. For example, for the point x 1, 0.52, 1.19T, we have:
1x 0 2x 0.448 3x 0.552
Thus, the predicted class is yargmaxci ix c3, whereas the true class is yc2.
24.3 FURTHER READING
For a description of the class of generalized linear models, of which logistic regression
is a special case, see Agresti 2015.
Agresti, A. 2015. Foundations of Linear and Generalized Linear Models. Hoboken, NJ: John WileySons.
24.4 EXERCISES
Q1. Show that z z z, where is the logistic function.
z
Q2. Showthatthelogitfunctionistheinverseofthelogisticfunction.

636
Logistic Regression
Q3. Giventhesoftmaxfunction:
Show that
xj
e x pwjT xKi1 e x pwTi x
jxax 1ax x wa a xj xx
ifj a if ja

CHAPTER 25 Neural Networks
Artificial neural networks or simply neural networks are inspired by biological neuronal networks. A real biological neuron, or a nerve cell, comprises dendrites, a cell body, and an axon that leads to synaptic terminals. A neuron transmits information via electrochemical signals. When there is enough concentration of ions at the dendrites of a neuron it generates an electric pulse along its axon called an action potential, which in turn activates the synaptic terminals, releasing more ions and thus causing the information to flow to dendrites of other neurons. A human brain has on the order of 100 billion neurons, with each neuron having between 1,000 to 10,000 connections to other neurons. Thus, a human brain is a neuronal network with 100 trillion to a quadrillion 1015 interconnections! Interestingly, as far as we know, learning happens by adjusting the synaptic strengths, since synaptic signals can be either excitatory or inhibitory, making the postsynaptic neuron either more or less likely to generate an action potential, respectively.
Artificial neural networks are comprised of abstract neurons that try to mimic real neurons at a very high level. They can be described via a weighted directed graph GV, E, with each node viV representing a neuron, and each directed edge vi , vj E representing a synaptic to dendritic connection from vi to vj . The weight of the edge wij denotes the synaptic strength. Neural networks are characterized by the type of activation function used to generate an output, and the architecture of the network in terms of how the nodes are interconnected. For example, whether the graph is a directed acyclic graph or has cycles, whether the graph is layered or not, and so on. It is important to note that a neural network is designed to represent and learn information by adjusting the synaptic weights.
25.1 ARTIFICIAL NEURON: ACTIVATION FUNCTIONS
An artificial neuron acts as a processing unit, that first aggregates the incoming signals via a weighted sum, and then applies some function to generate an output. For example, a binary neuron will output a 1 whenever the combined signal exceeds a threshold, or 0 otherwise.
637

638
Neural Networks
1 x0 bk
zk wkdi1 w i kx ib k
x1
w1k x2 w2k
. wdk .
xd
Figure 25.1 shows the schematic of a neuron zk that has incoming edges from neurons x1, ,xd. For simplicity, both the name and the output value of a neuron are denoted by the same symbol. Thus, xi denotes neuron i, and also the value of that neuron. The net input at zk , denoted netk , is given as the weighted sum
d i1
wherewk w1k,w2k,,wdkT Rd andxx1,x2,,xdT Rd isaninputpoint. Notice that x0 is a special bias neuron whose value is always fixed at 1, and the weight from x0 to zk is bk, which specifies the bias term for the neuron. Finally, the output value of zk is given as some activation function, f , applied to the net input at zk
zk fnetk
The value zk is then passed along the outgoing edges from zk to other neurons. Neurons differ based on the type of activation function used. Some commonly used
activation functions, illustrated in Figure 25.2, are:
LinearIdentityFunction: Identityisthesimplestactivationfunction;itsimplyreturns
its argument, and is also called a linear activation function:
f netk netk 25.2
Figure 25.2a plots the identity function. To examine the role of the bias term, note that netk0 is equivalent to wTxbk. That is, the output transitions from negative to positive when the weighted sum of the inputs exceeds bk, as shown in Figure 25.2b.
StepFunction: Thisisabinaryactivationfunction,wheretheneuronoutputsa0ifthe net value is negative or zero, and 1 if the net value is positive see Figure 25.2c.
f netk0 if netk0 25.3 1 ifnetk0
Figure 25.1. Artificial neuron: aggregation and activation.
netk bk
wik xi bk wTx 25.1

25.1 Artificial Neuron: Activation Functions
639

00

1.0
0.5
0
netk aLinearzk versusnetk

1.0
0.5
bk 0wTx
bLinearzk versuswTx
bk 0wTx
d Step zk versus wTx
bk 0wTx
f Rectified Linear zk versus wTx
00
0 netk
c Step zk versus netk

00

1.0
0.5

0
netk
e Rectified Linear zk versus netk

1.0
0.5
00
0 bk0
netk wTx
g Sigmoid zk versus netk h Sigmoid zk versus wTx
11
00
1
0
netk
i Hyperbolic Tangent zk versus netk
1
bk0
wTx
j Hyperbolic Tangent zk versus wT x
Figure 25.2. Neuron activation functions; also illustrating the effect of bias.
zk zk zk zk zk
zk
zk
zk
zk
zk

640 Neural Networks It is interesting to note that the transition from 0 to 1 happens when the weighted
sum of the inputs exceeds bk, as shown in Figure 25.2d.
Rectified Linear Unit ReLU: Here, the neuron remains inactive if the net input is less than or equal to zero, and then increases linearly with netk, as shown in Figure 25.2e.
fnetk0 ifnetk 0 25.4 netk if netk0
An alternative expression for the ReLU activation is given as f netk max0, netk . The transition from zero to linear output happens when the weighted sum of the inputs exceeds bk see Figure 25.2f.
Sigmoid: The sigmoid function, illustrated in Figure 25.2g squashes its input so that the output ranges between 0 and 1
f netk 1 25.5 1expnetk
When netk0, we have f netk 0.5, which implies that the transition point where the output crosses 0.5 happens when the weighted sum of the inputs exceeds bk see Figure 25.2h.
Hyperbolic Tangent tanh: The hyperbolic tangent or tanh function is similar to the sigmoid, but its output ranges between 1 and 1 see Figure 25.2i.
fnetk expnetkexpnetkexp2netk1 25.6 expnetk expnetkexp2netk 1
When netk0, we have f netk 0, which implies that the output transitions from negative to positive when the weighted sum of the inputs exceeds bk, as shown in Figure 25.2j.
Softmax: Softmax is a generalization of the sigmoid or logistic activation function. Softmax is mainly used at the output layer in a neural network, and unlike the other functions it depends not only on the net input at neuron k, but it depends on the net signal at all other neurons in the output layer. Thus, given the net input vector,
zk
netj netk
Figure 25.3. Softmax netk versus netj .

Artificial Neuron: Activation Functions 641 netnet1,net2, ,netpT, for all the p output neurons, the output of the softmax
function for the kth neuron is given as
fnet net expnetk 25.7
Figure 25.3 plots the softmax activation for netk versus the net signal netj , with all other net values fixed at zero. The output behaves similar to a sigmoid curve for any given value of netj .
25.1.1 DerivativesforActivationFunctions
For learning using a neural network, we need to consider the derivative of an activation function with respect to its argument. The derivatives for the activation functions are given as follows:
IdentityLinear: The identity or linear activation function has a derivative of 1 with respect to its argument, giving us:
f netj 1 25.8 netj
Step: The step function has a derivative of 0 everywhere except for the discontinuity at 0, where the derivative is .
ReLU: The ReLU function Eq. 25.4 is nondifferentiable at 0, nevertheless for other values its derivative is 0 if netj 0 and 1 if netj 0. At 0, we can set the derivative to be any value in the range 0,1, a simple choice being 0. Putting it all together, we have
fnetj0 ifnetj 0 25.9 netj 1 ifnetj0
Even though ReLU has a discontinuity at 0 it is a popular choice for training deep neural networks.
k pi1 expneti
Sigmoid: ThederivativeofthesigmoidfunctionEq.25.5isgivenas
fnetj fnetj1fnetj 25.10
n e tj
HyperbolicTangent: ThederivativeofthetanhfunctionEq.25.6isgivenas
fnetj 1fnetj2 25.11n e tj
Softmax: The softmax activation function Eq. 25.7 is a vector valued function, which maps a vector input netnet1 , net2 ,, netp T to a vector of probability

642 Neural Networks values. Softmax is typically used only for the output layer. The partial derivative
offnetjwithrespecttonetj isgivenas
fnetjnet fnetj1fnetj
netj
whereas the partial derivative of f netjwith respect to netk , with kj is given as
fnetjnet fnetkfnetjnetk
Since softmax is used at the output layer, if we denote the ith output neuron as oi, then f neti oi , and we can write the derivative as:
fnetjnet ojoj1oj ifkj 25.12netknetk okoj if kj
25.2 NEURAL NETWORKS: REGRESSION AND CLASSIFICATION
Networks of artificial neurons are capable of representing and learning arbitrarily
complex functions for both regression and classification tasks.
25.2.1 Regression
Consider the multiple linear regression problem, where given an input xiRd , the
goal is to predict the response as follows
yi bw1xi1 w2xi2 wdxid
Here,bisthebiasterm,andwj istheregressioncoefficientorweightforattributeXj. Given a training data D comprising n points xi in a ddimensional space, along with their corresponding true response value yi, the bias and weights for linear regression are chosen so as to minimize the sum of squared errors between the true and predicted response over all data points
n i1
As shown in Figure 25.4a, a neural network with d1 input neurons x0 , x1 ,, xd , including the bias neuron x01, and a single output neuron o, all with identity activation functions and with yo, represents the exact same model as multiple linear regression. Whereas the multiple regression problem has a closed form solution, neural networks learn the bias and weights via a gradient descent approach that minimizes the squared error.
Neural networks can just as easily model the multivariate linear regression task, where we have a pdimensional response vector yiRp instead of a single value yi. That is, the training data D comprises n points xiRd and their true response vectors yiRp . As shown in Figure 25.4b, multivariate regression can be modeled by a neural
S S E
y iy i2

Neural Networks: Regression and Classification
643
1 x0
x1 .
xd
1 x0
o x1 w11
a Single Output
Figure 25.4. Linear and logistic regression via neural networks.
b1
bp
b w1
wd
w1p wd1
o1 .
op
.
xd wdp
network with d1 input neurons, and p output neurons o1,o2, ,op, with all input and output neurons using the identity activation function. A neural network learns the weights by comparing its predicted output yoo1,o2, ,opT with the true response vector yy1,y2, ,ypT. That is, training happens by first computing the error function or loss function between o and y. Recall that a loss function assigns a score or penalty for predicting the output to be o when the desired output is y. When the prediction matches the true output the loss should be zero. The most common loss function for regression is the squared error function
1 1p
Ex2yo22 yj oj2
j1
where Ex denotes the error on input x. Across all the points in a dataset, the total sum
of squared errors is
n 1n
E Exi 2 yioi2 i1 i1
b Multiple Outputs
Example 25.1 Neural Networks for Multiple and Multivariate Regression.
Consider the multiple regression of sepal length and petal length on the dependent attribute petal width for the Iris dataset with n150 points. From Example 23.3 we find that the solution is given as
y 0.0140.08210.452
The squared error for this optimal solution is 6.179 on the training data.
Using the neural network in Figure 25.4a, with linear activation for the output and minimizing the squared error via gradient descent, results in the following learned parameters, b0.0096, w10.087 and w20.452, yielding the regression
model
o0.00960.08710.4522
with a squared error of 6.18, which is very close to the optimal solution.

644 Neural Networks
Multivariate Linear Regression For multivariate regression, we use the neural network architecture in Figure 25.4b to learn the weights and bias for the Iris dataset, where we use sepal length and sepal width as the independent attributes, and petal length and petal width as the response or dependent attributes. Therefore, each input point xi is 2dimensional, and the true response vector yi is also 2dimensional. That is, d2 and p2 specify the size of the input and output layers. Minimizing the squared error via gradient descent, yields the following parameters:
b1 b21.83
w11 w12 1.72 0.72 o21.470.7210.502
1.47 o1 1.831.721 1.462 w21 w22 1.46 0.50
The squared error on the training set is 84.9. Optimal least squared multivariate regression yields a squared error of 84.16 with the following parameters
y12.561.7811.342 y2 1.590.731 0.482
25.2.2 Classification
Networks of artificial neurons can also learn to classify the inputs. Consider the binary classification problem, where y1 denotes that the point belongs to the positive class, and y0 means that it belongs to the negative class. Recall that in logistic regression, we model the probability of the positive class via the logistic or sigmoid function:
xP y1x1
1expbwTx
where b is the bias term and ww1,w2, ,wd T is the vector of estimated weights or regression coefficients. On the other hand
Py 0x1Py 1x1x
A simple change to the neural network shown in Figure 25.4a allows it to solve the logistic regression problem. All we have to do is use a sigmoid activation function at the output neuron o, and use the crossentropy error instead of squared error. Given input x, true response y, and predicted response o, recall that the crossentropy error see Eq. 24.8 is defined as
Ex ylno1yln1o
Thus, with sigmoid activation, the output of the neural network in Figure 25.4a is
given as
ofnetosigmoidbwTx 1 x 1expbwTx
which is the same as the logistic regression model.

Neural Networks: Regression and Classification 645
Multiclass Logistic Regression In a similar manner, the multiple output neural network architecture shown in Figure 25.4b can be used for multiclass or nominal logistic regression. For the general classification problem with K classes c1,c2, ,cK, the true response y is encoded as a onehot vector. Thus, class c1 is encoded as e11,0, ,0T, class c2 is encoded as e20,1, ,0T, and so on, with ei0,1K for i1,2, ,K. Thus, we encode y as a multivariate vector ye1,e2, ,eK. Recall that in multiclass logistic regression see Section 24.2 the task is to estimate the per class bias bi and weight vector wiRd , with the last class cK used as the base class with fixed bias bK0 and fixed weight vector wK0,0, ,0TRd. The probability vector across all K classes is modeled via the softmax function see Eq. 24.17:
ix expbi wTi x , foralli1,2,,K Kj1 e x pb jw j T x
Therefore, the neural network shown in Figure 25.4b with pK can solve the multiclass logistic regression task, provided we use a softmax activation at the outputs, and use the Kway crossentropy error see Eq. 24.18, defined as
Ex y1 lno1yK lnoK
where x is an input vector, oo1,o2, ,oKT is the predicted response vector, and yy1,y2, ,yKT is the true response vector. Note that only one element of y is 1, and the rest are 0, due to the onehot encoding.
With softmax activation, the output of the neural network in Figure 25.4b with pK is given as
o Pye xfnet net expnetix i i i pj1 e x pn e t ji
which matches the multiclass logistic regression task. The only restriction we have to impose on the neural network is that the weights on edges into the last output neuron should be zero to model the base class weights wK. However, in practice, we can relax this restriction, and just learn a regular weight vector for class cK.
Example 25.2 Logistic Regression: Binary and Multiclass. We applied the neural network in Figure 25.4a, with logistic activation at the output neuron and crossentropy error function, on the Iris principal components dataset. The output is a binary response indicating Irisvirginica Y1 or one of the other Iris types Y0. As expected, the neural network learns an identical set of weights and bias as shown for the logistic regression model in Example 24.2, namely:
o6.795.07×13.29×2
Next, we we applied the neural network in Figure 25.4b, using a softmax activation and crossentropy error function, to the Iris principal components data with three classes: Irissetosa Y1, Irisversicolor Y2 and IrisvirginicaY3. Thus, we need K3 output neurons, o1, o2, and o3. Further, to obtain the same model as in the multiclass logistic regression from Example 24.3,

646
Neural Networks
X1

1 x
Y
3 x
2 x

X2
Figure 25.5. Neural networks for multiclass logistic regression: Iris principal components data. Misclassified point are shown in dark gray color. Points in class c1 and c2 are shown displaced with respect to the base class c3 only for illustration.
we fix the incoming weights and bias for output neuron o3 to be zero. The model is given as
o1 3.493.6112.652 o2 6.955.1813.402 o3 00x10x2
which is essentially the same as in Example 24.3.
If we do not constrain the weights and bias for o3 we obtain the following model:
o1 0.894.5411.962 o2 3.385.1112.882 o3 4.240.5210.922
The classification decision surface for each class is illustrated in Figure 25.5. The points in class c1 are shown as squares, c2 as circles, and c3 as triangles. This figure should be contrasted with the decision boundaries shown for multiclass logistic regression in Figure 24.3, which has the weights and bias set to 0 for the base class c3.
25.2.3 Error Functions
Typically, for a regression task, we use squared error as the loss function, whereas for classification, we use the crossentropy loss function. Furthermore, when learning from neural networks, we will require the partial derivatives of the error function with respect to the output neurons. Thus, the commonly used error functions and their derivatives are listed below:

Neural Networks: Regression and Classification 647
SquaredError: GivenaninputvectorxRd,thesquarederrorlossfunctionmeasures the squared deviation between the predicted output vector oRp and the true response yRp , defined as follows:
1 1p
Ex2yo22 yj oj2 25.13
j1
The partial derivative of the squared error function with respect to a particular
where Ex denotes the error on input x. output neuron oj is
Ex 12yjoj1ojyj oj 2
Across all the output neurons, we can write this as
Ex oy o
25.14
25.15
CrossEntropyError: Forclassificationtasks,withKclassesc1,c2,,cK,weusually set the number of output neurons pK, with one output neuron per class. Furthermore,eachoftheclassesiscodedasaonehotvector,withclassci encodedas the ith standard basis vector eiei1,ei2, ,eiKT0,1K, with eii1 and eij0 for all ji. Thus, given input xRd, with the true response yy1,y2, ,yKT, where ye1,e2, ,eK, the crossentropy loss is defined as

The partial derivative of the crossentropy error function with respect to a particular output neuron oj is
Ex yj 25.17oj oj
The vector of partial derivatives of the error function with respect to the output neurons is therefore given as
Ex Ex Ex Ex Ty1 y2 yK T
oo ,o ,, oo ,o ,,o 25.18
12K12K
Binary CrossEntropy Error: For classification tasks with binary classes, it is typical to encode the positive class as 1 and the negative class as 0, as opposed to using a onehot encoding as in the general Kclass case. Given an input xRd , with
K i1
Ex
Note that only one element of y is 1 and the rest are 0 due to the onehot encoding.
yi lnoi y1 lno1yK lnoK 25.16 Thatis,ifyei,thenonlyyi 1,andtheotherelementsyj 0forji.

648 Neural Networks true response y0,1, there is only one output neuron o. Therefore, the binary
crossentropy error is defined as
Ex ylno1yln1o 25.19
Here y is either 1 or 0. The partial derivative of the binary crossentropy error function with respect to the output neuron o is
25.20
Ex ylno1yln1o o o
y1y 1 y1o1yo o 1o o1o
oy o1o
25.3 MULTILAYER PERCEPTRON: ONE HIDDEN LAYER
A multilayer perceptron MLP is a neural network that has distinct layers of neurons. The inputs to the neural network comprise the input layer, and the final outputs from the MLP comprise the output layer. Any intermediate layer is called a hidden layer, and an MLP can have one or many hidden layers. Networks with many hidden layers are called deep neural networks. An MLP is also a feedforward network. That is, information flows in the forward direction, and from a layer only to the subsequent layer. Thus, information flows from the input to the first hidden layer, from the first to the second hidden layer, and so on, until it reaches the output layer from the last hidden layer. Typically, MLPs are fully connected between layers. That is, each neuron in the input layer is connected to all the neurons in the first hidden layer, and each neuron in the first hidden layer is connected to all neurons in the second hidden layer, and so on, and finally, each neuron in the last hidden layer is connected to all neurons in the output layer.
For ease of explanation, in this section, we will consider an MLP with only one hidden layer, and we will later generalize the discussion to deep MLPs. For example, Figure 25.6 shows an MLP with one hidden layer. The input layer has d neurons, x1,x2,,xd, and an additional neuron x0 that specifies the biases for the hidden layer. The hidden layer has m neurons, z1,z2, ,zm, and an additional neuron z0 that specifies the biases for the output neurons. Finally, the output layer has p neurons, o1,o2, ,op. The bias neurons have no incoming edges, since their value is always fixed at 1. Thus, in total there are dmmp weight parameters wijand a further mp bias parameters bithat need to be learned by the neural network. These parameters also correspond to the total number of edges in the MLP.
25.3.1 Feedforward Phase
Let D denote the training dataset, comprising n input points xiRd and corresponding true response vectors yiRp . For each pair x, y in the data, in the feedforward phase,

Multilayer Perceptron: One Hidden Layer
649
Input Layer
1 x0
Hidden Layer
1 z0
Output Layer
bk
x1 z1 o1
w1k
wik
. .
wk1
zk wkj oj
wkp
. xi .
Figure 25.6. Multilayer perceptron with one hidden layer. Input and output links for neuron zk are shown in bold. Neurons x0 and z0 are bias neurons.
thepointxx1,x2,,xdT Rd issuppliedasaninputtotheMLP.Theinputneurons do not use any activation function, and simply pass along the supplied input values as their output value. This is equivalent to saying that the net at input neuron k is netkxk , and the activation function is the identity function f netk netk , so that the output value of neuron k is simply xk .
Given the input neuron values, we compute the output value for each hidden neuron zk as follows:
dzkfnetkf bk wikxi
i1
where f is some activation function, and wik denotes the weight between input neuron xi and hidden neuron zk . Next, given the hidden neuron values, we compute the value for each output neuron oj as follows:
moj fnetjf bjwij zi
i1
where wij denotes the weight between hidden neuron zi and output neuron oj .
We can write the feedforward phase as a series of matrixvector operations. For this we define the dm matrix Wh comprising the weights between input and hidden layer neurons, and vector bhRm comprising the bias terms for hidden layer neurons,
.
xd zm op
wdk
.

650
Neural Networks
given as
w11 w12w1m
w21 w22w2m Wh. ..
wd1 wd2wdm
b1 b2
bh.25.21 bm
where wij denotes the weight on the edge between input neuron xi and hidden neuron zj , and bi denotes the bias weight from x0 to zi . The net input and the output for all the hidden layer neurons can be computed via a matrixvector multiplication operation, as follows:
n e t hb hW hT x2 5 . 2 2zf nethf bhWhTx 25.23
Here, nethnet1, ,netmT denotes the net input at each hidden neuron excluding the bias neuron z0 whose value is always fixed at z01, and zz1,z2, ,zmT denotes the vector of hidden neuron values. The activation function fapplies to, or distributes over, each element of neth, i.e., f nethf net1, ,f netmTRm. Typically, all neurons in a given layer use the same activation function, but they can also be different if desired.
Likewise, let WoRmp denote the weight matrix between the hidden and output layers, and let boRp be the bias vector for output neurons, given as
w11 w12w1p b1 w21 w22w2p b2
Wo. ..bo.25.24 wm1 wm2wmp bp
where wij denotes the weight on the edge between hidden neuron zi and output neuron oj , and bi the bias weight between z0 and output neuron oi . The output vector can then be computed as follows:
n e t ob oW oT z2 5 . 2 5of netof boWoTz 25.26
To summarize, for a given input xD with desired response y, an MLP computes the output vector via the feedforward process, as follows:
ofb oW oT z fb oW oTfb hW hT x2 5 . 2 7where, oo1,o2, ,opT is the vector of predicted outputs from the single hidden
layer MLP.
25.3.2 Backpropagation Phase
Backpropagation is the algorithm used to learn the weights between successive layers in an MLP. The name comes from the manner in which the error gradient is propagated

Multilayer Perceptron: One Hidden Layer 651
backwards from the output to input layers via the hidden layers. For simplicity of exposition, we will consider backpropagation for an MLP with a single hidden layer with m neurons, with squared error function, and with sigmoid activations for all neurons. We will later generalize to multiple hidden layers, and other error and activation functions.
LetDdenotethetrainingdataset,comprisingninputpointsxi xi1,xi2,,xidTRd and corresponding true response vectors yiRp. Let WhRdm denote the weight matrix between the input and hidden layer, and bhRm the vector of bias terms for the hidden neurons from Eq. 25.21. Likewise, let WoRmp denote the weight matrix between the hidden and output layer, and boRp the bias vector for output neurons from Eq. 25.24.
For a given input pair x, y in the training data, the MLP first computes the output vector o via the feedforward step in Eq. 25.27. Next, it computes the error in the predicted output visavis the true response y using the squared error function
1 1p
Ex2yo22 yj oj2 25.28
j1
The basic idea behind backpropagation is to examine the extent to which an output neuron, say oj , deviates from the corresponding target response yj , and to modify the weights wij between each hidden neuron zi and oj as some function of the errorlarge error should cause a correspondingly large change in the weight, and small error should result in smaller changes. Likewise, the weights between all input and hidden neurons should also be updated as some function of the error at the output, as well as changes already computed for the weights between the hidden and output layers. That is, the error propagates backwards.
The weight update is done via a gradient descent approach to minimize the error. Let wij be the gradient of the error function with respect to wij , or simply the weight gradient at wij . Given the previous weight estimate wij , a new weight is computed by taking a small stepin a direction that is opposite to the weight gradient at wij
wij wij wij 25.29 Inasimilarmanner,thebiastermbj isalsoupdatedviagradientdescent
bj bj bj 25.30 wherebj isthegradientoftheerrorfunctionwithrespecttobj,whichwecallthebias
gradient at bj .
Updating Parameters Between Hidden and Output Layer
Consider the weight wij between hidden neuron zi and output neuron oj , and the bias term bj between z0 and oj . Using the chain rule of differentiation, we compute the weightgradientatwij andbiasgradientatbj,asfollows:
wijEx Ex netjjzi wij netj wij
bjExEx netjj bj netj bj
25.31

652 Neural Networks whereweusethesymbolj todenotethepartialderivativeoftheerrorwithrespectto
net signal at oj , which we also call the net gradient at oj
jEx 25.32
n e tj Furthermore,thepartialderivativeofnetj withrespecttowij andbj isgivenas
netjmnetjm
ww bjwkj zk zi bb bjwkj zk 1
ij ij k1 j j k1
whereweusedthefactthatbj andallwkj forkiareconstantswithrespecttowij. Next, we need to compute j , the net gradient at oj . This can also be computed via
the chain rule
jExEx fnetj 25.33netj f netj netj
Notethatfnetjoj.Thus,j iscomposedoftwoterms,namelythepartialderivative of the error term with respect to the output or activation function applied to the net signal, and the derivative of the activation function with respect to its argument. Using the squared error function and from Eq. 25.14, for the former, we have
ExEx1pfnetjoj oj 2 ykok2ojyj
k1
where we used the observation that all ok for kj are constants with respect to oj .
Since we assume a sigmoid activation function, for the latter, we have via Eq. 25.10
fnetj oj 1oj netj
Putting it all together, we get
j oj yjoj 1oj
Let o1,2,,pT denote the vector of net gradients at each output neuron, which we call the net gradient vector for the output layer. We can write o as
o o1ooy 25.34
wheredenotes the elementwise product also called the Hadamard product between the vectors, and where oo1,o2, ,opT is the predicted output vector, yy1,y2, ,ypT is the true response vector, and 11, ,1TRp is the pdimensional vector of all ones.
Letzz1,z2,,zmT denotethevectorcomprisingthevaluesofallhiddenlayer neurons after applying the activation function. Based on Eq. 25.31, we can compute thegradientswij forallhiddentooutputneuronconnectionsviatheouterproductof

Multilayer Perceptron: One Hidden Layer
653
z and o:
w11 w12 w21 w22
Wo. .wm1 wm2
w1p
w2p zT 25.35
where WoRmp is the matrix of weight gradients. The vector of bias gradients is given as:
bo b1,b2,,bpT o 25.36 Once the gradients have been computed, we can update all the weights and biases
.o wmp
wherebo Rp. as follows
WoWoWo bobobo
whereis the step size also called the learning rate for gradient descent.
Updating Parameters Between Input and Hidden Layer
25.37
Consider the weight wij between input neuron xi and hidden neuron zj , and the bias term between x0 and zj . The weight gradient at wij and bias gradient at bj is computed similarly to Eq. 25.31
wijEx Ex netjjxi wij netj wij
bjExEx netjj bj netj bj
25.38
which follows from
netjmnetjm
bjwkj xk xi bb bjwkj xk 1 ij ij k1 j j k1
ww
To compute the net gradient j at the hidden neuron zj we have to consider the error gradients that flow back from all the output neurons to zj . Applying the chain rule, we get:
jEx p Ex netkzj zj p Ex netk netj k1 netk zj netj netj k1 netk zj
p k1
zj 1zj
zj1zj , since we assume a sigmoid activation function for the hidden
k wjk
neurons. The chain rule leads to a natural interpretation for backpropagation, namely,
where zjn e tj

654
Neural Networks
Input Layer
1 x0
bj
xi wij
Hidden Layer
Output Layer
o1 1 .
ok k .
op p
p k1
zj
k wjk
wj 1 wj k wjp
Figure 25.7. Backpropagation of gradients from output to hidden layer.
tofindthenetgradientatzj wehavetoconsiderthenetgradientsateachoftheoutput neurons k but weighted by the strength of the connection wjk between zj and ok, as illustrated in Figure 25.7. That is, we compute the weighted sum of gradients pk1 kwj k , which is used to compute j , the net gradient at hidden neuron zj .
Let o1,2,,pT denote the vector of net gradients at the output neurons, and h1,2,,mT the net gradients at the hidden layer neurons. We can write h compactly as
h z1zWo o 25.39
whereis the elementwise product, 11,1, ,1Rm is the vector of all ones and zz1,z2, ,zmT is the vector of hidden layer outputs. Furthermore, WooRm is the vector of weighted gradients at each hidden neuron, since
p p p T Woo kw1k, kw2k, , kwmk
k1 k1 k1
Let xx1,x2, ,xdT denote the input vector, then based on Eq. 25.38 we can compute the gradients wij for all input to hidden layer connections via the outer product:
w11 w1m
w21w2mxT 25.40
Wh..h wd1wdm
where WhRdm is the matrix of weight gradients. The vector of bias gradients is given as:
bh b1,b2,,bmT h 25.41
where bhRm.

Multilayer Perceptron: One Hidden Layer 655
Algorithm 25.1: MLP Training: Stochastic Gradient Descent MLPTRAINING D,m,,maxiter:
Initialize bias vectors
1 bhrandom mdimensional vector with small values
2 borandom pdimensional vector with small values
Initialize weight matrices
3 Whrandom dm matrix with small values
4 Worandom mp matrix with small values
5 t0iteration counter
6 repeat
9 o if b oW oT z i
7 foreach xi , yi D in random order doFeedforward phase
8 zi fbhWTxi h
Backpropagation phase: net gradients
10 o oi1oioi yi
11 hzi 1ziWoo
Gradient descent for bias vectors
12 boo; bobobo
13 bhh; bhbhbh
Gradient descent for weight matrices
14 WoziTo; WoWoWo
15 WhxiTh; WhWhWh
16 tt1
17 until tmaxiter
Once the gradients have been computed, we can update all the weights and biases as follows
WhWhWh bhbhbh
whereis the step size or learning rate. 25.3.3 MLP Training
25.42
Algorithm 25.1 shows the pseudocode for learning the weights considering all of the input points in D via stochastic gradient descent. The code is shown for an MLP with a single hidden layer, using a squared error function, and sigmoid activations for all hidden and output neurons. The approach is called stochastic gradient descent since we compute the weight and bias gradients after observing each training point in random order.
The MLP algorithm takes as input the dataset D with points xi and desired responsesyi fori1,2,,n,thenumberofhiddenlayerneuronsm,thelearningrate

656 Neural Networks
, and an integer threshold maxiter that specifies the maximum number of iterations. The size of the input d and output p layers is determined automatically from D. The MLP first initializes the dm input to hidden layer weight matrix Wh, and the mp hidden to output layer matrix Wo to small values, for example, uniformly random in the range 0.01,0.01. It is important to note that weights should not be set to 0, otherwise, all hidden neurons will be identical in their values, and so will be the output neurons.
The MLP training takes multiple iterations over the input points. For each input xi, the MLP computes the output vector oi via the feedforward step. In the backpropagation phase, we compute the error gradient vector o with respect to the net at output neurons, followed by h for hidden neurons. In the stochastic gradient descent step, we compute the error gradients with respect to the weights and biases, which are used to update the weight matrices and bias vectors. Thus, for every input vector xi , all the weights and biases are updated based on the error incurred between the predicted outputoi andthetrueresponseyi.Aftereachinputhasbeenprocessed,thatcompletes one iteration of training, called an epoch. Training stops when the maximum number of iterations, maxiter, has been reached. On the other hand, during testing, for any input x, we apply the feedforward steps and print the predicted output o.
In terms of computational complexity, each iteration of the MLP training algorithm takes Odmmp time for the feedforward phase, pmpmOmp time for the backpropagation of error gradients, and Odmmp time for updating the weight matrices and bias vectors. Thus, the total training time per iteration is Odmmp.
Example 25.3 MLP with one hidden layer. We now illustrate an MLP with a hid den layer using a nonlinear activation function to learn the sine curve. Figure 25.8a shows the training data the gray points on the curve, which comprises n25 points xi sampled randomly in the range 10, 10, with yisinxi . The testing data comprises 1000 points sampled uniformly from the same range. The figure also shows the desired output curve thin line. We used an MLP with one input neuron d1, ten hidden neurons m10 and one output neuron p1. The hidden neurons use tanh activations, whereas the output unit uses an identity activation. The step size is 0.005.
The input to hidden weight matrix WhR110 and the corresponding bias vector bhR101 are given as:
Wh 0.68,0.77,0.42,0.72,0.93,0.42,0.66,0.70,0.62,0.50 bh 4.36,2.43,0.52,2.351.64,3.98,0.31,4.45,1.03,4.77T
The hidden to output weight matrix WoR101 and the bias term boR are given as: Wo 1.82,1.69,0.82,1.37,0.14,2.37,1.64,1.92,0.78,2.17T
bo0.16
Figure 25.8a shows the output of the MLP on the test set, after the first iteration of training t1. We can see that initially the predicted response deviates

1.0 0.5 0 0.5 1.0 1.5 2.0 2.5
1.00 0.75 0.50 0.25
0 0.25 0.50 0.75 1.00 1.25
1.00 0.75 0.50 0.25
0 0.25 0.50 0.75 1.00 1.25

1.000.75

25.3 Multilayer Perceptron: One Hidden Layer
657

0 0.25 0.50

a t1
1.25 1.00 0.75 0.50 0.25
0 0.25 0.50 0.75 1.00 1.25

b t1000

0.75
1.00

1.25 XX
12 10 8 6 4 2 0 2 4 6 8 10 12 e t15000
12 10 8 6 4 2 0 2 4 6 8 10 12 f t30000

12 10 8 6 4 2 0 2 4 6 8 10 12 XX

0.50 0.25 0 0.25 0.50 0.75 1.00 1.25

12 10 8 6 4 2 0 2 4 6 8 10 12

12 10 8 6 4 2 0 2 4 6 8 10 12 XX
c t5000
12 10 8 6 4 2 0 2 4 6 8 10 12 d t10000

1.00 0.75 0.50 0.25

2.0 1.5 1.0 0.5
0 0.5 1.0 1.5 2.0 2.5

20 15 10 5
g Test range 20,20
15 20
0 5 10
X
Figure 25.8. MLP for sine curve: 10 hidden neurons with hyperbolic tangent activation functions. The gray dots represent the training data. The bold line is the predicted response, whereas the thin line is the true response. af: Predictions after different number of iterations. g: Testing outside the training range. Good fit within the training range 10, 10 shown in the box.

Y
YYY
YYY

658 Neural Networks
significantly from the true sine response. Figure 25.8af show the output from the MLP after different number of training iterations. By t15000 iterations the output on the test set comes close to the sine curve, but it takes another 15000 iterations to get a closer fit. The final SSE is 1.45 over the 1000 test points.
We can observe that, even with a very small training data of 25 points sampled randomly from the sine curve, the MLP is able to learn the desired function. However, it is also important to recognize that the MLP model has not really learned the sine function; rather, it has learned to approximate it only in the specified range 10, 10. We can see in Figure 25.8g that when we try to predict values outside this range, the MLP does not yield a good fit.
Example 25.4 MLP for handwritten digit classification. In this example, we apply an MLP with one hidden layer for the task of predicting the correct label for a handwritten digit from the MNIST database, which contains 60,000 training images that span the 10 digit labels, from 0 to 9. Each grayscale image is a 2828 matrix of pixels, with values between 0 and 255. Each pixel is converted to a value in the interval 0, 1 by dividing by 255. Figure 25.9 shows an example of each digit from the MNIST dataset.
Since images are 2dimensional matrices, we first flatten them into a vector xR784 with dimensionality d2828784. This is done by simply concatenating all of the rows of the images to obtain one long vector. Next, since the output labels are categorical values that denote the digits from 0 to 9, we need to convert them into binary numerical vectors, using onehot encoding. Thus, the label 0 is encoded as e11,0,0,0,0,0,0,0,0,0TR10, the label 1 as e20,1,0,0,0,0,0,0,0,0TR10, and so on, and finally the label 9 is encoded as e100,0,0,0,0,0,0,0,0,1TR10. That is, each input image vector x has a corresponding target response vector ye1,e2,,e10.Thus,theinputlayerfortheMLPhasd784neurons,andtheoutput layer has p10 neurons.
For the hidden layer, we consider several MLP models, each with a different number of hidden neurons m. We try m0,7,49,98,196,392, to study the effect of increasing the number of hidden neurons, from small to large. For the hidden layer, we use ReLU activation function, and for the output layer, we use softmax activation, since the target response vector has only one neuron with value 1, with the rest being 0. Note that m0 means that there is no hidden layerthe input layer is directly connected to the output layer, which is equivalent to a multiclass logistic regression model. We train each MLP for t15 epochs, using step size 0.25.
During training, we plot the number of misclassified images after each epoch, on the separate MNIST test set comprising 10,000 images. Figure 25.10 shows the number of errors from each of the models with a different number of hidden neurons m, after each epoch. The final test error at the end of training is given as
m
0
7
10
49
98
196
392
errors
1677
901
792
546
495
470
454

Multilayer Perceptron: One Hidden Layer 659 00000
55555
10 10 10 10 10
15 15 15 15 15
20 20 20 20 20
25 25 25 25 25
0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25
a label 0 b label 1 c label 2 d label 3 e label 4
00000
55555
10 10 10 10 10
15 15 15 15 15
20 20 20 20 20
25 25 25 25 25
0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25
f label 5 g label 6 h label 7 i label 8 j label 9 Figure 25.9. MNIST dataset: Sample handwritten digits.
3,000
2,500
2,000
1,500
1,000
500
We can observe that adding a hidden layer significantly improves the prediction accuracy. Using even a small number of hidden neurons helps, compared to the logistic regression model m0. For example, using m7 results in 901 errors or error rate 9.01 compared to using m0, which results in 1677 errors or error rate 16.77. On the other hand, as we increase the number of hidden neurons, the error rate decreases, though with diminishing returns. Using m196, the error rate is 4.70, but even after doubling the number of hidden neurons m392, the error rate goes down to only 4.54. Further increasing m does not reduce the error rate.
errors
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 epochs
Figure 25.10. MNIST: Prediction error as a function of epochs.
m0
m7 m10 m49 m98 m196 m392

660 Neural Networks 25.4 DEEP MULTILAYER PERCEPTRONS
We now generalize the feedforward and backpropagation steps for many hidden layers, as well as arbitrary error and neuron activation functions.
Consider an MLP with h hidden layers as shown in Figure 25.11. We assume that the input to the MLP comprises n points xiRd with the corresponding true response vector yiRp . We denote the input neurons as layer l0, the first hidden layer as l1, the last hidden layer as lh, and finally the output layer as layer lh1. We use nl to denote the number of neurons in layer l. Since the input points are ddimensional, this implies n0d, and since the true response vector is pdimensional, we have nh1p. The hidden layers have n1 neurons for the first hidden layer, n2 for the second layer, and nh for the last hidden layer. The vector of neuron values for layer l for l0, ,h1 is denoted as
z l z 1l ,, z nlT l
Each layer except the output layer has one extra bias neuron, which is the neuron at
index 0. Thus, the bias neuron for layer l is denoted z0l
and its value is fixed at z0l1.
l0 1 x0
x 1 . x i . x d
l 0
W0,b0
l1 1 z01
z 1 1 . z i1 .
l2 1 z02
z 12 . z i2 .

1
.

lh 1 z0h
z 1h . z ih .
lh1
o 1 . o i . o p
l h1
Wh,bh
.
z n1 z n2z nh
12h
a detail view
l 1 l 2 l h
W1,b1
b layer view
Figure 25.11. Deep multilayer perceptron, with h hidden layers.
x
z1
z2

zh
o

Deep Multilayer Perceptrons 661
Figure 25.11a displays a detailed view of an MLP with h hidden layers, showing the individual neurons in each layer including the bias neuron. Note that the vector of input neuron values is also written as
xx1,x2,,xdT z10,z20,,zd0T z0 and the vector of output neuron values is also denoted as
oo1,o2, ,opTzh1,zh1, ,zh1Tzh1 12p
Theweightmatrixbetweenlayerlandlayerl1neuronsisdenotedWl Rnlnl1, andthevectorofbiastermsfromthebiasneuronz0l toneuronsinlayerl1isdenoted blRnl1 , for l0,1, ,h. Thus, W0Rdn1 is the weight matrix between the input and first hidden layer, W1Rn1n2 is the weight matrix between the first and second hidden layer, and so on; finally, WhRnh p is the weight matrix between the last hidden layerandtheoutputlayer.Forthebiasvectors,b0Rn1 specifiesthebiasesforneurons inthefirsthiddenlayer,b1Rn2 thebiasesforneuronsinthesecondhiddenlayer,and so on. Thus, bhRp specifies the biases for the output neurons. Figure 25.11b shows a layer view for an MLP with h hidden layers. This is a more compact representation that clearly specifies the architecture or topology of the MLP. Each layer l is represented as a rectangular node, and is marked with the vector of neuron values, zl. The bias neurons are not shown, but are assumed to be present in each layer except the output. An edge between layer l and l1 is labeled with the weight matrix Wl and bias vector bl that specify the parameters between those layers.
For training deep MLPs, we will refer to several partial derivative vectors, described next. Define il as the net gradient, i.e., the partial derivative of the error function with respect to the net value at zil
ilExneti
andletl denotethenetgradientvectoratlayerl,forl1,2,,h1 l1l,,nl T
l
25.43
25.44
Let f l denote the activation function for layer l, for l0,1, ,h1, and further let fl denote the vector of the derivatives of the activation function with respect to neti for all neurons zil in layer l:
l flnet1 flnetnlT
fnet ,, net 25.45
1 nl
Finally, letEx denote the vector of partial derivatives of the error function with respect
to the values oi for all output neurons:
Ex Ex Ex T
Ex o,o,,o 25.46
12p

662 Neural Networks 25.4.1 Feedforward Phase
Typically in a deep MLP, the same activation function f l is used for all neurons in a given layer l. The input layer always uses the identity activation, so f 0 is the identity function. Also, all bias neurons also use the identity function with a fixed value of 1. The output layer typically uses sigmoid or softmax activations for classification tasks, or identity activations for regression tasks. The hidden layers typically use sigmoid, tanh, or ReLU activations. In our discussion, we assume a fixed activation function f l for all neurons in a given layer. However, it is easy to generalize to a neuron specific activation function f l for neuron zl in layer l.
feedforward process:
of h1b hW hTz h
fh1b WTfhb WT zh1 h h h1 h1
ii
For a given input pair x, yD, the deep MLP computes the output vector via the
.
fh1b WT fhb WT fh1f2b WT f1b WT x
h h h1 h1 1 1 0 0 Note that each f l distributes over its argument. That is,
f lb l1W Tl1xf ln e t 1, f ln e t 2,, f ln e t n l T 25.4.2 Backpropagation Phase
Consider the weight update between a given layer and another, including between the
input and hidden layer, or between two hidden layers, or between the last hidden layer
and the output layer. Let zl be a neuron in layer l, and zl1 a neuron in the next layer ij
l1.Letwl betheweightbetweenzl andzl1,andletbl denotethebiastermbetween ij ijj
zl andzl1.Theweightandbiasareupdatedusingthegradientdescentapproach 0j
wl wl l blbll ij ij wij j j bj
where wl is the weight gradient and bl is the bias gradient, i.e., the partial derivative ij j
of the error function with respect to the weight and bias, respectively:
l Ex lEx
wij wl bj
ij j
As noted earlier in Eq. 25.31, we can use the chain rule to write the weight and bias gradient, as follows
l Ex Ex netjl1zlzll1 wij wl netj wl j i i j
l ExEx netj l1 b jb jln e t jb jl j
bl
ij ij

Deep Multilayer Perceptrons 663 where l1 is the net gradient Eq. 25.43, i.e., the partial derivative of the error
j
function with respect to the net value at zl1, and we have j
nnn e t jln e t jl
bl wl zl zl
wl wl j kj k i bl bl j kj k
bl wl zl 1 ij ij k0 j j k0
Given the vector of neuron values at layer l, namely zlz1l , ,znl T, we can l
compute the entire weight gradient matrix via an outer product operation
and the bias gradient vector as:
Wl zll1T 25.47 bl l1 25.48
withl0,1,,h.Herel1 isthenetgradientvectoratlayerl1Eq.25.44. This also allows us to update all the weights and biases as follows
gradients for layer l we need to compute the net gradients l1 at layer l1. 25.4.3 Net Gradients at Output Layer
Let us consider how to compute the net gradients at the output layer h1. If all of the output neurons are independent for example, when using linear or sigmoid activations, the net gradient is obtained by differentiating the error function with respect to the net signal at the output neurons. That is,
h1 ExEx fh1netjEx fh1netj j netj f h1netjnetj oj netj
Thus, the gradient depends on two terms, the partial derivative of the error function with respect to the output neuron value, and the derivative of the activation function with respect to its argument. The net gradient vector across all output neurons is given as
h1 fh1Ex 25.50
whereis the elementwise or Hadamard product, fh1 is the vector of derivatives of the activation function with respect to its argument Eq. 25.45 at the output layer lh1, andE x is the vector of error derivatives with respect to the output neuron values Eq. 25.46.
Wl Wl Wl bl bl bl
25.49 whereis the step size. However, we observe that to compute the weight and bias

664 Neural Networks
On the other hand, if the output neurons are not independent for example, when using a softmax activation, then we have to modify the computation of the net gradient at each output neuron as follows:
h1 Ex p Ex fh1neti j netj i1 fh1neti netj
Across all output neurons, we can write this compactly as follows:
h1 Fh1 Ex 25.51
whereFh1 isthematrixofderivativesofoi fh1netiwithrespecttonetj forall
i,j1,2, ,p, given as
o1 o1 o1
net1 net2 netp
o2 o2o2net1 net2 netp
Typically, for regression tasks, we use the squared error function with linear activation function at the output neurons, whereas for logistic regression and classification, we use the crossentropy error function with a sigmoid activation for binary classes, and softmax activation for multiclass problems. For these common cases, the net gradient vector at the output layer is given as follows:
F
h1
. .
op op op
SquaredError: FromEq.25.15,theerrorgradientisgivenas ExEx oy
.
net1net2netp

o The net gradient at the output layer is given as
h1 fh1 Ex
where fh1 depends on the activation function at the output. Typically, for regression tasks, we use a linear activation at the output neurons. In that case, we havef h11 see Eq. 25.8.
CrossEntropyErrorbinaryoutput,sigmoidactivation: Considerthebinarycasefirst, with a single output neuron o with sigmoid activation. Recall that the binary crossentropy error Eq. 25.19 is given as
Ex ylno1yln1o From Eq. 25.20 we have
ExEx oy o o1o

Deep Multilayer Perceptrons 665 Further, for sigmoid activation, we have
fh1fneto o1oneto
Therefore, the net gradient at the output neuron is
h1 Ex fh1oy o1ooy
o1o
CrossEntropy Error K outputs, softmax activation: Recall that the crossentropy
error function Eq. 25.16 is given as

K i1
Ex Ex ExTy1 y2 yKT Exo ,o ,, oo ,o ,,o
Ex
Using Eq. 25.17, the vector of error derivatives with respect to the output neurons
is given as
12K12K where pK is the number of output neurons.
Crossentropy error is typically used with the softmax activation so that we get a normalized probability value for each class. That is,
yi lnoi y1 lno1yK lnoK
expnetjojsoftmaxnetj
so that the output neuron values sum to one, Kj 1 oj1. Since an output neuron depends on all other output neurons, we need to compute the matrix of derivatives of each output with respect to each of the net signals at the output neurons see Equations 25.12 and 25.18:
o1 o1 o1
net1 net2 netKo1 1o1
o1 o2o2 1o2
o2 oK
Ki1 expneti
o2 o2o2 o1 o2
o1 oK
.
Fh1 net1 net2. .
netK . . .
.
.
. . ..o1oK o2oKoK1oK
oK oKoKnet1net2netK

666
Neural Networks
Therefore, the net gradient vector at the output layer is h1 Fh1 Ex

o1 1o1
. . . . . .o1 oK o2 oK
o1 o2o 1o
25.52y1
2Ko2 . . . . . .
o o
12 2 2
o1 oK o o
o1
y2
y1 o1 Ki1 yi .K .K
yKyKoK iKyioK yKoK i1yi
y1 o1
y 2o 2 K
. ,since yK oK
oy
25.4.4 Net Gradients at Hidden Layers
Let us assume that we have already computed the net gradients at layer l1, namely
l1.Sinceneuronzjl inlayerlisconnectedtoalloftheneuronsinlayerl1except
for the bias neuron zl1, to compute the net gradient at zl , we have to account for the 0j
error from each neuron in layer l1, as follows:
n
i1
yi 1
oK 1oK yK oK
y1 y1 o1 Ki1 yi o1
y2y2o2Ki2yi o2 y2o2Ki1yi
l
E l1 E net f net
jlx x k j

k1
k1 netk flnetj netj
netj
Sothenetgradientatzjl inlayerldependsonthederivativeoftheactivationfunction
with respect to its netj , and the weighted sum of the net gradients from all the neurons
zl1 at the next layer l1. k
We can compute the net gradients for all the neurons in level l in one step, as follows:
l flWl l1 25.53
whereis the elementwise product, and fl is the vector of derivatives of the activation function with respect to its argument Eq. 25.45 at layer l. For the commonly used activation functions at the hidden layer, using the derivatives from
n l fnetj l1
netj
l1 wl k jk

Deep Multilayer Perceptrons
667
Section 25.1.1, we have
For ReLU, we have to apply Eq. 25.9 to each neuron. Note that softmax is generally not used for hidden layers.
The net gradients are computed recursively, starting from the output layer h1, then hidden layer h, and so on, until we finally compute the net gradients at the first hidden layer l1. That is,
hfh Whh1 h1 fh1Wh1 hfh1Wh1 fhWh h1
.
1 f1W1 f2W2 fhWh h1
25.4.5 Training Deep MLPs
Algorithm 25.2 shows the pseudocode for learning the weights and biases for a deep MLP. The inputs comprise the dataset D, the number of hidden layers h, the step size or learning rate for gradient descent , an integer threshold maxiter denoting the number of iterations for training, parameters n1,n2, ,nh that denote the number of neurons excluding bias, which will be added automatically for each of the hidden layers l1,2,,h,andthetypeofactivationfunctionsf1,f2,,fh1 foreachofthelayers other than the input layer that uses identity activations. The size of the input d and output p layers is determined directly from D.
The MLP first initializes the nlnl1 weight matrices Wl between layers l and l1 with small values chosen uniformly at random, e.g., in the range 0.01, 0.01. The MLP considers each input pair xi , yi D, and computes the predicted response oi via the feedforward process. The backpropagation phase begins by computing the error between oi and true response yi, and computing the net gradient vector h1 at the output layer. These net gradients are backpropagated from layer h1 to layer h, from h to h1, and so on until we obtain the net gradients at the first hidden layer l1.ThesenetgradientsareusedtocomputetheweightgradientmatrixWl atlayer l, which can in turn be used to update the weight matrix Wl . Likewise, the net gradients specify the bias gradient vector bl at layer l, which is used to update bl. After each point has been used to update the weights, that completes one iteration or epoch of training. The training stops when maxiter epochs have been reached. On the other hand, during testing, for any input x, we apply the feedforward steps and print the predicted output o.
It is important to note that Algorithm 25.2 follows a stochastic gradient descent approach, since the points are considered in random order, and the weight and bias gradients are computed after observing each training point. In practice, it is common
1
flzl1zl
for linear
for sigmoid 1zl zl for tanh

668
Neural Networks
Algorithm 25.2: Deep MLP Training: Stochastic Gradient Descent
1 2
3 4 5
6 7 8
9 10
11 12
13 14 15 16
17 18
19 20 21
22 23 24
25 26
DEEPMLPTRAINING D,h,,maxiter,n1,n2, ,nh,f 1,f 2, ,f h1: n0dinput layer size
nh1poutput layer size
Initialize weight matrices and bias vectors
for l0,1,2, ,h do
blrandom nl1 vector with small values
Wlrandom nlnl1 matrix with small values
t0iteration counter repeat
foreach xi , yi D in random order doFeedForward Phase
z0xi
for l0,1,2,,h do
zl1 fl1bl WTl zl
oizh1
Backpropagation Phase if independent outputs then
h1fh1Exi net gradients at output else
Gradient Descent Step
h1Fh1 Exi net gradients at output for lh, h1,, 1 do
lfl Wll1netgradientsatlayerl
for l0,1, ,h do
Wl zll1Tweightgradientmatrixatlayerl bl l1biasgradientvectoratlayerl
for l0,1, ,h do
WlWlWlupdate Wl blbl blupdate bl
tt1 until tmaxiter
to update the gradients by considering a fixed sized subset of the training points called a minibatch instead of using single points. That is, the training data is divided into minibatches using an additional parameter called batch size, and a gradient descent step is performed after computing the bias and weight gradient from each minibatch. This helps better estimate the gradients, and also allows vectorized matrix operations over the minibatch of points, which can lead to faster convergence and substantial speedups in the learning.

Deep Multilayer Perceptrons 669
One caveat while training very deep MLPs is the problem of vanishing and exploding gradients. In the vanishing gradient problem, the norm of the net gradient can decay exponentially with the distance from the output layer, that is, as we backpropagate the gradients from the output layer to the input layer. In this case the network will learn extremely slowly, if at all, since the gradient descent method will make minuscule changes to the weights and biases. On the other hand, in the exploding gradient problem, the norm of the net gradient can grow exponentially with the distance from the output layer. In this case, the weights and biases will become exponentially large, resulting in a failure to learn. The gradient explosion problem can be mitigated to some extent by gradient thresholding, that is, by resetting the value if it exceeds an upper bound. The vanishing gradients problem is more difficult to address. Typically sigmoid activations are more susceptible to this problem, and one solution is to use alternative activation functions such as ReLU. In general, recurrent neural networks, which are deep neural networks with feedback connections, are more prone to vanishing and exploding gradients; we will revisit these issues in Section 26.2.
Example 25.5 Deep MLP. We now examine deep MLPs for predicting the labels for the MNIST handwritten images dataset that we considered in Example 25.4. Recall that this dataset has n60000 grayscale images of size 2828 that we treat as d784 dimensional vectors. The pixel values between 0 and 255 are converted to the range 0 and 1 by dividing each value by 255. The target response vector is a onehot encoded vector for class labels 0,1,,9. Thus, the input to the MLP xi has dimensionality d784, and the output layer has dimensionality p10. We use softmax activation for the output layer. We use ReLU activation for the hidden layers, and consider several deep models with different number and sizes of the hidden layers. We use step size 0.3 and train for t15 epochs. Training was done using minibatches, using batch size of 1000.
During the training of each of the deep MLPs, we evaluate its performance on the separate MNIST test datatset that contains 10,000 images. Figure 25.12 plots the
n1392
n1 196,n2 49
n1392,n2196,n349
n1392,n2196,n398,n449
5,000
4,000
3,000
2,000
1,000
epochs
Figure 25.12. MNIST: Deep MLPs; prediction error as a function of epochs.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
errors

670 Neural Networks
number of errors after each epoch for the different deep MLP models. The final test error at the end of training is given as
n1392
n1 196,n2 49
n1392,n2196,n349
n1392,n2196,n398,n449
We can observe that as we increase the number of layers, we do get performance improvements. The deep MLP with four hidden layers of sizes n1392, n2196, n398, n449 results in an error rate of 2.78 on the training set, whereas the MLP with a single hidden layer of size n1392 has an error rate of 3.96. Thus, the deeper MLP significantly improves the prediction accuracy. However, adding more layers does not reduce the error rate, and can also lead to performance degradation.
hidden layers
errors
396
303
290
278
25.5 FURTHER READING
Artificial neural networks have their origin in the work of McCulloch and Pitts 1943. The first application of a single neuron, called a perceptron, to supervised learning was by Rosenblatt 1958. Minsky and Papert 1969 pointed out limitations of perceptrons, which were not overcome until the development of the backpropagation algorithm, which was introduced by Rumelhart, Hinton, and Williams 1986 to train general multilayer perceptrons.
McCulloch, W. S. and Pitts, W. 1943. A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5 4, 115133.
Minsky, M. and Papert, S. 1969. Perceptron: An Introduction to Computational Geometry. Cambridge, MA: The MIT Press.
Rosenblatt, F. 1958. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65 6, 386.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. 1986. Learning representations by backpropagating errors. Nature, 323 6088, 533.
25.6 EXERCISES
Q1. Consider the neural network in Figure 25.13. Let bias values be fixed at 0, and let the weight matrices between the input and hidden, and hidden and output layers, respectively, be:
Ww1,w2,w31,1,1 Ww1 ,w2 ,w3 T0.5,1,2T
Assume that the hidden layer uses ReLU, whereas the output layer uses sigmoid activation. Assume SSE error. Answer the following questions, when the input is x4 and the true response is y0:

Exercises
671
z1
w1
x w2 z2 w2 o
w1
w3 Figure 25.13. Neural network for Q1.
a Useforwardpropagationtocomputethepredictedoutput. b Whatisthelossorerrorvalue?
c Computethenetgradientvectorofortheoutputlayer.
d Computethenetgradientvectorhforthehiddenlayer.
e ComputetheweightgradientmatrixW betweenthehiddenandoutputlayers.
f ComputetheweightgradientmatrixWbetweentheinputandhiddenlayers.
Q2. Show that the derivative of the sigmoid function Eq. 25.5 with respect to its
argument is given as
fz fz1fz z
Q3. ShowthatthederivativeofthehyperbolictangentfunctionEq.25.6withrespect to its argument is given as
fz 1fz2 z
Q4. Showthatthederivativeofthesoftmaxfunctionisgivenas fzizfzi1fzi ifji
w3
z3
zj f zi f zjif ji
Q5. Derive an expression for the net gradient vector at the output neurons when using
where zz1,z2, ,zp.
softmax activation with the squared error function.
Q6. Showthatiftheweightmatrixandbiasvectorsareinitializedtozero,thenallneurons in a given layer will have identical values in each iteration.
Q7. Provethatwithlinearactivationfunctions,amultilayerednetworkisequivalenttoa singlelayered neural network.
Q8. Compute the expression for the net gradient vector at the output layer, h1, assuming crossentropy error, Ex Ki1 yilnoi , with K independent binary output neurons that use sigmoid activation, that is, oisigmoidneti .
Q9. GivenanMLPwithonehiddenlayer,derivetheequationsforhandousingvector derivatives, i.e., by computing Ex and Ex , where neth and neto are the net input
nethneto vectors at the hidden and output layers.

CHAPTER 26 Deep Learning
In this chapter, we first look at deep neural networks that include feedback from one layer to another. Such network are called recurrent neural networks RNNs, and they can typically be trained by unfolding or unrolling the recurrent connections, resulting in deep networks whose parameters can be learned via the backpropagation algorithm. Since RNNs are prone to the vanishing and exploding gradients problem, we next consider gated RNNs that introduce a new type of layer that endows the network with the ability to selectively read, store, and write the hidden state via an internal memory layer. Gated RNNs are highly effective for prediction tasks on sequence inputs. Finally, we consider convolutional neural networks CNNs that are deep MLPs that exploit spatial or temporal relationships between different elements of each layer to construct a hierarchy of features that can be used for regression or classification tasks. Unlike regular MLPs whose layers are fully connected, CNNs have layers that are localized and sparse. In particular, CNNs are highly effective for image inputs.
26.1 RECURRENT NEURAL NETWORKS
Multilayer perceptrons are feedforward networks in which the information flows in only one direction, namely from the input layer to the output layer via the hidden layers. In contrast, recurrent neural networks RNNs are dynamically driven e.g., temporal, with a feedback loop between two or more layers, which makes such networks ideal for learning from sequence data. For example, Figure 26.1 shows a simple RNN where there is a feedback loop from the hidden layer ht to itself via a temporal delay of one time unit, denoted by the 1 on the loop.
Let Xx1,x2, ,xdenote a sequence of vectors, where xtRd is a ddimensional vector t1,2, ,. Thus, X is an input sequence of length , with xt denoting the input at time step t. Let Yy1,y2, ,ydenote a sequence of vectors, with ytRp a pdimensional vector. Here, Y is the desired target or response sequence, with yt denoting the response vector at time t. Finally, let Oo1,o2, ,odenote the predicted or output sequence from the RNN. Here otRp is also a pdimensional vector to match the corresponding true response yt . The task of an RNN is to learn a function that predicts the target sequence Y given the input sequence X . That is, the
672

Recurrent Neural Networks
673
Wh,bh 1
xt
ht
Wo,bo
Figure 26.1. Recurrent neural network. Recurrent connection shown in gray.
predicted output ot on input xt should be similar or close to the target response yt , for each time point t.
To learn dependencies between elements of the input sequence, an RNN maintains a sequence of mdimensional hidden state vectors htRm , where ht captures the essential features of the input sequences up to time t, as illustrated in Figure 26.1. The hidden vector ht at time t depends on the input vector xt at time t and the previous hidden state vector ht 1 from time t1, and it is computed as follows:
h tf hW i T x tW hT h t1b h 2 6 . 1
Here, f h is the hidden state activation function, typically tanh or ReLU. Also, we need an initial hidden state vector h0 that serves as the prior state to compute h1. This is usually set to the zero vector, or seeded from a prior RNN prediction step. The matrix WiRdm specifies the weights between the input vectors and the hidden state vectors. The matrix WhRmm specifies the weight matrix between the hidden state vectors at time t1 and t , with bhRm specifying the bias terms associated with the hidden states. Note that we need only one bias vector bh associated with the hidden state neurons; we do not need a separate bias vector between the input and hidden neurons.
Given the hidden state vector at time t, the output vector ot at time t is computed as follows:
ot foWoTht bo 26.2
Here, WoRmp specifies the weights between the hidden state and output vectors, with bias vector bo . The output activation function f o typically uses linear or identity activation, or a softmax activation for onehot encoded categorical output values.
It is important to note that all the weight matrices and bias vectors are independent of the time t. For example, for the hidden layer, the same weight matrix Wh and bias vector bh is used and updated while training the model, over all time steps t. This is an example of parameter sharing or weight tying between different layers or components of a neural network. Likewise, the input weight matrix Wi , the output weight matrix Wo and the bias vector bo are all shared across time. This greatly reduces the number of parameters that need to be learned by the RNN, but it also relies on the assumption that all relevant sequential features can be captured by the shared parameters.
Figure 26.1 shows an illustration of an RNN, with a recurrent hidden layer ht , denoted by the feedback loop with a time delay of 1 noted on the loop. The figure also shows the shared parameters between the layers. Figure 26.2a shows the same
Wi
ot

674
Deep Learning
t 0
t 1
t 2

t1
Wo,bo
Wh,bh
Wi
t
Wo,bo
Wi
ot
o1
o2

o 1
o
Wh,bh
Wo,bo 1
Wi
a RNN
Wo,bo
Wh,bh Wi
Wo,bo
Wi
ht
h0
h1
h2

h 1
h
Wh,bh
xt
x1
x2
b RNN: unfolded in time Figure 26.2. RNN unfolded in time.
RNN in vertical format, where the layers are stacked vertically, and Figure 26.2b shows the RNN with the layers unfolded in time; it shows the input xt , hidden ht , and output otlayers at each time step t . We can observe that the feedback loop has been unfolded fortime steps, starting with t1 and ending at t . The time step t0 is used to denote the previous or initial hidden state h0. We can also explicitly observe the parameter sharing across time, since all weight matrices and bias vectors are independent of t. Figure 26.2b makes it clear that an RNN is a deep neural network withlayers, whereis the maximum input sequence length.
The training data for the RNN is given as DXi,Yini1, comprising n input sequences Xi and the corresponding target response sequences Yi, with sequence length i . Given each pair X , Y D, with Xx1 , x2 ,, xand Yy1 , y2 ,, y , the RNN has to update the model parameters Wi,Wh,bh,Wo,bo for the input, hidden and output layers, to learn the corresponding output sequence Oo1 , o2 ,, o . For training the network, we compute the error or loss between the predicted and response vectors over all time steps. For example, the squared error loss is given as
1
EXExt2ytot 2
t1 t1
On the other hand, if we use a softmax activation at the output layer, then we use the
crossentropy loss, given as

x 1
x
p
EXExtyti lnoti
t1 t1 i1
where ytyt1,yt2, ,ytpTRp and otot1,ot2, ,otpTRp. On training input of lengthwe first unfold the RNN forsteps, following which the parameters can be learned via the standard feedforward and backpropagation steps, keeping in mind the connections between the layers.

Recurrent Neural Networks 675 26.1.1 FeedforwardinTime
The feedforward process starts at time t0, taking as input the initial hidden state vector h0, which is usually set to 0 or it can be userspecified, say from a previous prediction step. Given the current set of parameters, we predict the output ot at each time step t1,2, , via Eq. 26.1 and Eq. 26.2:
o tf oW oT h tb o
f oW oT f hW i T x tW hT h t1b h b o
ht.
foWoTfhWiTxt WhTfhfhWiTx1 WhTh0 bhbhbo
h1
We can observe that the RNN implicitly makes a prediction for every prefix of the inputsequence,sinceot dependsonallthepreviousinputvectorsx1,x2,,xt,butnot on any future inputs xt1, ,x .
26.1.2 Backpropagation in Time
Once the feedforward phase computes the output sequence Oo1 , o2 ,, o , we can compute the error in the predictions using the squared error or crossentropy loss function, which can in turn be used to compute the net gradient vectors that are backpropagated from the output layers to the input layers for each time step.
For the backpropagation step it is easier to view the RNN in terms of the distinct layers based on the dependencies, as opposed to unfolding in time. Figure 26.3a shows the RNN unfolded using this layer view as opposed to the time view shown in Figure 26.2b. The first layer is l0 comprising the hidden states h0 and the input vector x1, both of which are required to compute h1 in layer l1, which also includes the input vector x2. In turn, both h1 and x2 are required to compute h2 in layer l2, and so on for other layers. Note also that o1 is not output until layer l2 since we can compute the output only once h1 has been computed in the previous layer. The layer view is essentially indexed by the hidden state vector index, except the final layer l1thatoutputso thatdependsonh fromlayerl.
Let Ext denote the loss on input vector xt from the input sequence Xx1,x2, ,x . The unfolded feedforward RNN for X has l1 layers, as shown inFigure26.3a.Defineot asthenetgradientvectorfortheoutputvectorot,i.e.,the derivative of the error function Ext with respect to the net value at each neuron in ot , given as
oExt ,Ext ,,Ext T t neto neto neto
t1 t2 tp
whereot ot1,ot2,,otpT Rp isthepdimensionaloutputvectorattimet,andneto
ti is the net value at output neuron oti at time t. Likewise, let ht denote the net gradient

676
Deep Learning
l0
Wh,bh
l1
l2 o1
l3 o2

l o1
l 1 o
h2

Wh,bh
Wh,bh
l 0
Wh h1
l 1
l 2 o1
o1
l 3 o2
o2

h 1 h1
lo1
o1
l1 o
o

a Feedforward step
Wh h2
Wh h
b Backpropagation step Figure 26.3. RNN unfolded as layers.
vectorforthehiddenstateneuronsht attimet
hExt ,Ext ,,Ext T
t neth neth neth t1 t2 tm
where htht1,ht2, ,htmTRm is the mdimensional hidden state vector at time t, and neth is the net value at hidden neuron h at time t. Let fh and fo denote the
ti ti activationfunctionsforthehiddenstateandoutputneurons,andletfth andfto denote
the vector of the derivatives of the activation function with respect to the net signal that is, its argument for the hidden and output neurons at time t, given as
fhneth fhneth fhneth T fh t1 , t2 ,, tm

h0
h1
h 1

h
x1
x2
x 1
x
h1 h1
h2 h2
h h
h0

x1
x2

x 1
x
t neth neth neth t1 t2 tm
Wo,bo
Wi
Wo,bo
Wo,bo
Wi
Wo,bo
Wi
Wi
W o o2
W o o
W o 1o
W o o1

Recurrent Neural Networks 677
foneto foneto fonetoT fo t1 , t2 ,, tp
Finally, let Ext respect to ot :
denote the vector of partial derivatives of the error function with
Ext Ext ,Ext , , Ext T ot1 ot2 otp
t neto neto neto t1 t2 tp
Computing Net Gradients
The key step in backpropagation is to compute the net gradients in reverse order, starting from the output neurons to the input neurons via the hidden neurons. Given the layer view in Figure 26.3a, the backpropagation step reverses the flow direction for computing the net gradients ot and ht , as shown in the backpropagation graph in Figure 26.3b.
Inparticular,thenetgradientvectorattheoutputot canbecomputedasfollows:
ot f t o E x t2 6 . 3
whereis the elementwise or Hadamard product, For example, if Ext is the squared error function, and the output layer uses the identity function, then via Eq. 25.8 and Eq. 25.15 we have
ot1 o ty t
On the other hand, the net gradients at each of the hidden layers need to account for the incoming net gradients from ot and from ht1 as seen in Figure 26.3b. Thus, generalizingEq.25.53,thenetgradientvectorforht fort1,2,,1isgiven as
ht fth WootWhht1 26.4
Notethatforh,itdependsonlyono seeFigure26.3b,thereforeh fh W o o
For the tanh activation, which is commonly used in RNNs, the derivative of the activationfunctionseeEq.25.11withrespecttothenetvaluesatht isgivenas
fth 1ht ht
Finally, note that the net gradients do not have to be computed for h0 or for any of the input neurons xt , since these are leaf nodes in the backpropagation graph, and thus do not backpropagate the gradients beyond those neurons.
Stochastic Gradient Descent
The net gradients for the output ot and hidden neurons ht at time t can be used to compute the gradients for the weight matrices and bias vectors at each time point.

678 Deep Learning
However, since an RNN uses parameter sharing across time, the gradients are obtained bysummingupallofthecontributionsfromeachtimestept.DefinetWo andtbo asthe gradientsoftheweightsandbiasesbetweenthehiddenneuronsht andoutputneurons ot fortimet.Usingthebackpropagationequations,Eq.25.47andEq.25.48,for deep multilayer perceptrons, these gradients are computed as follows:
botboot
t1 t1
Wo
T tWohtot
t1 t1
Likewise, the gradients of the other shared parameters between hidden layers ht1 and ht,andbetweentheinputlayerxt andhiddenlayerht,areobtainedasfollows:

ht
WitWixtht
t1 t1
b h
t1
W h
t1
t1
t1
T h t1 ht
t b h
T
t W h
where tWh and tbh are the gradient contributions from time t for the weights and biases for the hidden neurons, and tWi the gradient contributions for the weights for the input neurons. Finally, we update all the weight matrices and bias vectors as follows
Wi Wi Wi Wh Wh Wh bh bh bh WoWoWo bobobo
whereis the gradient step size or learning rate. 26.1.3 Training RNNs
26.5
Algorithm 26.1 shows the pseudocode for learning the weights and biases for an RNN. TheinputscomprisethedatasetDXi,Yii1,,n,thestepsizeforgradientdescent, an integer threshold maxiter denoting the number of iterations for training, the size of the hidden state vectors m, and the activation functions f o and f h for the output and hidden layers. The size of the input d and output p layers is determined directly from D. For simplicity we assume that all inputs Xi have the same length , which determines the number of layers in the RNN. It is relatively easy to handle variable length input sequences by unrolling the RNN for different input lengths i .
The RNN first initializes the weight matrices and bias vectors with random values drawn uniformly from a small range, e.g., 0.01,0.01. The RNN considers each input pair X , Y D, and computes the predicted output ot for each time step via the feedforward process. The backpropagation phase begins by computing the error betweenot andtrueresponseyt,andthenthenetgradientvectorot attheoutputlayer for each time step t. These net gradients are backpropagated from the output layers at time t to the hidden layers at time t , which are in turn used to compute the net gradients ht at the hidden layers for all time steps t1,2, ,. Note that Line 15 shows the case where the output layer neurons are independent; if they are not independent we can replace it by Fo Ext see Eq. 25.51. Next, we compute the weight gradient

Recurrent Neural Networks 679 Algorithm 26.1: RNN Training: Stochastic Gradient Descent
RNNTRAINING D,,maxiter,m,f o,f h:
Initialize bias vectors
1 bhrandom mdimensional vector with small values
2 borandom pdimensional vector with small values
Initialize weight matrix
3 Wirandom dm matrix with small values
4 Whrandom mm matrix with small values
5 Worandom mp matrix with small values
6 r0iteration counter
7 repeat
8 foreach X,YD in random order do
9 X length of training sequence
FeedForward Phase
10 h00Rminitialize hidden state
11 for t1,2,, do
12 h tf hW i T x tW hT h t1b h
13 o tf oW oT h tb o
Backpropagation Phase
14 for t , 1, . . . , 1 do
15 otftoExtnet gradients at output
16 h fth Wootnetgradientsath
17 for t1, 2,, 1 do
18 ht fth WootWhht1netgradientsatht
Gradients of weight matrices and bias vectors
19 bot1ot; Wot1htotT
20 bh t1ht ; Wh t1ht1 ht T;Gradient Descent Step
21 bobobo; WoWoWo
22 bh bh bh; Wh Wh Wh;
23 rr1
24 until rmaxiter
Wi t1xt ht T Wi Wi Wi
matrices,Wi,Wh,andWo,andbiasgradientvectors,bh andbo.Thesegradients are used to update the weights and biases via stochastic gradient descent. After each point has been used to update the weights, that completes one epoch of training. The training stops when maxiter epochs have been reached.
Note that, whereas Algorithm 26.1 shows the pseudocode for stochastic gradient descent, in practice, RNNs are trained using subsets or minibatches of input sequences instead of single sequences. This helps to speed up the computation and convergence of gradient descent, since minibatches provide better estimates of the bias and weight gradients and allow the use of vectorized operations.

680
Deep Learning
S
2X4 TS
0B1XP6E7 PV
3V5
T
Figure 26.4. Reber grammar automata.
Example 26.1 RNN. We use an RNN to learn the Reber grammar, which is generated according to the automata shown in Figure 26.4. Let B,E,P,S,T,V,X
denote the alphabet comprising the seven symbols. Further, let
symbol. Starting from the initial node, we can generate strings that follow the Reber grammar by emitting the symbols on the edges. If there are two transitions out of a node, each one can be chosen with equal probability. The sequence B,T,S,S,X,X,T,V,V,E is a valid Reber sequence with the corresponding state sequence 0,1,2,2,2,4,3,3,5,6,7. On the other hand, the sequence B,P,T,X,S,E is not a valid Reber sequence, since there is no edge out of state 3 with the symbol X.
The task of the RNN is to learn to predict the next symbol for each of the positions in a given Reber sequence. For training, we generate Reber sequences from theautomata.LetSX s1,s2,,sbeaRebersequence.Thecorrespondingtrue output Y is then given as the set of next symbols from each of the edges leaving the state corresponding to each position in SX . For example, consider the Reber sequence SXB,P,T,V,V,E, with the state sequence 0,1,3,3,5,6,7. The
, where
the terminal symbol. Here, PT denotes that the next symbol can be either P or T. We can see that SY comprises the sequence of possible next symbols from each of the
desired output sequence is then given as SYPT,TV,TV,PV,E,

states inexcluding the start state 0.
To generate the training data for the RNN, we have to convert the symbolic
Reber strings into numeric vectors. We do this via a binary encoding of the symbols, as follows:

denote a terminal
is
B E P S T V X
1,0,0,0,0,0,0T 0,1,0,0,0,0,0T 0,0,1,0,0,0,0T 0,0,0,1,0,0,0T 0,0,0,0,1,0,0T 0,0,0,0,0,1,0T 0,0,0,0,0,0,1T 0,0,0,0,0,0,0T

Recurrent Neural Networks 681
That is, each symbol is encoded by a 7dimensional binary vector, with a 1 in the column corresponding to its position in the ordering of symbols in . The terminal symbolis not part of the alphabet, and therefore its encoding is all 0s. Finally, to encode the possible next symbols, we follow a similar binary encoding with a 1 in the column corresponding to the allowed symbols. For example, the choice PT is encoded as 0,0,1,0,1,0,0T. Thus, the Reber sequence SX and the desired output sequence SY are encoded as:
X
Y
x1 x2 x3 x4 x5 x6
y1 y2 y3 y4 y5 y6

BPTVVE
PT TV TV PV E
B E P S T V X
100000 000001 010000 000000 001000 000110 000000
000000 000010 100100 000000 111000 011100 000000
For training, we generate n400 Reber sequences with a minimum length of 30. The maximum sequence length is 52. Each of these Reber sequences is used to create a training pair X,Y as described above. Next, we train an RNN with m4 hidden neurons using tanh activation. The input and ouput layer sizes are determined by the dimensionality of the encoding, namely d7 and p7. We use a sigmoid activation at the output layer, treating each neuron as independent. We use the binary cross entropy error function. The RNN is trained for r10000 epochs, using gradient step size 1 and the entire set of 400 input sequences as the batch size. The RNN model learns the training data perfectly, making no errors in the prediction of the set of possible next symbols.
We test the RNN model on 100 previously unseen Reber sequences with minimum length 30, as before. The RNN makes no errors on the test sequences. On the other hand, we also trained an MLP with a single hidden layer, with size m varying between 4 and 100. Even after r10000 epochs, the MLP is not able to correctly predict any of the output sequences perfectly. It makes 2.62 mistakes on average per sequence for both the training and testing data. Increasing the number of epochs or the number of hidden layers does not improve the MLP performance.
26.1.4 Bidirectional RNNs
AnRNNmakesuseofahiddenstateht thatdependsontheprevioushiddenstateht1 and the current input xt at time t. In other words, it only looks at information from the past. A bidirectional RNN BRNN, as shown in Figure 26.5, extends the RNN model to also include information from the future. In particular, a BRNN maintains a backward hidden state vector btRm that depends on the next backward hidden state bt1 and the current input xt. The output at time t is a function of both ht and bt. In

682
Deep Learning
t 0
t 1
Wbo Who, bo
t 2
Wbo Who, bo
t1 Wbo Who, bo
Wb,bb Wih Wib Wih
t1
Wh,bh
o1
o2

t

o 1
o
Wh,bh
Wbo Who, bo Wh,bh
h0
h1

h 1
h
b 1
b1
h2
b2

b 1
b
Wb,bb Wih Wib Wih
Wb,bb
Wib
Wib
x1
x2

Figure 26.5. Bidirectional RNN: Unfolded in time.
particular, we compute the forward and backward hidden state vectors as follows:
x 1
x
hfhWTxWTh b t ihtht1h
bt fbWTxt WTbt1 bb ib b
26.6
Also, a BRNN needs two initial state vectors h0 and b1 to compute h1 and b, respectively. These are usually set to 0Rm. The forward and backward hidden states are computed independently, with the forward hidden states computed by considering the input sequence in the forward direction x1,x2, ,x , and with the backward hidden states computed by considering the sequence in reverse order x,x1,,x1. The output at time t is computed only when both ht and bt are available, and is given as
ot foWT ht WT bt bo ho bo
It is clear that BRNNs need the complete input before they can compute the output. We can also view a BRNN as having two sets of input sequences, namely the forward input sequence Xx1,x2,,x and the reversed input sequence Xr x,x1,,x1,withthecorrespondinghiddenstatesht andbt,whichtogether determine the output ot . Thus, a BRNN is comprised of two stacked RNNs with independent hidden layers that jointly determine the output.
26.2 GATED RNNS: LONG SHORTTERM MEMORY NETWORKS
One of the problems in training RNNs is their susceptibility to either the vanishing gradient or the exploding gradient problem. For example, consider the task of computing the net gradient vector ht for the hidden layer at time t, given as
ht fthWootWhht1

Gated RNNS: Long ShortTerm Memory Networks 683
Assume for simplicity that we use a linear activation function, i.e., fth1, and let us ignore the net gradient vector for the output layer, focusing only on the dependence on the hidden layers. Then for an input sequence of length, we have
hWh WWh W2h Wth t h t1 h h t2 h t2 h
Inparticular,forexample,attimet1,wehaveh W1h.Wecanobservethat 1h
the net gradient from timeaffects the net gradient vector at time t as a function of Wt, i.e., as powers of the hidden weight matrix W . Let the spectral radius of W ,
hhh defined as the absolute value of its largest eigenvalue, be given as 1. It turns out that if 11, then Whk0 as k, that is, the gradients vanish as we train on long sequences. On the other hand, if 11, then at least one element of Whk becomes unbounded and thus Whkas k, that is, the gradients explode as we train on long sequences. To see this more clearly, let the eigendecomposition of the square mmmatrixWh begivenas
1 0 W U0 2
0
0U1
. m
where 12m are the eigenvalues of Wh, and U is the matrix comprising the corresponding eigenvectors, u1,u2, ,um, as columns. Thus, we have
0
0U1
. km
It is clear that the net gradients scale according to the eigenvalues of Wh. Therefore, if11,then1k 0ask,andsince1iforalli1,2,,m,then necessarily i k0 as well. That is, the gradients vanish. On the other hand, if 1 1, then 1 k as k, and the gradients explode. Therefore, for the error to neither vanish nor explode, the spectral radius of Wh should remain 1 or very close to it.
Long shortterm memory LSTM networks alleviate the vanishing gradients problem by using gate neurons to control access to the hidden states. Consider the mdimensional hidden state vector htRm at time t. In a regular RNN, we update the hidden state as follows as per Eq. 26.1:
h tf hW i T x tW hT h t1b h
Let g0, 1m be a binary vector. If we take the elementwise product of g and ht , namely, ght , then elements of g act as gates that either allow the corresponding element of ht to be retained or set to zero. The vector g thus acts as logical gate that allows selected elements of ht to be remembered or forgotten. However, for backpropagation we need differentiable gates, for which we use sigmoid activation on the gate neurons so that their value lies in the range 0,1. Like a logical gate, such neurons allow the inputs to be completely remembered if the value is 1, or
h. . 0 0
k1 0 WkU0k2
h. . 0 0

684 Deep Learning forgotten if the value is 0. In addition, they allow a weighted memory, allowing partial
remembrance of the elements of ht , for values between 0 and 1.
Example26.2DifferentiableGates. Asanexample,considerahiddenstatevector ht0.94 1.05 0.39 0.97 0.90T
First consider a logical gate vector
g0 1 1 0 1T
Their elementwise product gives
ght 0 1.05 0.39 0 0.90T
We can see that the first and fourth elements have been forgotten. Now consider a differentiable gate vector
g0.1 0 1 0.9 0.5T The elementwise product of g and ht gives
ght0.094 0 0.39 0.873 0.45T
Now, only a fraction specified by an element of g is retained as a memory after the elementwise product.
26.2.1 Forget Gate
To see how gated neurons work, we consider an RNN with a forget gate. Let htRm be the hidden state vector, and let tRm be a forget gate vector. Both these vectors have the same number of neurons, m.
In a regular RNN, assuming tanh activation, the hidden state vector is updated unconditionally, as follows:
h tt a n hW i T x tW hT h t1b h
Instead of directly updating ht , we will employ the forget gate neurons to control how much of the previous hidden state vector to forget when computing its new value, and also to control how to update it in light of the new input xt .
Figure26.6showsthearchitectureofanRNNwithaforgetgate.Giveninputxt and previous hidden state ht 1 , we first compute a candidate update vector ut , as follows:
ut tanhWTxt WT ht1 bu 26.7 u hu
The candidate update vector ut is essentially the unmodified hidden state vector, as in a regular RNN.

Gated RNNS: Long ShortTerm Memory Networks
685
ot
Wo,bo
ht
Wh 1
1
1t
1Whu
t
ut
W,b
Wu,bu
Figure 26.6. RNN with a forget gate t. Recurrent connections shown in gray, forget gate shown doublelined.denotes elementwise product.
Using the forget gate, we can compute the new hidden state vector as follows:
ht t ht1 1tut 26.8
Hereis the elementwise product operation. We can see that the new hidden state vector retains a fraction of the previous hidden state values, and a complementary fraction of the candidate update values. Observe that if t0, i.e., if we want to entirely forgettheprevioushiddenstate,then1t 1,whichmeansthatthehiddenstatewill be updated completely at each time step just like in a regular RNN. Finally, given the hiddenstateht wecancomputetheoutputvectorot asfollows
o tf oW oT h tb o
How should we compute the forget gate vector t ? It makes sense to base it on
xt
both the previous hidden state and the new input value, so we compute it as follows:WTx WT h b26.9
where we use a sigmoid activation function, denoted, to ensure that all the neuron values are in the range 0, 1, denoting the extent to which the corresponding previous hidden state values should be forgotten.
To summarize, a forget gate vector t is a layer that depends on the previous hidden state layer ht1 and the current input layer xt; these connections are fully connected, and are specified by the corresponding weight matrices Wh and W , and the bias vector b. On the other hand, the output of the forget gate layer t needs to modify the previous hidden state layer ht1, and therefore, both t and ht1 feed into what is essentially a new elementwise product layer, denoted byin Figure 26.6.
t tht1

686 Deep Learning
Finally, the output of this elementwise product layer is used as input to the new hidden layerht thatalsotakesinputfromanotherelementwisegatethatcomputestheoutput fromthecandidateupdatevectorut andthecomplementedforgetgate,1t.Thus, unlike regular layers that are fully connected and have a weight matrix and bias vector between the layers, the connections between t and ht via the elementwise layer are all onetoone, and the weights are fixed at the value 1 with bias 0. Likewise the connections between ut and ht via the other elementwise layer are also onetoone, with weights fixed at 1 and bias at 0.
Example 26.3. Let m5. Assume that the previous hidden state vector and the candidate update vector are given as follows:
ht10.94 1.05 0.39 0.97 0.9T ut0.5 2.5 1.0 0.5 0.8T Let the forget gate and its complement be given as follows:
t0.9 1 0 0.1 0.5T 1t 0.1 0 1 0.9 0.5T
The new hidden state vector is then computed as the weighted sum of the previous hidden state vector and the candidate update vector:
ht t ht1 1tut
0.9 1 0 0.1 0.1 0 1 0.9
0.5T0.94 1.05 0.39 0.97 0.9T
0.5T0.5 2.5 1.0 0.5 0.8T
0 0.097 0.45T0.05 0 1.0 0.45 0.40T
0.846 1.05
0.796 1.05 1.0 0.353 0.85T
Computing Net Gradients
It is instructive to compute the net gradients for an RNN with a forget gate, since a similar approach is used to compute the net gradients when training an LSTM network. An RNN with a forget gate has the following parameters it needs to learn, namely the weight matrices Wu, Whu, W, Wh, and Wo, and the bias vectors bu, b and bo. The computation of the hidden state vector ht adds together the inputs from the new elementwise layer that multiplies its incoming edges to compute the net input as opposed to computing a weighted sum. We will look at how to account for the elementwise layers during backpropagation.
Figure 26.7 shows a forget gate RNN unfolded in time for two time steps. Let ot , ht , t , and ut denote the net gradient vectors at the output, hidden, forget gate, and candidate update layers, respectively. During backpropagation, we need to compute the net gradients at each layer. The net gradients at the outputs are computed by

Gated RNNS: Long ShortTerm Memory Networks 687
o1
Wo,bo Wo,bo

h0
h1
h2
1
11 12
u1
2
u2
Wh Whu
Wh
Whu
x1
W,b
Figure 26.7. RNN with a forget gate unfolded in time recurrent connections in gray.
considering the partial derivatives of the activation function fto and the error function Ext :
ot f t o E x t
For the other layers, we can reverse all the arrows to determine the dependencies
between the layers. Therefore, to compute the net gradient for the update layer
ut , notice that in backpropagation it has only one incoming edge from ht via the
elementwise product 1 u . The net gradient u at update layer neuron i at tt ti

time t is given as
uxxtiti h 11u2
Wu,bu
W,b
Wu,bu
E E neth u
ti netu neth uti netu ti ti ti
x2
ti ti ti neth
ti ti ht1,i 1tiuti1ti, and we use the fact that the uti uti
where
update layer uses a tanh activation function. Across all neurons, we obtain the net gradient at ut as follows:
ut ht 1t1ut ut
To compute the net gradient vector for the forget gate, we observe from
Figure 26.7 that there are two incoming flows into t during backpropagationone
from ht via the elementwise product tht1, and the other also from ht via the
elementwise product 1 u . Therefore, the net gradientat forget gate neuron tt ti
i at time t is given as
E E neth
xxtiti h h u1
ti net neth ti net ti t1,i ti ti ti ti ti ti
o2

688 Deep Learning neth
where ti ti ht1,i 1tiutiht1,i uti, and we use the fact that the ti ti
forget gate uses a sigmoid activation function. Across all neurons, we obtain the net gradientatt asfollows:
t ht ht1 utt 1t
Finally, let us consider how to compute ht , the net gradient at the hidden layer at time t. In Figure 26.7 we can observe that if we reverse the arrows, ht depends on the gradients at the output layer ot, the forget gate layer t1, the update layer ut1, and on the hidden layer ht1 via the elementwise product ht1t1. The output, forget and update layers are treated as in a regular RNN. However, due to the elementwise layer, the flow from ht1 is handled as follows:
E neth h
xtt1,iti h1h
neth hti neth t1,i t1,i t1,i t1,i t1,i ti
neth
t1,i t1,i hti 1t1,iut1,it1,i, and we used the fact that
where
ht implicitly uses an identity activation function. Across all the hidden neurons at time t, the net gradient vector component from ht1 is given as ht1t1. Considering all the layers, including the output, forget, update and elementwise layers, the complete net gradient vector at the hidden layer at time t is given as:
ht Woot Wht1 Whuut1 ht1 t1
Given the net gradients, we can compute the gradients for all the weight matrices and bias vectors in a manner similar to that outlined for a regular RNN in Section 26.1.2. Likewise, stochastic gradient descent can be used to train the network.
26.2.2 Long ShortTerm Memory LSTM Networks
We now describe LSTMs, which use differentiable gate vectors to control the hidden state vector ht , as well as another vector ctRm called the internal memory vector. In particular, LSTMs utilize three gate vectors: an input gate vector tRm, a forget gate vector tRm, and an output gate vector tRm, as illustrated in Figure 26.8, which shows the architecture of an LSTM network. Like a regular RNN, an LSTM also maintains a hidden state vector for each time step. However, the content of the hidden vector is selectively copied from the internal memory vector via the output gate, with the internal memory being updated via the input gate and parts of it forgotten via the forget gate.
Let Xx1,x2, ,xdenote a sequence of ddimensional input vectors of length , Yy1,y2, ,ythe sequence of pdimensional response vectors, and Oo1 , o2 ,, othe pdimensional output sequence from an LSTM. At each time step t, the three gate vectors are updated as follows:
26.10
hti hti
t WTxt WT ht1 bh
WTx WT h bt tht1
tWTxtWT ht1bh

Gated RNNS: Long ShortTerm Memory Networks
689
1 Wh
1
1
Wo,bo

tanh
1
Wh
Whu
Wh
1

Wu,bu
W,b
W,b
W,b
Figure 26.8. LSTM neural network. Recurrent connections shown in gray, gate layers shown doublelined.
Heredenotes the sigmoid activation function. We can observe that each gate is a function of the input vector xt at time t, as well as the hidden state ht1 from the previous time step. Each gate vector has a corresponding weight matrix from the input neurons to the gate neurons, and from the hidden state neurons to the gate neurons, as well as a corresponding bias vector. Each of the gate vectors conceptually plays a different role in an LSTM network. The input gate vector t controls how much of the input vector, via the candidate update vector ut , is allowed to influence the memory vectorct.Theforgetgatevectort controlshowmuchofthepreviousmemoryvector to forget, and finally the output gate vector t controls how much of the memory state is retained for the hidden state.
Given the current input xt and the previous hidden state ht1, an LSTM first computesacandidateupdatevectorut afterapplyingthetanhactivation:
ut tanhWTxt WT ht1 bu 26.11 u hu
It then then applies the different gates to compute the internal memory and hidden state vectors:
ct t ut t ct1 ht t tanhct
26.12
ot
ht
ct
t
t
ut
t
xt

690 Deep Learning
The memory vector ct at time t depends on the current update vector ut and the previous memory ct1. However, the input gate t controls the extent to which ut influences ct, and the forget gate t controls how much of the previous memory is forgotten.Ontheotherhand,thehiddenstateht dependsonatanhactivatedinternal memory vector ct , but the output gate t controls how much of the internal memory is reflected in the hidden state. Besides the input vectors xt , the LSTM also needs an initial hidden state vector h0 and an initial memory state vector c0, both typically set to 0Rm.
Finally, the output of the network ot is obtained by applying the output activation function f o to an affine combination of the hidden state neuron values:
o tf oW oT h tb o
LSTMs can typically handle long sequences since the net gradients for the internal memory states do not vanish over long time steps. This is because, by design, the memory state ct1 at time t1 is linked to the memory state ct at time t via implicit weights fixed at 1 and biases fixed at 0, with linear activation. This allows the error to flow across time steps without vanishing or exploding.
LSTMs can be trained just like regular RNNs by unfolding the layers in time, as illustrated in Figure 26.9, which shows the unfolded layers for two time steps. The first step in training is to use the feedforward steps to compute the error, followed by the backpropagation of the gradients as a second step. The latter has to be modified to accommodatetheelementwiseoperationsusedtoupdatethememorystatect andthe hidden state ht. The connections from ct1 to ct starting from c0 to c, which can be thought of as using unit weight matrices and zero biases, appear as a straight line in the figure indicating that the internal memory state can flow across longer periods of time without the gradients vanishing or exploding.
26.2.3 Training LSTMs
Consider the unfolded LSTM in Figure 26.9. During backpropagation the net gradient vector at the output layer at time t is computed by considering the partial derivatives of the activation function, fto, and the error function, Ext , as follows:
ot f t o E x t
where we assume that the output neurons are independent.
In backpropagation there are two incoming connections to the internal memory
vector ct, one from ht and the other from ct1. Therefore, the net gradient c at the ti
internal memory neuron i at time t is given as
E E neth c E netc c
c xxti tixt1,i ti ti netc neth cti netc netc cti netc
ti ti
ti t1,i ti
h1c2c
ti ti ti t1,i t1,i

Gated RNNS: Long ShortTerm Memory Networks 691
o1
o2
Wo,bo Wo,bo

tanh tanh
h0
h1
h2
Wh Wh Whu
Wh Wh Wh Whu Wh
c0
c1
c2
1
1
u1
1
2
2
u2
2
W,b

W,b Wu,bu W,b W,b W,b Wu,bu
W,b
x1
Figure 26.9. LSTM neural network unfolded in time recurrent connections in gray.
where we use the fact that the internal memory vector implicitly uses an identity activation function, and furthermore
neth
ti

cti ti
ti tanhctiti1c2 netc
x2
cti
t1,it1,i ut1,i t1,i cti t1,i cti cti
The net gradient vector ct at ct is therefore given as:
ct ht t1ctctct1t1
The forget gate has only one incoming edge in backpropagation, from ct, via the elementwise multiplication tct1, with sigmoid activation, therefore the net gradient is:
E E netc
xxti ti cc 1
net ti t1,i ti ti ti ti ti
ti net netc ti
where we used the fact that the forget gate uses sigmoid activation and
netc
titi uti ti ct1,i ct1,i
ti ti
Across all forget gate neurons, the net gradient vector is therefore given as t ct ct1 1tt

692 Deep Learning
The input gate also has only one incoming edge in backpropagation, from ct , via the elementwise multiplication tut , with sigmoid activation. In a similar manner, as outlined above for t , the net gradient t at the input gate t is given as:
t ct ut 1tt
The same reasoning applies to the update candidate ut , which also has an incoming edge from ct via tut and tanh activation, so the net gradient vector ut at the update layer is
ut ct t 1u tu t
Likewise, in backpropagation, there is one incoming connection to the output gate
from ht via ttanhctwith sigmoid activation, therefore t ht tanhct1tt
Finally, to compute the net gradients at the hidden layer, note that gradients flow back to ht from the following layers: ut1,t1,t1,t1 and ot. Therefore, the net gradient vector at the hidden state vector ht is given as
thWoot Wht1Wht1Wht1Whuut1
The gradients for the weight matrix and bias vector at the output layer are given
as:
given as follows:
b t1t
b t1t b t1t bu t1ut
W t1xttT W t1xt t T W t1xt t T Wu t1xt ut T
b o
t1
W o
h t ot
Wh t1ht1tT
Wh t1ht1 t T Wh t1ht1 t T Whu t1ht1 ut T

T
ot
Likewise, the gradients for the weight matrices and bias vectors for the other layers are
Given these gradients, we can use the stochastic gradient descent approach to train the network.
t1
Example 26.4 LSTM. We use an LSTM to learn the embedded Reber grammar, which is generated according to the automata shown in Figure 26.10. This automata has two copies of the Reber automata from Example 26.1. From the state s1, the top automata is reached by following the edge labeled T, whereas the bottom automata is reached via the edge labeled P. The states of the top automata are labeled as t0,t1, ,t7, whereas the states of the bottom automata are labeled as p0,p1, ,p7. Finally, note that the state e0 can be reached from either the top or the bottom

Gated RNNS: Long ShortTerm Memory Networks 693 S
X
t2 TS
BE
t0 t1 Pt6 t7 X
PV TT
t4
t3
V
t5
BTE
s0 s1 S
e0 e1
X
p4 TS
BE
p0 p1 Pp6 p7 X
PV
p2 PP
p3
T
V
p5
Figure 26.10. Embedded Reber grammar automata.
automata by following the edges labeled T and P, respectively. The first symbol is always B and the last symbol is always E. However, the important point is that the second symbol is always the same as the second last symbol, and thus any sequence learning model has to learn this long range dependency. For example, the following is a valid embedded Reber sequence: SXB,T,B,T,S,S,X,X,T,V,V,E,T,E.
The task of the LSTM is to learn to predict the next symbol for each of the positions in a given embedded Reber sequence. For training, we generate n400 embedded Reber sequences with a minimum length of 40, and convert them into training pairs X,Y using the binary encoding described in Example 26.1. The maximum sequence length is 64.
Given the long range dependency, we used an LSTM with m20 hidden neurons smaller values of m either need more epochs to learn, or have trouble learning the grammar. The input and output layer sizes are determined by the dimensionality of encoding, namely d7 and p7. We use sigmoid activation at the output layer, treating each neuron as independent. Finally, we use the binary cross entropy error function. The LSTM is trained for r10000 epochs using step size 1 and batch size 400; it learns the training data perfectly, making no errors in the prediction of the set of possible next symbols.
We test the LSTM model on 100 previously unseen embedded Reber sequences with minimum length 40, as before. The trained LSTM makes no errors on the test sequences. In particular, it is able to learn the long range dependency between the second symbol and the second last symbol, which must always match.
The embedded Reber grammar was chosen since an RNN has trouble learning the long range dependency. Using an RNN with m60 hidden neurons, using r25000 epochs with a step size of 1, the RNN can perfectly learn the training sequences. That is, it makes no errors on any of the 400 training sequences. However,

694 Deep Learning
on the test data, this RNN makes a mistake in 40 out of the 100 test sequences. In fact, in each of these test sequences it makes exactly one error; it fails to correctly predict the second last symbol. These results suggest that while the RNN is able to memorize the long range dependency in the training data, it is not able to generalize completely on unseen test sequences.
26.3 CONVOLUTIONAL NEURAL NETWORKS
A convolutional neural network CNN is essentially a localized and sparse feedfor ward MLP that is designed to exploit spatial andor temporal structure in the input data. In a regular MLP all of the neurons in layer l are connected to all of the neurons in layer l1. In contrast, a CNN connects a contiguous or adjacent subset of neurons in layer l to a single neuron in the next layer l1. Different sliding windows comprising contiguous subsets of neurons in layer l connect to different neurons in layer l1. Furthermore, all of these sliding windows use parameter sharing, that is, the same set of weights, called a filter, is used for all sliding windows. Finally, different filters are used to automatically extract features from layer l for use by layer l1.
26.3.1 Convolutions
We begin by defining the convolution operation for oneway, twoway and threeway inputs. By oneway we mean data in the form of a single vector, by twoway we mean data in the form of a matrix, and by threeway we mean data in the form of a tensor. We also call them 1D, 2D or 3D inputs where the dimensionality refers to the number of axes in the input data. We will discuss the convolution operation in the context of the input layer and the first hidden layer, but the same approach can be applied to subsequent layers in the network.
1D Convolution
Let xx1,x2, ,xnT be an input vector a oneway or 1D input with n points. It is assumed that the input points xi are not independent, but rather, there are dependencies between successive points. Let ww1 , w2 ,, wk T be a vector of weights, called a 1D filter, with kn. Here k is also called the window size. Let xki denote the window of x of length k starting at position i, given as
xkixi,xi1,xi2,,xik1T
with 1ink1. Given a vector aRk , define the summation operator as one that
adds all the elements of the vector. That is,
k
suma
A 1D convolution between x and w, denoted by the asterisk symbol , is defined as
T xw sum xk1wsum xknk1w
i1
ai

Convolutional Neural Networks
695
1
3
1
2
3
1
2
1
3
1
2
3
1
2
1
3
1
2
3
1
2
1
3
1
2
3
1
2
1
0
2
1
0
2
1
1
1
1
1
0
2
7
7
7
1
0
2
5
5
4
xwxxxx w
1
3
1
2
3
1
2

w

w

d sumx3 4w
k
sum xkiwxij1 wj 26.13
j1 fori1,2,,nk1.WecanseethattheconvolutionofxRn andwRk results
in a vector of length nk1.
b sumx3 2w
Figure26.11. 1DConvolution:aeshowtheconvolutionbetweendifferentslidingwindowsofxandthe
c sumx3 3w
filter w with window size k3. The final convolution output is shown in e.
a sumx3 1w
whereis the elementwise product, so that

e sumx3 5w
w
1
7
5
1
0
2
4
1
Example 26.5 1D Convolution. Figure 26.11 shows a vector x with n7 and a filter w1,0,2T with window size k3. The first window of x of size 3 is x311,3,1T. Therefore, as seen in Figure 26.11a, we have
sumx31wsum1,3,1T 1,0,2Tsum1,0,2T1
The convolution steps for different sliding windows of x with the filter w are shown in Figure 26.11ae. The convolution xw has size nk17315, and is given as
xw1,7,5,4,1T
2D Convolution
We can extend the convolution operation to matrix input, for example for images. Let X be an nn input matrix, and let W be a kk matrix of weights, called a 2D filter, with kn. Here k is called the window size. Let Xk i, jdenote the kk submatrix of X starting at row i and column j , defined as follows:
x i , j x i , j1x i , jk1X i,j xi1,j xi1,j1xi1,jk1k. ..
xik1,j xik1,j 1xik1,j k1

696 Deep Learning with 1i, jnk1. Given a kk matrix ARkk , define the summation operator
as one that adds all the elements of the matrix. That is,
sumA
whereai,j istheelementofAatrowiandcolumnj.The2DconvolutionofXandW,
denoted XW, is defined as:
sumXk1,1WsumXk1,nk1W
sumXk2,1WsumXk2,nk1WXW ..sumXknk1,1WsumXknk1,nk1W
whereis the elementwise product of Xk i, jand W, so thatkk
k k i1 j1
ai,j
sum Xki,jWxia1,jb1 wa,b 26.14 a1 b1
fori,j1,2,,nk1.TheconvolutionofXRnn andWRkk resultsina nk1nk1 matrix.
Example 26.6 2D Convolution. Figure 26.12 shows a matrix X with n4 and a filter W with window size k2. The convolution of the first window of X, namely X21,1, with W is given as see Figure 26.12a
sumX21,1Wsum1 21 0sum1 02 310101
The convolution steps for different 22 sliding windows of X with the filter W are shown in Figure 26.12ai. The convolution XW has size 33, since nk1
4213, and is given as
2 6 4 XW4 4 8
444
3D Convolution
We now extend the convolution operation to a threedimensional matrix, which is also called a 3D tensor. The first dimension comprises the rows, the second the columns, and the third the channels. Let X be an nnm tensor, with n rows, n columns and m channels. The assumption is that the input X is a collection of nn matrices obtained by applying m filters, which specify the m channels. For example, for nn image inputs, each channel may correspond to a different color filterred, green or blue.

Convolutional Neural Networks 697 XXX
WWW

a sumX21,1W b sumX21,2W c sumX21,3W
XXX WWW

d sumX22,1W e sumX22,2W f sumX22,3W
XXX WWW

g sumX23,1W h sumX23,2W i sumX23,3W
Figure 26.12. 2D Convolution: ai show the 2D convolution between different 22 sliding windows of X and the filter W. The final 2D convolution output is shown in i.
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
0
0
1
1
0
0
1
1
0
0
1
2
2
2
6
4
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
0
0
1
1
0
0
1
2
6
4
2
6
4
4
4
i , j , q1 xi1,j,q1
.
iki , 1 j , j , 1 q , q r 1 1 xi1,j1,q1
.
ik1 , j1 , qi r , j 1 k1 ,q1xi1,jk1,q1.
xik1,jk1,q1
xi,j,qr1 xi1,j,qr1
.
xi,j1,qr1xi1,j1,qr1.
xi,jk1,qr1 xi1,jk1,qr1
. xik1,jk1,qr1
xxxxx
xxxxx
i , j , q xi1,j,q
. xik1,j,q
ik i ,j 1, j 1 , , q q1 xi1,j1,q
. xik1,j1,q
i k1 , j1 , q i , j 1k1,q

xi1,jk1,q .
xik1,jk1,q
Figure26.13. 3DsubtensorXki,j,q:kkrsubtensorofXstartingatrowi,columnj,andchannelq.
Let W be a kkr tensor of weights, called a 3D filter, with kn and rm. Let Xki,j,q denote the kkr subtensor of X starting at row i, column j and channel q , as illustrated in Figure 26.13, with 1i, jnk1, and 1qmr1.
Given a kkr tensor ARkkr , define the summation operator as one that adds all the elements of the tensor. That is,
k k r i1 j1 q1
sumA
ai,j,q
6
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
0
0
1
2
6
4
4
4
4
8
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
0
0
1
1
0
0
1
1
0
0
1
2
6
4
2
6
4
2
6
4
4
4
8
4
4
8
4
4
8
4
4
4
4
4
4

698 Deep Learning where ai,j,q is the element of A at row i, column j, and channel q. The 3D convolution
sumXk1,nk1,1WsumXk2,nk1,1W
of X and W, denoted XW, is defined as:
sumXk1,1,1W
XW
.sumXknk1,nk1,2W
sumXk2,1,1W
.sumXknk1,1,1W sumXk1,1,2W sumXk2,1,2W
.sumXknk1,1,2W
.sumXknk1,nk1,1W
sumXk1,nk1,2WsumXk2,nk1,2W
. .sumXk1,1,mr 1W
. sumXk1,nk1,mr 1W
sumXk2,1,mr 1W .
sumXk2,nk1,mr 1W.
sumXknk1,1,mr1W
whereis the elementwise product of Xki,j,q and W, so that
sumXknk1,nk1,mr1W
sum Xki,j,qWxia1,jb1,qc1 wa,b,c 26.15
k k r a1 b1 c1
for i, j1, 2,, nk1 and q1, 2,, mr1. We can see that the convolution ofXRnnm andWRkkr resultsinank1nk1mr1tensor.
3D Convolutions in CNNs Typically in CNNs, we use a 3D filter W of size kkm, with the number of channels rm, the same as the number of channels in XRnnm. Let Xki,j be the kkm subtensor of X starting at row i and column j. Then the 3D convolution of X and W is given as:
sumXk1,1WsumXk1,nk1W sumXk2,1WsumXk2,nk1W
XW ..sumXknk1,1WsumXknk1,nk1W
We can see that when WRkkm, its 3D convolution with XRnnm results in a nk1nk1 matrix, since there is no freedom to move in the third dimension. Henceforth, we will always assume that a 3D filter WRkkm has the same number of channels as the tensor X on which it is applied. Since the number of channels is fixed based on X, the only parameter needed to fully specify W is the window size k.
Example26.73DConvolution. Figure26.14showsa333tensorXwithn3 and m3, and a 223 filter W with window size k2 and r3. The convolution

26.3 Convolutional Neural Networks
699
X
1
2
4
W
2
1
2
1
3
1
0
1
1
2
1
3

a sumX2 1, 1W

b sumX2 1, 2W

c sumX2 2, 1W

d sumX2 2, 2W
Figure 26.14. 3D Convolution: ad Convolution between different 223 sliding windows of X, and the filter W. The final 3D convolution output is shown in d.
3
1
1
0
0
1
0
5

1
1
0
1
3
2
1
1
1
1
2
1
4
2
0
3
1
2
X
1
2
4
2
1
2
1
3
1
0
1
1
2
1
3
3
1
1
1
0

5
1
1
2
0
1
1
1
3
1
1
2
1
4
2
0
11
3
1
2
W
X
1
2
4
W
2
1
2
1
3
1
0
1
2
1
3
1
0
3
1
1
1
0
5
11

1
1
2
0
1
15
1
1
3
1
1
2
1
4
2
0
3
1
2
X
1
2
4
W
2
1
2
1
3
1
0
1
1
0
2
1
3
3
1
1
1
0
5
11

1
1
2
0
1
15
5
1
1
3
1
1
2
1
4
2
0
3
1
2

700 Deep Learning
of the first window of X, namely X21,1, with W is given as see Figure 26.14a sumX21,1Wsum1 1 2 1 1 21 1 1 0 0 1
2 1312 1 200110 sum1 1 2 0 0 25
4 0012 0
where we stack the different channels horizontally. The convolution steps for differ ent 223 sliding windows of X with the filter W are shown in Figure 26.14ad. The convolution XW has size 22, since nk13212 and rm3; it is given as
XW5 11 15 5
26.3.2 Bias and Activation Functions
We discuss the role of bias neurons and activation functions in the context of a tensor
of neurons at layer l. Let Zl be an nl nl ml tensor of neurons at layer l so that zl i,j,q
denotes the value of the neuron at row i, column j and channel q for layer l, with 1i,jnl and1qml.
Filter Bias
Let W be a kkml 3D filter. Recall that when we convolve Zl and W, we get a nl k1nl k1matrixatlayerl1.However,sofar,wehaveignoredthe role of the bias term in the convolution. Let bR be a scalar bias value for W, and let Zlki,jdenotethekkml subtensorofZl atpositioni,j.Then,thenetsignalat neuron zl1 in layer l1 is given as
i,j
netl1 sumZli,jWb i,j k
and the value of the neuron zl1 is obtained by applying some activation function f to i,j
the net signal
zl1 fsumZl i,jWb i,j k
The activation function can be any of the ones typically used in neural networks, for example, identity, sigmoid, tanh, ReLU and so on. In the language of convolutions, the values of the neurons in layer l1 is given as follows
Zl1 fZl Wb
whereindicates that the bias term b is added to each element of the nlk1nl k1matrixZl W.

Convolutional Neural Networks
701
Zl
W1
3
1
4
1
1
1
1
2
3
0
1
2
2
0
1
0
0
1
1
0
Zl1
7
8
7
7
6
8
4
6
10
1
2
2
1
0
1
3
1
4
2
1
0
2
1
3
4
W2

Figure 26.15. Multiple 3D filters.

1
0
0
1
1
2
3
1
6
8
10
5
6
11
7
5
5
1
0
Multiple 3D Filters
We can observe that one 3D filter W with a corresponding bias term b results in a nlk1nlk1matrixofneuronsinlayerl1.Therefore,ifwedesireml1 channels in layer l1, then we need ml1 different kkml filters Wq with a corresponding biastermbq,toobtainthenl k1nl k1ml1 tensorofneuronvaluesat layer l1, given as
Zl1 zl1 fsumZl i,jW bi,j,q k qq
i,j1,2,,nlk1 and q1,2,,ml1 Zl1 fZl W1b1, Zl W2b2, , Zl Wml1bml1
where the activation function f distributes over all of its arguments.
In summary, a convolution layer takes as input the nlnlml tensor Zl of neurons from layer l, and then computes the nl1nl1ml1 tensor Zl1 of neurons for the nextlayerl1viatheconvolutionofZl withasetofml1 different3Dfiltersofsizek kml , followed by adding the bias and applying some nonlinear activation function f . Note that each 3D filter applied to Zl results in a new channel in layer l1. Therefore,
ml1 filters are used to yield ml1 channels at layer l1.
which can be written more compactly as
26.3.3 Padding and Striding
One of the issues with the convolution operation is that the size of the tensors will necessarily decrease in each successive CNN layer. If layer l has size nlnlml , and
0
1
Example 26.8 Multiple 3D Filters. Figure 26.15 shows how applying different fil tersyieldthechannelsforthenextlayer.Itshowsa442tensorZl withn4 and m2. It also shows two different 222 filters W1 and W2 with k2 and r2.Sincerm2,theconvolutionofZl andWi fori1,2resultsina33 matrixsincenk14213.However,W1 yieldsonechannelandW2 yields a second channel, so that the tensor for the next layer Zl1 has size 332, with two channels one per filter.

702
Deep Learning
Zl
1
2
2
1
1
3
1
4
2
1
2
1
3
4
3
1
2
3
1
1
4
1
3
2
1
1
0
0
0
1
1
0
1
0
11
9
9
11
12
8
8
7
0
0
0
0
0
0
0
0
1
2
2
1
1
0
0
3
1
4
2
1
0
0
2
1
3
4
3
0
0
1
2
3
1
1
0
0
4
1
3
2
1
0
0
0
0
0
0
0
0
6
5
7
4
2
1
0
0
0
1
1
0
1
0
6
7
11
9
5
4
9
11
12
6
7
8
8
7
6
5
5
7
6
2
Zl
W
a No padding: p0
Zl1
Zl1
W

b Padding: p1
Figure26.16. Padding:2DConvolutionawithoutpadding,andbwithpadding.
weusefiltersofsizekkml,theneachchannelinlayerl1willhavesizenl k 1nlk1. That is the number of rows and columns for each successive tensor will shrink by k1 and that will limit the number of layers the CNN can have.
Padding
To get around this limitation, a simple solution is to pad each tensor along both the rows and columns in each channel by some default value, typically zero. For uniformity, we always pad by adding the same number of rows at the top and at the bottom, and likewise the same number of columns on the left and on the right. That is, assume that we add p rows both on top and bottom, and p columns both on the left and right. With padding p, the implicit size of layer l tensor is then nl2pnl2pml. Assume that each filter is of size kkml , and assume there are ml1 filters, then the size of thelayerl1tensorwillbenl 2pk1nl 2pk1ml1.Sincewewant to preserve the size of the resulting tensor, we need to have
nl 2pk1nl,whichimplies,pk1 2
With padding, we can have arbitrarily deep convolutional layers in a CNN.
7
Example 26.9 Padding. Figure 26.16 shows a 2D convolution without and with padding. Figure 26.16a shows the convolution of a 55 matrix Zl n5 with a 33 filter W k3, which results in a 33 matrix since nk15313. Thus, the size of the next layer Zl1 has decreased.
On the other hand, zero padding Zl using p1 results in a 77 matrix as shown in Figure 26.16b. Since p1, we have an extra row of zeros on the top

Convolutional Neural Networks
703
1
2
2
1
1
3
1
4
2
1
2
1
3
4
3
1
2
3
1
1
4
1
3
2
1
1
0
0
0
1
1
0
1
0
7
1
2
2
1
1
3
1
4
2
1
2
1
3
4
3
1
2
3
1
1
4
1
3
2
1
1
0
0
0
1
1
0
1
0
7
9
Zl
Zl
W Zl1

as u mZ l31 , 1 WZl
W Zl1

c sumZl3 3, 1W d sumZl3 3, 3W Figure26.17. Striding:2DConvolutionwithstrides2.
bs u mZ l31 , 3 WZl
W Zl1
1
2
2
1
1
3
1
4
2
1
2
1
3
4
3
1
2
3
1
1
4
1
3
2
1
1
0
0
0
1
1
0
1
0
7
9
8
1
2
2
1
1
3
1
4
2
1
2
1
3
4
3
1
2
3
1
1
4
1
3
2
1
1
0
0
0
1
1
0
1
0
7
9
8
7
W Zl1
and bottom, and an extra column of zeros on the left and right. The convolution of the zeropadded X with W now results in a 55 matrix Zl1 since 7315, which preserves the size.
If we wanted to apply another convolution layer, we could zero pad the resulting matrix Zl1 with p1, which would again yield a 55 matrix for the next layer, using a 33 filter. This way, we can chain together as many convolution layers as desired, without decrease in the size of the layers.
Striding
Striding is often used to sparsify the number of sliding windows used in the convolutions. That is, instead of considering all possible windows we increment the index along both rows and columns by an integer value s1 called the stride. A 3D convolutionofZl ofsizenl nl ml withafilterWofsizekkml,usingstrides,is given as:
sumZlk1,1W sumZlk1,1sWlsumZlk1s,1W sumZlk1s,1sW
Z W . .sumZlk1t s,1W sumZlk1t s,1sW
sumZlk1,1t sWsumZlk1s,1t sW
.sumZlk1t s,1t sW
where t nl k . We can observe that using stride s, the convolution of ZlRnl nl ml s
withWRkkml resultsinat1t1matrix.

704 Deep Learning
Example 26.10 Striding. Figure 26.17 shows 2D convolution using stride s2 on a55matrixZl nl 5withafilterWofsize33k3.Insteadofthedefault stride of one, which would result in a 33 matrix, we get a t1t122 matrix Zl1, since
t nlk53 1 s2
We can see that the next window index increases by s along the rows and columns. For example, the first window is Zl31,1 and thus the second window is Z l31 , 1s Z l31 , 3 s e e F i g u r e s 2 6 . 1 7aa n d 2 6 . 1 7b . N e x t , w e m o v e d o w n b y a stride of s2, so that the third window is Zl31s,1Zl33,1, and the final window i s Z l33 , 1s Z l33 , 3 s e e F i g u r e s 2 6 . 1 7ca n d 2 6 . 1 7d .
26.3.4 Generalized Aggregation Functions: Pooling
Let Zl be a nlnlml tensor at layer l. Our discussion of convolutions so far assumes that we sum together all of the elements in the elementwise product of the kkr subtensor Zlki,j,q and the filter W. In fact, CNNs also use other types of aggregation functions in addition to summation, such as average and maximum.
AvgPooling
If we replace the summation with the average value over the elementwise product of Zlki,j,q and W, we get
avgZl i,j,qW k
avg zl w
ia1,jb1,qc1 a,b,c1 sumZlki,j,qW
a1,2, ,k b1,2, ,k c1,2, ,r
k2 r
If we replace the summation with the maximum value over the elementwise product
MaxPooling
of Zlki,j,q and W, we get
maxZl i,j,qW max zl wa,b,c 26.16
c1,2, ,r
The 3D convolution of ZlRnl nl ml with filter WRkkr using maxpooling, denoted Zl max W,resultsinanl k1nl k1ml r1tensor,givenas:
k a1,2,,k ia1,jb1,qc1 b1,2, ,k

Convolutional Neural Networks
705
maxXk1,1,1W maxXk2,1,1W
.maxXknk1,1,1W maxXk1,1,2W maxXk2,1,2W
ZlmaxW.maxXknk1,1,2W
maxXk1,nk1,1WmaxXk2,nk1,1W
.maxXknk1,nk1,1W
maxXk1,nk1,2WmaxXk2,nk1,2W
.maxXknk1,nk1,2W
. .maxXk1,1,mr 1W
. maxXk1,nk1,mr1W
maxXk2,1,mr 1W .
maxXk2,nk1,mr1W.
maxXknk1,nk1,mr1W
Maxpooling in CNNs Typically, maxpooling is used more often than avgpooling. Also, for pooling it is very common to set the stride equal to the filter size sk, so that the aggregation function is applied over disjoint kk windows in each channel in Zl . More importantly, in pooling, the filter W is by default taken to be a kk1 tensor all of whose weights are fixed as 1, so that W1kk1. In other words, the filter weights are fixed at 1 and are not updated during backpropagation. Further, the filter uses a fixed zero bias that is, b0. Finally, note that pooling implicitly uses an identity activation function. As such, the convolution of ZlRnl nl ml with WRkk1, using stridesk,resultsinatensorZl1 ofsizenl nl ml.
ss
Zl Zl W Zl1
maxXknk1,1,mr1W
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
W Zl1
1
1
1
1
1
1
1
1
3
3
4
a maxZl2 1, 1W Zl
W Zl1

c maxZl2 3, 1W d maxZl2 3, 3W Figure26.18. Maxpooling:Strides2.
b maxZl2 1, 3W Zl
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
1
2
2
1
3
1
4
2
2
1
3
4
1
2
3
1
W Zl1
1
1
1
1
1
1
1
1
3
4
3
4
2
2
4

706 Deep Learning
Example26.11Maxpooling. Figure26.18showsmaxpoolingona44matrixZl
nl4, using window size k2 and stride s2 that equals the window size. The
resultinglayerZl1 thushassize22,sincenl 42.Wecanseethatthefilter s2
W has fixed weights equal to 1.
The convolution of the first window of Zl, namely Zl21,1, with W is given as
see Figure 26.18a
maxZl21,1Wmax1 21 1max1 23 311131
The other convolution steps are shown in Figures 26.18b to 26.18d.
26.3.5 DeepCNNs
In a typical CNN architecture, one alternates between a convolution layer with summation as the aggregation function, and learnable filter weights and bias term and a pooling layer say, with maxpooling and fixed filter of ones. The intuition is that, whereas the convolution layer learns the filters to extract informative features, the pooling layer applies an aggregation function like max or avg to extract the most important neuron value or the mean of the neuron values within each sliding window, in each of the channels.
Starting from the input layer, a deep CNN is comprised of multiple, typically alternating, convolution and pooling layers, followed by one or more fully connected layers, and then the final output layer. For each convolution and pooling layer we need to choose the window size k as well as the stride value s, and whether to use padding p or not. We also have to choose the nonlinear activation functions for the convolution layers, and also the number of layers to consider.
26.3.6 Training CNNs
To see how to train CNNs we will consider a network with a single convolution layer
and a maxpooling layer, followed by a fully connected layer as shown in Figure 26.19.
For simplicity, we assume that there is only one channel for the input X, and further,
we use only one filter. Thus, X denotes the input matrix of size n0n0. The filter W0,
with bias b0, for the convolution layer l1, has size k1k1, which yields the matrix of
neuronsZ1 ofsizen1n1,wheren1 n0k1usingstrides1 1.Sinceweuseonly
one filter, this results in a single channel at layer l1, i.e., m11. The next layer l2
is a maxpooling layer Z2 of size n2n2 obtained by applying a k2k2 filter with stride
s2k2. Implicitly, the maxpooling layer uses a filter of weights fixed at 1, with 0 as the
bias,thatisW11k k andb10.Weassumethatn1isamultipleofk2sothatn2n1. 22 k2
The output of the maxpooling layer Z2 is recast as a vector z2 of length n22, since it is fully connected to the layer l3. That is, all of the n22 neurons in z2 are connected to each of the n3 neurons in z3 with the weights specified by the matrix W2, which has size n2 2n3 , and the bias vector b2Rn3 . Finally, all neurons in z3 are

Convolutional Neural Networks 707
input convolution maxpooling fully connected output l0 l1 l2 l3 l4
W0,b0 W2,b2 W3,b3 n2n2
n0n0 n1n1 k2s2 n3 Figure 26.19. Training: Convolutional neural network.
connected to the p output layer neurons o, with the weight matrix W3Rn3p and bias vectorb3 Rp.
Feedforward Phase
Let DXi,yini1 denote the training data, comprising n tensors XiRn0n0m0 with m01 for ease of explanation and the corresponding response vector yiRp . Given a training pair X, yD, in the feedforward phase, the predicted output o is given via the following equations:
Z1 f1XW0b0
Z2Z1 s2,max 1k2k2 z 3f 3W T2 z 2b 2of oW To z 3b o
where s2,max denotes maxpooling with stride s2.
Backpropagation Phase
Given the true response y and predicted output o, we can use any loss function EX to evaluate the discrepancy between them. The weights and biases are updated by computing the net gradient vector at the output layer, and then backpropagating the net gradients from layer l4 to l1. Let 1, 2, and 3 denote the net gradient vectors at layers l1, 2, 3, respectively, and let o denote the net gradient vector at the output layer. The output net gradient vector is obtained in the regular manner by computing the partial derivatives of the loss function EX and the activation function fo:
o fo EX
assuming that the output neurons are independent.
Since layer l3 is fully connected to the output layer, and likewise the
maxpooling layer l2 is fully connected to Z3, the net gradients at these layers are computed as in a regular MLP
3 f3 Wo o
2 f2 W2 3W2 3
X
Z1 1
Z2 2
z3 3
o
o
p

708 Deep Learning
Let last step follows from the fact that f21, since maxpooling implicitly uses an identity activation function. Note that we also implicitly reshape the net gradient vector 2, so that its size is n22n3n31n221n2n2, as desired.
Consider the net gradient 1 at neuron z1 in layer l1 where i,j1,2, ,n1. ij ij
Since we assume that the stride s2 equals the filter size k2 for the maxpooling layer,
each sliding window in the convolution layer contributes only to one neuron at the
maxpooling layer. Given stride s2k2, the k2k2 sliding window that contains z1 is ij
g i v e n a s Z 1k 2a , b, w h e r e
ai bj
s2 s2
Due to the max aggregation function, the maximum valued element in Z1k2a,b
specifies the value of neuron z2 in the maxpooling layer l2. That is, ab
z2max z1ab i,j1,2,,k2 a1k2i,b1k2j
i,jargmax z1
a1k2i,b1k2j i,j 1,2,,k2
where i,j is the index of the maximum valued neuron in the window Z1k2a,b. The net gradient 1 at neuron z1 is therefore given as
ij
ij
E
E net2 z1 1 XXab ij
ij net1 net2 z1 net1 ij ab ij ij
net2
2abf1
ab z1 ij ij
ll21 wherenetij denotesthenetinputatneuronzij inlayerl.However,sincenetab zi,j,
net2
the partial derivative ab is either 1 or 0, depending on whether z1 is the maximum
z 1i j i j element in the window Z1k2 a,b or not. Putting it all together, we have
2 f1 ifiiandjj 1ab ij
ij 0 otherwise
In other words, the net gradient at neuron z1 in the convolution layer is zero if
ij
this neuron does not have the maximum value in its window. Otherwise, if it is the maximum, the net gradient backpropagates from the maxpooling layer to this neuron and is then multiplied by the partial derivative of the activation function. The n1n1 matrixofnetgradients1comprisesthenetgraidents1 foralli,j1,2,,n.
ij 1
From the net gradients, we can compute the gradients of the weight matrices and
bias parameters. For the fully connected layers, that is, between l2 and l3, and l3 and l4, we have
W3 Z3 oT b3 o W2 Z2 3T b2 3 where we treat Z2 as a n2 21 vector.

Convolutional Neural Networks 709
Note that the weight matrix W1 is fixed at 1k2k2 and the bias term b1 is also fixed at 0, so there are no parameters to learn between the convolution and maxpooling layers. Finally, we compute the weight and bias gradients between the input and convolution layer as follows:
WXk i,j1 b1 0 1 ij 0 ij
nn nn
1 1
1 1
i1 j1
i1 j1
where we used the fact that the stride is s11, and that W0 is a shared filter for all k1k1 windows of X, with the shared bias value b0 for all windows. There are n1n1 such windows, where n1n0k11, therefore, to compute the weight and bias gradients, we sum over all the windows. Note that if there were multiple filters that is, if m11, then the bias and weight gradients for the jth filter would be learned from the corresponding channel j in layer l1.
Example26.12CNN. Figure26.20showsaCNNforhandwrittendigitrecognition. This CNN is trained and tested on the MNIST dataset, that contains 60,000 training images and 10,000 test images. Some examples of handwritten digits from MNIST are shown in Figure 26.21. Each input image is a 2828 matrix of pixel values between 0 to 255, which are divided by 255, so that each pixel lies in the interval 0,1. The corresponding true output yi is a onehot encoded binary vector that denotes a digit from 0 to 9; the digit 0 is encoded as e11,0,0,0,0,0,0,0,0,0T, the digit 1 as e20,1,0,0,0,0,0,0,0,0T, and so on.
In our CNN model, all the convolution layers use stride equal to one, and do not use any padding, whereas all of the maxpooling layers use stride equal to the window size. Since each input is a 2828 pixels image of a digit with 1 channel grayscale, we haven0 28andm0 1,andtherefore,theinputXZ0 isan0n0m0 28281 tensor. The first convolution layer uses m16 filters, with k15 and stride s11, without padding. Thus, each filter is a 551 tensor of weights, and across the six filterstheresultinglayerl1tensorZ1 hassize24246,withn1 n0k11 285124 and m16. The second hidden layer is a maxpooling layer that
uses k2 with a stride of s2. Since maxpooling by default uses a fixed filter
2 2 n24 W1k k 1,theresultingtensorZ2 hassize12126,withn21
12, and m26. The third layer is a convolution layer with m316 channels, with a window size of k35 and stride s31, resulting in the tensor Z3 of size 8816, where n3n2 k3 112518. This is followed by another maxpooling
22 k2 2
Input Convolution
X Z1
Maxpooling
12126 k2s22
Convolution
8816 k3 5
Maxpooling
4416 k4 s4 2
Fully Connected Layers
Z2
Z3
Z4
Z5
Z6
o
28281
120
10 84
24246 k15
Figure 26.20. Convolutional neural network.

710
Deep Learning
00000 55555 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20
25 25 25 25 25
0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25
a label 0 b label 1 c label 2 d label 3 e label 4
00000 55555 10 10 10 10 10 15 15 15 15 15 20 20 20 20 20
25 25 25 25 25
0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25 0 5 10 15 20 25
f label 5 g label 6 h label 7 i label 8 j label 9 Figure 26.21. MNIST dataset: Sample handwritten digits.
layerthatusesk 2ands 2,whichyieldsthetensorZ4 thatis4416,where n3 84 4
n4 k42 4,andm416.
The next three layers are fully connected as in a regular MLP. All of the 44
16256 neurons in layer l4 are connected to layer l5, which has 120 neurons. Thus, Z5 is simply a vector of length 120, or it can be considered a degenerate tensor of size 12011. Layer l5 is also fully connected to layer l6 with 84 neurons, which is the last hidden layer. Since there are 10 digits, the output layer o comprises 10 neurons, with softmax activation function. The convolution layers Z1 and Z3, and the fully connected layers Z5 and Z6, all use ReLU activation.
We train the CNN model on n60000 training images from the MNIST dataset; we train for 15 epochs using step size 0.2 and using crossentropy error since there are 10 classes. Training was done using minibatches, using batch size of 1000. After training the CNN model, we evaluate it on the test dataset of 10,000 images. The CNN model makes 147 errors on the test set, resulting in an error rate of 1.47. Figure 26.22 shows examples of images that are misclassified by the CNN. We show the true label y for each image and the predicted label o converted back from the onehot encoding to the digit label. We show three examples for each of the labels. For example, the first three images on the first row are for the case when the true label is y0, and the next three examples are for true label y1, and so on. We can see that several of the misclassified images are noisy, incomplete or erroneous, and would be hard to classify correctly even by a human.
For comparison, we also train a deep MLP with two fully connected hidden layers with the same sizes as the two fully connected layers before the output layer in the CNN shown in Figure 26.20. Therefore, the MLP comprises the layers X, Z5, Z6, and o, with the input 2828 images viewed as a vector of size d784. The first hidden layer has size n1120, the second hidden layer has size n284, and the output layer has size p10. We use ReLU activation function for all layers, except the output, which uses softmax. We train the MLP model for 15 epochs on the training dataset with n60000 images, using step size 0.5. On the test dataset, the MLP made 264 errors, for an error rate of 2.64. Figure 26.23 shows the number of errors on

Convolutional Neural Networks
711
10 15 20 25
10 15 20 25
10 15 20 25
10 15 20 25
10 10 15 15 20 20 25 25
0510152025
10 15 20 25
10 15 20 25
10 15 20 25
10 15 20 25
10 10 15 15 20 20 25 25
0510152025
10 15 20 25
10 15 20 25
10 15 20 25
10 15 20 25
10 10 15 15 20 20 25 25
0510152025
10 15 20 25
10 15 20 25
10 15 20 25
10 15 20 25
10 10 15 15 20 20 25 25
0510152025
10 15 20 25
10 15 20 25
10 15 20 25
10 15 20 25
10 15 20 25
10 15 20 25
y0,o2 y0,o8 y0,o6 y1,o8 y1,o7 000000 555555
y1,o5 5 10 15 20 25
y3,o8 5 10 15 20 25
y5,o3 5 10 15 20 25
y7,o2 5 10 15 20 25
y9,o0 5 10 15 20 25
0510152025
0510152025
0510152025
0510152025
0
0
0
0
0
y2,o0 y2,o8 y2,o7 y3,o2 y3,o5 000000 555555
0510152025
0510152025
0510152025
0510152025
y4,o2 y4,o2 y4,o7 y5,o6 y5,o3 000000 555555
0510152025
0510152025
0510152025
0510152025
y6,o4 y6,o5 y6,o1 y7,o3 y7,o8 000000 555555
0510152025
0510152025
0510152025
0510152025
y8,o6 y8,o3 y8,o2 y9,o7 y9,o4 000000 555555
0510152025
0510152025
0510152025
0510152025
0510152025
Figure 26.22. MNIST: Incorrect predictions by the CNN model; y is the true label, o is the predicted label. the test set after each epoch of training for both the CNN and MLP model; the CNN
model achieves significantly better accuracy than the MLP.
6,000
4,000
2,000
epochs
Figure 26.23. MNIST: CNN versus Deep MLP; prediction error as a function of epochs.
errors
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
MLP CNN

712 Deep Learning
26.4 REGULARIZATION
Consider the squared error loss function, given as Ly,yyy2
where y is the true response and yo the predicted response on a given input x. The goal of learning is the minimize the expected loss ELy, y Eyy 2 .
As described in Section 22.3, the expected loss can be decomposed into three terms: noise, bias, and variance, given as
E yy2 E yEy 2 E yEy 2 E Ey Ey 2 2 6 . 1 7
noise average variance average bias
The noise term is the expected variance of y, since EyEy2vary. It contributes a fixed cost to the loss independent of the model. Since it is the inherent uncertainty or variability of y it can be ignored when comparing different models. The average bias term indicates how much the model deviates from the true but unknown function relating the response to the predictor variables. For example, if the response is a nonlinear function of the predictor variables, and we fit a simple linear model, then this model will have a high bias. We can try to fit a more complex nonlinear model to reduce the bias, but then we run into the issue of overfitting and high variance, captured by the average variance term, which quantifies the variance of the predicted response y, since EyEy2vary. That is, our model may try to fit noise and other artifacts in the data, and will therefore be highly susceptible to small changes in the data, resulting in high variance. In general, there is always a tradeoff between reducing bias and reducing variance.
Regularization is an approach whereby we constrain the model parameters to reduce overfitting, by reducing the variance at the cost of increasing the bias slightly. For example, in ridge regression Eq. 23.29, we add a constraint on the L2norm of the weight parameters as follows
min JwYY2 w2 YDw2 w2 26.18 w
where Y is the true response vector and Y is the predicted response vector over all training instances. The goal here is to drive the weights to be small, depending on the hyperparameter 0 called the regularization constant. If 0, then there is no regularization, and we get a low bias, but possibly high variance model. Ifthen the effect is to drive all weights to be nearly zero, which results in a low variance, but high bias model. An intermediate value oftries to achieve the right balance between these two conflicting objectives. As another example, in L1 regression or Lasso Eq. 23.43, we minimize the L1 norm of the weights
min JwYDw2 w1 w
wherew1 di1wi.ComparedtoL2 norm,whichmerelymakestheweightssmall, the use of the L1 norm sparsifies the model by forcing many of the weights to zero, acting as a feature subset selection method.

Regularization 713
In general, for any learning model M, if Ly,y is some loss function for a given input x, anddenotes all the model parameters, where yMx. The learning objective is to find the parameters that minimize the loss over all instances:
n
Lyi,yi
With regularization, we add a penalty on the parameters , to obtain the regularized
objective:
i1
min J
i1
min J
n i1
Lyi,Mxi
n
Lyi,yiR
26.19
where 0 is the regularization constant.
Letbe a parameter of the regression model. Typical regularization functions
include the L2 norm, the L1 norm, or even a combination of these, called the elasticnet: RL22 RL11 Relastic2 11
with 0,1.
26.4.1 L2 Regularization for Deep Learning
We now consider approaches for regularizing the deep learning models. We first consider the case of a multilayer perceptron with one hidden layer, and then generalize it to multiple hidden layers. Note that while our discussion of regularization is in the context of MLPs, since RNNs are trained via unfolding, and CNNs are essentially sparse MLPs, the methods described here can easily be generalized for any deep neural network.
MLP with One Hidden Layer
Consider regularization in the context of a feedforward MLP with a single hidden layer as shown in Figure 26.24. Let the input xRd , the hidden layer zRm and let yoRp. The set of all the parameters of the model are
Wh,bh,Wo,bo
Whereas it makes sense to penalize large weights, we usually do not penalize the bias
terms since they are just thresholds that shift the activation function and there is no l0 l1 l2
Wh,bh Wo,bo
Figure 26.24. Multilayer perceptron with one hidden layer.
x
z
o

714 Deep Learning need to force them to be small values. The L2 regularized objective is therefore given
as
min JEx RL2Wo,WhEx Wh2F Wo2F 22
Here we added the factor 12 to the regularization term for convenience. For the L2 norm of the weight matrices, we use the Frobenius norm, which has the usual sense of L2norm, since for an nm matrix A, it is defined as
n m A2a2
F ij i1 j1
where aij is the i,jth element of A. The regularized objective tries to minimize the individual weights for pairs of neurons between the input and hidden, and hidden and output layers. This has the effect of adding some bias to the model, but possibly reducing variance, since small weights are more robust to changes in the input data in terms of the predicted output values.
The regularized objective has two separate terms, one for the loss and the other for the L2 norm of the weight matrices. Recall that we have to compute the weight gradients wij and the bias gradients bj by computing
wijJEx RL2Wo,Whjziwij wij wij 2 wij
bj JEx RL2Wo,WhEx j
bj bj 2 bj whereweuseEq.25.31tonotethat Ex j zi and Ex j,andwherejEx is
bj
wijbjnetj
the net gradient. Further, since the squared L2 norm of a weight matrix is simply the sum of the squared weights, only the term w2 matters, and all other elements are just
ij
constant with respect to the weight wij between neurons i and j in Wh or Wo. Across
all the neuron pairs between the hidden and output layer, we can write the update rule compactly as follows:
W oz oTW ob o o
where o is the net gradient vector for the output neurons, and z is the vector of hidden layer neuron values. The gradient update rule using the regularized weight gradient matrix is given as
Wo Wo Wo Wo zoT WoWo Wo zoT 1 W o z oT
L2 regularization is also called weight decay, since the updated weight matrix uses decayed weights from the previous step, using the decay factor 1.
In a similar manner we get the weight and bias gradients between the input and hidden layers:
W hx hTW hb h h

Regularization
715
l 0
W0,b0
l 1
l 2

l h
l h1
x
z1
z2
W1,b1
Figure 26.25. Deep multilayer perceptron.
Wh,bh
The update rule for the weight matrix between the input and hidden layers is therefore given as
W hW h W h 1 W h x hT
where h is the net gradient vector for the hidden neurons, and x is the input vector.
Deep MLPs
Consider the deep MLP shown in Figure 26.25. We denote the input neurons as layer l0, the first hidden layer as l1, the last hidden layer as lh, and the final output layer as lh1. The vector of neuron values for layer l for l0, ,h1 is denoted as
z l z 1l ,, z nlT l
where nl is the number of neurons in layer l. Thus, xz0 and ozh1. The weight matrix between neurons in layer l and layer l 1 is denoted WlRnl nl1 , and the vector of bias terms from the bias neuron z0l to neurons in layer l1 is denoted blRnl1 , for l1,, h1.
Given the error function Ex, the L2 regularized objective function is min JEx RL2 W0,W1,,Wh
2h E x W l2F
2 l0
where the set of all the parameters of the model is W0,b0,W1,b1, ,Wh,bh. Based on the derivation for the one hidden layer MLP from above, the regularized gradient is given as:
Wl zll1TWl 26.20 and the update rule for weight matrices is
Wl Wl Wl 1Wl zl l1T 26.21
for l0,1, ,h, where where l is the net gradient vector for the hidden neurons in layer l. We can thus observe that incorporating L2 regularization within deep MLPs is relatively straightforward. Likewise, it is easy to incorporate L2 regularization in other models like RNNs, CNNs, and so on. For L1 regularization, we can apply the subgradient approach outlined for L1 regression or Lasso in Section 23.6.

zh
o

716 Deep Learning 26.4.2 Dropout Regularization
The idea behind dropout regularization is to randomly set a certain fraction of the neuron values in a layer to zero during training time. The aim is to make the network more robust and to avoid overfitting at the same time. By dropping random neurons for each training point, the network is forced to not rely on any specific set of edges. From the perspective of a given neuron, since it cannot rely on all its incoming edges to be present, it has the effect of not concentrating the weight on specific input edges, but rather the weight is spread out among the incoming edges. The net effect is similar to L2 regularization since weight spreading leads to smaller weights on the edges. The resulting model with dropout is therefore more resilient to small perturbations in the input, which can reduce overfitting at a small price in increased bias. However, note that while L2 regularization directly changes the objective function, dropout regularization is a form of structural regularization that does not change the objective function, but instead changes the network topology in terms of which connections are currently active or inactive.
MLP with One Hidden Layer
Consider the one hidden layer MLP in Figure 26.24. Let the input xRd , the hidden layer zRm and let yoRp . During the training phase, for each input x, we create a random mask vector to drop a fraction of the hidden neurons. Formally, let r0, 1 be the probability of keeping a neuron, so that the dropout probability is 1r . We create a mdimensional multivariate Bernoulli vector u0,1m, called the masking vector, each of whose entries is 0 with dropout probability 1r , and 1 with probability r . Let uu1,u2, ,umT, where
ui0 with probability 1r 1 with probability r
The feedforward step is then given as
zf hb hW hT xz uz
of ob oW oT z
whereis the elementwise multiplication. The net effect is that the masking vector zeros out the ith hidden neuron in zif ui0. Zeroing out also has the effect that during the backpropagation phase the error gradients do not flow back from the zeroed out neurons in the hidden layer. The effect is that any weights on edges adjacent to zeroed out hidden neurons are not updated.
Inverted Dropout There is one complication in the basic dropout approach above, namely, the expected output of hidden layer neurons is different during training and testing, since dropout is not applied during the testing phase after all, we do not want the predictions to be randomly varying on a given test input. With r as the probability of retaining a hidden neuron, its expected output value is
Ezirzi 1r0rzi

Further Reading 717
On the other hand, since there is no dropout at test time, the outputs of the hidden neurons will be higher at testing time. So one idea is to scale the hidden neuron values by r at testing time. On the other hand, there is a simpler approach called inverted dropout that does not need a change at testing time. The idea is to rescale the hidden neurons after the dropout step during the training phase, as follows:
zfb hW hT xz 1 uz
rT of boWoz
With the scaling factor 1r, the expected value of each neuron remains the same as without dropout, since
Dropout in Deep MLPs
Ezi 1rzi1r 0zi r
Dropout regularization for deep MLPs is done in a similar manner. Let rl0, 1, for l1,2, ,h denote the probability of retaining a hidden neuron for layer l, so that 1rl is the dropout probability. One can also use a single rate r for all the layers by setting rlr. Define the masking vector for hidden layer l, ul0,1nl , as follows:
uli0 with probability 1rl 1 with probability rl
The feedforward step between layer l and l1 is then given as z lfb lW l T zl1
zl1 u lz lrl
26.22
using inverted dropout. Usually, no masking is done for the input and output layers, so we can set r01 and rh11. Also note that there is no dropout at testing time. The dropout rates are hyperparameters of the model that have to be tuned on a separate validation dataset.
26.5 FURTHER READING
The backpropagation through time algorithm was introduced by Werbos 1990. Bidirectional RNNs were proposed by Schuster and Paliwal 1997, and LSTMs were proposed by Hochreiter and Schmidhuber 1997, with the forget gate introduced by Gers, Schmidhuber, and Cummins 2000. Convolutional neural networks trained via backpropagation were proposed by LeCun, Bottou, Bengio, and Haffner 1998, with application to handwritten digit recognition. Dropout regularization was introduced by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov 2014.

718 Deep Learning
Gers, F. A., Schmidhuber, J., and Cummins, F. 2000. Learning to forget: Continual prediction with LSTM. Neural Computation, 12 10, 24512471.
Hochreiter, S. and Schmidhuber, J. 1997. Long shortterm memory. Neural Compu tation, 9 8, 17351780.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. 1998. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86 11, 22782324.
Schuster, M. and Paliwal, K. K. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45 11, 26732681.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15 1, 19291958.
Werbos, P. J. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78 10, 15501560.
26.6 EXERCISES
Q1. Consider the RNN architecture in Figure 26.26. Note that an edge labeled 1 indicates dependence on the previous time step, whereas 2 indicates dependence on values from two time steps back. Use f o and f h as the activation functions for output and hidden layers. Letbe the longest sequence length. Answer the following questions:
a UnfoldtheRNNforthreestepsandshowalltheconnections.
b Listalltheparametersrequiredinthismodel,includingtheweightmatricesand
bias vectors. Next write the forward propagation equations to compute ot .
c Writetheequationforthenetgradientvectorot attheoutputneuronsattimet.
d Writetheequationforthenetgradientvectorht atthehiddenneuronsattimet.
Q2. DerivethenetgradientequationsforbackpropagationinanRNNwithaforgetgate using vector derivatives. That is, derive equations for the net gradients at the output ot,updateut,forgett,andhiddenht layersbycomputingthederivativeoftheerror function Ext with respect to netot , netut , nett and netht , the net inputs at the output, update, forget and hidden layers, respectively, at time t.
1
1
xt ht ot
2
Figure 26.26. RNN architecture for Q1.

Exercises 719
Q3. Derive the net gradient equations at for backpropagation in an LSTM using vector derivatives of the error function Ex with respect to netta, where ao,h,c,u,,,. That is, for the output, hidden state, internal memory, and candidate update layers, as well as for the input, forget and output gate layers.
Q4. Showthatwithstrides,intheconvolutionofXRnnmandWRkkm,thereare t1t1 possible windows, where t nk .
s
Q5. Consider a CNN that takes as input substrings of length 8 over the alphabet A,B,C,D, and predicts whether the class is positive P or negative N. Assume we use two 1D filters with window size k4 and stride s2, with weights w11, 0, 1, 1T and w21, 1, 0, 1T , respectively. Assume bias is zero. The 1D convolution is followed by a maxpooling operation with window size k2 and stride s1. All of the convolutions use linear activation function, and there is no padding used. Finally, all of the neurons after the maxpooling feed into a single output neuron that uses a sigmoid activation function with bias 15 and all weights set to 1. Given input sequence ABACABCD, show the neuron values after the convolution and maxpooling layer, and the final output. Use onehot encoding for the input alphabet in lexicographic order, and let P1 and N0. What is the final output on the given input sequence.
a Input tensor: X
b 3D filter W Figure 26.27. 3D tensor for Q6.
Q6. Giventhe3DtensorinFigure26.27a.Usingpaddingp1andstrides2,answer the following questions:
a ComputetheconvolutionofXwiththe333maskWinFigure26.27b.
b Computethemaxpoolingoutputusinga331maskWofallones.
channel 1
1
1
3
1
1
2
4
4
1
2
2
2
3
1
1
1
2
2
3
1
4
1
1
2
2
channel 2
1
4
3
3
2
2
2
2
2
2
1
1
4
1
1
1
2
3
1
1
1
4
1
1
2
channel 3
1
2
4
3
2
2
2
4
2
1
1
1
3
2
4
1
4
3
2
3
1
1
1
4
4
channel 1
1
0
0
0
1
0
0
0
1
channel 2
0
0
1
0
1
0
1
0
0
channel 3
1
1
1
1
1
1
1
1
1

CHAPTER 27 RegressionEvaluation
Given a set of predictor attributes or independent variables X1,X2, ,Xd, and given the response attribute Y, the goal of regression is to learn a f , such that
YfX1,X2,,XdfX
where XX1,X2, ,XdT is the ddimensional multivariate random variable comprised of the predictor variables. Here, the random variabledenotes the inherent error in the response that is not explained by the linear model.
When estimating the regression function f , we make assumptions about the form of f , for example, whether f is a linear or some nonlinear function of the parameters of the model. For example, in linear regression, we assume that
f X1X1. . .dXd
whereisthebiasterm,andi istheregressioncoefficientforattributeXi.Themodel is linear, since fX is a linear function of the parameters ,1, ,d.
Once we have estimated the bias and coefficients, we need to formulate a probabilistic model of regression to evaluate the learned model in terms of goodness of fit, confidence intervals for the parameters, and to test for the regression effects, namely whether X really helps in predicting Y. In particular, we assume that even if the value of X has been fixed, there can still be uncertainty in the response Y. Further, we will assume that the erroris independent of X and follows a normal or Gaussian distribution with mean 0 and variance 2, that is, we assume that the errors are independent and identically distributed with zero mean and fixed variance.
The probabilistic regression model comprises two componentsthe deterministic component comprising the observed predictor attributes, and the random error component comprising the error term, which is assumed to be independent of the predictor attributes. With the assumptions on the form of the regression function f , and the error variable , we can answer several interesting questions such as how good of a fit is the estimated model to the input data? How close are the estimated bias and coefficients to the true but unknown parameters, and so on. We consider such questions in this chapter.
720

Univariate Regression 721
27.1 UNIVARIATE REGRESSION
In univariate regression we have one dependent attribute Y and one independent attribute X, and we assume that the true relationship can be modeled as a linear function
YfXX
whereis the slope of the best fitting line andis its intercept, andis the random
error variable that follows a normal distribution with mean 0 and variance2 .
Mean and Variance of Response Variable
Consider a fixed value x for the independent variable X. The expected value of the response variable Y given x is
EYXxExxEx
The last step follows from our assumption that E0. Also, since x is assumed to be fixed, andandare constants, the expected value Exx. Next, consider the variance of Y given Xx, we have
varYXxvarxvarxvar02 2
Here varx0, since ,and x are all constants. Thus, given Xx, the response variable Y follows a normal distribution with mean EYXxx, and variance varYXx2.
Estimated Parameters
The true parameters ,and 2 are all unknown, and have to be estimated from the training data D comprising n points xi and corresponding response values yi, for i1,2, ,n. Let b and w denote the estimated bias and weight terms; we can then make predictions for any given value xi as follows:
y ibwx i
The estimated bias b and weight w are obtained by minimizing the sum of squared
errors SSE, given as
n
n i1
yi yi2
with the least squares estimates given as see Eq. 23.11 and Eq. 23.8
According to our model, the variance in prediction is entirely due to the random error term . We can estimate this variance by considering the predicted value yi and its deviation from the true response yi , that is, by looking at the residual error
o iy iy i
yi bwxi2 wXY bYwX
SSE
i1
X2 27.1.1 EstimatingVariance2

722 Regression Evaluation One of the properties of the estimated values b and w is that the sum of residual
errors is zero, since
nn
oiyi bwxi i1 i1
nyi Y wX wxi
i1
n n
yinYwnXxi i1 i1
nY nY wnX nX0 27.1 Thus, the expected value of oi is zero, since Eoi 1 n oi0. In other words, the
n i1
sum of the errors above and below the regression line cancel each other.
Next, the estimated variance2 is given as
1n 2 1n 1n
Thus, the estimated variance is
2S S E n2
We divide by n2 to get an unbiased estimate, since n2 is the number of degrees of freedom for estimating SSE see Figure 27.2. In other words, out of the n training points, we need to estimate two parametersand , with n2 remaining degrees of freedom.
The squared root of the variance is called the standard error of regression
S S E2 7 . 3
n2
27.1.2 Goodness of Fit
The SSE value gives an indication of how much of the variation in Y cannot be explained by our linear model. We can compare this value with the total scatter, also called total sum of squares, for the dependent variable Y, defined as
n i1
Notice that in TSS, we compute the squared deviations of the true response from the true mean for Y, whereas, in SSE we compute the squared deviations of the true response from the predicted response.
2v a ro i n2o iEo i n2o i2n2
i1 i1 i1
y iy i2
2 7 . 2
TSS
yi Y2

Univariate Regression 723 The total scatter can be decomposed into two components by adding and
subtracting yi as follows
TSS

n
yi Y2yi yi yi Y2
i1
n n n
yi yi2yi Y2 2 yi yiyi Y i1 i1 i1
n i1
nn
22

whereweusethefactthat ni1yi yiyi Y0,and
yi yiyi Y SSERSS i1
i1
is a new term called regression sum of squares that measures the squared deviation of
the predictions from the true mean. TSS can thus be decomposed into two parts: SSE,
which is the amount of variation not explained by the model, and RSS, which is the
amount of variance explained by the model. Therefore, the fraction of the variation
left unexplained by the model is given by the ratio SSE . Conversely, the fraction of TSS
the variation that is explained by the model, called the coefficient of determination or simply the R2 statistic, is given as
R2TSSSSE 1 SSERSS 27.4 TSS TSS TSS
The higher the R2 statistic the better the estimated model, with R20, 1.
R S S
n i1
y i Y2
Example 27.1 Variance and Goodness of Fit. Consider Example 23.1 that shows the regression of petal length X; the predictor variable on petal width Y; the response variable for the Iris dataset. Figure 27.1 shows the scatterplot between the two attributes. There are a total of n150 data points. The least squares estimates for the bias and regression coefficients are as follows
w0.4164 b0.3665 The SSE value is given as
150 150
SSEoi2 yi yi2 6.343
i1 i1
Thus, the estimated variance and standard error of regression are given as
2 SSE 6.343 24.28610
n2 1 4 8
SSE 4.2861020.207 n2

724
Regression Evaluation
2.5 2.0 1.5 1.0 0.5
0 0.5

0 1.0
Figure27.1. Scatterplot:petallengthXversuspetalwidthY.Meanpointshownasblackcircle.
2.0
3.0
X: petal length
6.0
7.0
Geometry of Goodness of Fit
Recall that Y can be decomposed into two orthogonal parts as illustrated in Figure 27.2a.
YYo
where Y is the projection of Y onto the subspace spanned by 1,X. Using the fact
that this subspace is the same as that spanned by the orthogonal vectors 1,X, with XXX1, we can further decompose Y as follows
Yp r o j 1Y 1p r o j XY X Y1Y T XX Y1wX2 7 . 5X TX
Likewise, the vectors Y and Y can be centered by subtracting their projections along the vector 1
YY Y1 YY Y1wX
where the last step follows from Eq. 27.5. The centered vectorsY,Y andX all lie in the n1 dimensional subspace orthogonal to the vector 1, as illustrated in Figure 27.2b.
4.0 5.0
For the bivariate Iris data, the values of TSS and RSS are given as TSS86.78 RSS80.436
We can observe that TSSSSERSS. The fraction of variance explained by the model, that is, the R2 value, is given as
R2RSS80.4360.927 TSS 86.78
This indicates a very good fit of the linear model.
Y: petal width

Univariate Regression
725
x1 x1Y Y
oYY
1
Y
o

x2
xn
1
x2
xn

Y YY
XX XX
a Uncentered b Centered
Figure 27.2. Geometry of univariate regression: uncentered and centered vectors. The vector space that is the complement of 1 has dimensionality n1. The error space containing the vector ois orthogonal to X , and has dimensionality n2, which also specifies the degrees of freedom for the estimated variance2 .
In this subspace, the centered vectors Y and Y, and the error vector o form a right triangle, since Y is the orthogonal projection of Y onto the vector X . Noting that oYYY Y , by the Pythagoras theorem, we have
2 2 2 22 Y Y oY YY
This equation is equivalent to the decomposition of the total scatter, TSS, into sum of squared errors, SSE, and residual sum of squares, RSS. To see this, note that the total scatter, TSS, is defined as follows:
n
2 22 TSS yiYYY1Y
i1
The residual sum of squares, RSS, is defined as
n
22 2 RSS yiYYY1Y
i1
Finally, the sum of squared errors, SSE, is defined as
S S E o2 YY2
Thus, the geometry of univariate regression makes it evident that
2 2
Y Y YY
2
YY 12 YY 12 YY2 TSSRSSSSE

726 Regression Evaluation

Notice further that since Y , Y and o form a right triangle, the cosine of the angle
betweenY andY is given as the ratio of the base to the hypotenuse. On the other hand, via Eq. 2.30, the cosine of the angle is also the correlation between Y and Y, denoted
YY. Thus, we have:

YY Yc o s
Y
Y Y YY
Notethat, whereas YY1, due to the projection operation, the angle between Y and Y is always less than or equal to 90, which means that YY0,1 for univariate regression. Thus, the predicted response vector Y is smaller than the true response vector Y by an amount equal to the correlation between them. Furthermore, by Eq. 27.4, the coefficient of determination is the same as the squared correlation between Y and Y
We can observe that
2 R2RSSY 2
TSS2 YY Y
Example 27.2 Geometry of Goodness of Fit. Continuing Example 27.1, the corre
lation between Y and Y is given as

Y Yc o s9 . 3 1 60 . 9 6 3
Y Y

80.436 8.969 86.78
RSS TSS
The square of the correlation is equivalent to R2, since 2 0.96320.927
YY The angle between Y and Y is given as
cos10.96315.7 The relatively small angle indicates a good linear fit.
27.1.3 Inference about Regression Coefficient and Bias Term
The estimated values of the bias and regression coefficient, b and w, are only point estimates for the true parametersand . To obtain confidence intervals for these parameters, we treat each yi as a random variable for the response given the corresponding fixed value xi. These random variables are all independent and identicallydistributedasY,withexpectedvaluexi andvariance2.Ontheother hand, the xi values are fixed a priori and therefore X and X2 are also fixed values.

Univariate Regression
727
We can now treat b and w as random variables, with bw
YX
ni1xi Xyi Y 1 n
n
where ci is a constant since xi is fixed, given as cixiX
sX
w n x 2 s xiXyi i1iX Xi1 i1
ciyi
27.6
and sXni1xiX2 is the total scatter for X, defined as the sum of squared deviations of xi from its mean X. We also use the fact that
n
i1
Note that
n 1n
n
xi XY Yxi X0
i1 X i1
i1
cis xi X0
Mean and Variance of Regression Coefficient
The expected value of w is given as
nn n
EwE ciyici Eyi cixi
i1 i1 i1
nnn
cici xisXxi XxisX sXi1 i1 i1
which follows from the observation that ni1 ci0, and further n nn
sXxi X2
i1 i1 i1
varwvar
i1
27.7
27.8
sinceci isaconstant,varyi2,andfurther
n
n i1
2 ci2s
X
xi2 n2Xxi Xxi
Thus, w is an unbiased estimator for the true parameter .
Using the fact that the variables yi are independent and identically distributed as
Y, we can compute the variance of w as follows
n
ci2 varyi2n1n sX1
ci yi
ci2sX2xiX2sX2 sX
i1
i1 i1
The standard deviation of w, also called the standard error of w, is given as
sewvarw
s
X

728 Regression Evaluation Mean and Variance of Bias Term
The expected value of b is given as
1 n
EbEYwXE n yiwX 1ni11n
n Eyi XEw n xi X i1 i1
X X
Thus, b is an unbiased estimator for the true parameter .
Using the observation that all yi are independent, the variance of the bias term can
be computed as follows
varbvarYwX
1n
var n yi varXw
i1
1 n22Xvarw122X2
n2 nsX 1 2X2
n sX
where we used the fact that for any two random variables A and B, we have varABvarAvarB. That is, variances of A and B add, even though we are computing the variance of AB.
The standard deviation of b, also called the standard error of b, is given as
sebvarb 12X n sX
Covariance of Regression Coefficient and Bias
We can also compute the covariance of w and b, as follows covw,bEwbEwEbEY wXw
27.9
Y EwX Ew2 Y X varwEw2
2
YX
sX
X 2 sX

2 X 2YX

sX
where we use the fact that varwEw2Ew2, which implies Ew2varwEw2,andfurtherthatYX .

Univariate Regression 729
Confidence Intervals
Since the yi variables are all normally distributed, their linear combination also follows a normal distribution. Thus, w follows a normal distribution with meanand variance 2sX. Likewise, b follows a normal distribution with meanand variance 1n2XsX2.
Since the true variance2 is unknown, we use the estimated variance2 from Eq. 27.2, to define the standardized variables Zw and Zb as follows
ZwwEww ZbbEb b 27.10
sew
These variables follow the Students t distribution with n2 degrees of freedom. Let Tn2 denote the cumulative t distribution with n2 degrees of freedom, and let t2 denote the critical value of Tn2 that encompasses 2 of the probability mass in the right tail. That is,
PZt2 orequivalentlyTn2t21 22
s X s eb1n 2Xs X
Since the t distribution is symmetric, we have
PZt21 orequivalentlyTn2t2
22
Given confidence level 1, i.e., significance level 0,1, the 1001
confidence interval for the true values,and , are therefore as follows Pwt2sewwt2sew1
Pbt2sebbt2seb1
Example 27.3 Confidence Intervals. Continuing with Example 27.1, we consider the variance of the bias and regression coefficient, and their covariance. However, since we do not know the true variance 2, we use the estimated variance and the standard error for the Iris data
2SSE 2 4.28610
n2
4.2861020.207
Furthermore, we have
X3.7587 sX463.864
Therefore, the estimated variance and standard error of w is given as 2 4.286102 5
varwsX463.8649.2410
sewvarw9.241059.613103

730 Regression Evaluation
The estimated variance and standard error of b is v a rb1 2X2
n sX
1 3.75924.286102
150 463.864
3.7121024.2861021.591103
sebvarb1.5911033.989102 and the covariance between b and w is
X 2 3.75874.286102 4 covw,b sX463.864 3.47310
For the confidence interval, we use a confidence level of 10.95 or 0.05. The critical value of the tdistribution, with n2148 degrees of freedom, that encompasses 20.025 fraction of the probability mass in the right tail is t21.976. We have
t2sew1.9769.6131030.019
Therefore, the 95 confidence interval for the true value, , of the regression
coefficient is given as
wt2sew, wt2sew0.41640.019,0.41640.019
0.397, 0.435
Likewise, we have:
t2seb1.9763.9891020.079 Therefore, the 95 confidence interval for the true bias term, , is
bt2seb, bt2seb0.36650.079,0.36650.0790.446, 0.288
27.1.4 Hypothesis Testing for Regression Effects
One of the key questions in regression is whether X predicts the response Y. In the regression model, Y depends on X through the parameter , therefore, we can check for the regression effect by assuming the null hypothesis H0 that 0, with the alternative hypothesis Ha being 0:
H0 : 0 Ha : 0
When 0, the response Y depends only on the biasand the random error . In other words, X provides no information about the response variable Y.

Univariate Regression 731 Now consider the standardized variable Zw from Eq. 27.10. Under the null
hypothesis we have Ew0. Thus,
wEw w
We can therefore compute the pvalue for the Zw statistic using a twotailed test via the t distribution with n2 degrees of freedom. Given significance levele.g., 0.01, we reject the null hypothesis if the pvalue is below . In this case, we accept the alternative hypothesis that the estimated value of the slope parameter is significantly different from zero.
We can also define the f statistic, which is the ratio of the regression sum of squares, RSS, to the estimated variance, given as
Zw sew
s 27.11
X
R S Sni1y i Y2
f2 n 2 i1yi yi
Under the null hypothesis, one can show that ERSS2
Further, it is also true that
E22
Thus, under the null hypothesis the f statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then ERSS2, resulting in a larger f value. In fact, the f statistic follows a F distribution with 1, n2 degrees of freedom for the numerator and denominator, respectively; therefore, we can reject the null hypothesis that w0 if the pvalue of f is less than the significance level , say 0.01.
Interestingly the f test is equivalent to the t test since Zw2f . We can see this as follows:
27.12
1 n
f2
i1
1 n
2
i1
1 n
yi Y22bwxi Y2
i1
Y wX wxi Y22
n2
1 n w 2s X 2w2 xiX2 2
i1 w2 2 2s XZ w
Test for Bias Term
1 ni1
2
wxi X
Note that we can also test if the bias value is statistically significant or not by setting up the null hypothesis, H0 : 0, versus the alternative hypothesis Ha : 0. We then

732 Regression Evaluation evaluate the Zb statistic see Eq. 27.10 under the null hypothesis:
ZbbEb b 27.13 s eb 1n 2Xs X
since, under the null hypothesis Eb0. Using a twotailed ttest with n2 degrees of freedom, we can compute the pvalue of Zb. We reject the null hypothesis if this value is smaller than the significance level .
Example 27.4 Hypothesis Testing. We continue with Example 27.3, but now we test for the regression effect. Under the null hypothesis we have 0, further Ew0. Therefore, the standardized variable Zw is given as
ZwwEww0.4164 43.32 sew sew 9.613103
Using a twotailed t test with n2 degrees of freedom, we find that pvalue43.320
Since this value is much less than the significance level 0.01, we conclude that observing such an extreme value of Zw is unlikely under the null hypothesis. Therefore, we reject the null hypothesis and accept the alternative hypothesis that 0.
Now consider the f statistic, we have
fRSS80.4361876.71
2 4.286102
Using the F distribution with 1, n2 degrees of freedom, we have
pvalue1876.710
In other words, such a large value of the f statistic is extremely rare, and we can reject the null hypothesis. We conclude that Y does indeed depend on X, since 0. Finally, we test whether the bias term is significant or not. Under the null
hypothesis H0 : 0, we have
Zbb0.36659.188
seb 3.989102 Using the twotailed ttest, we find
pvalue9.1883.351016
It is clear that such an extreme Zb value is highly unlikely under the null hypothesis. Therefore, we accept the alternative hypothesis that Ha : 0.

Univariate Regression 733 27.1.5 StandardizedResiduals
Our assumption about the true errors i is that they are normally distributed with mean 0 and fixed variance 2. After fitting the linear model, we can examine how well the residual errors oiyiyi satisfy the normality assumption. For this, we need to compute the mean and variance of oi , by treating it as a random variable.
The mean of oi is given as
EoiEyi yiEyiEyi
xi Ebwxixi xi0
which follows from the fact that Eb and Ew.
To compute the variance of oi , we will express it as a linear combination of the yj
variables, by noting that
1n1n n
nx j X
ws xj yjnXYs xj yjXyjsyj Xj1 Xj1j1 j1X
n1
bYwX j 1 nyj wX
Therefore, we can express oi , as follows
oi yi yi yi bwxi yi n 1yj wX wxi
j1 n yin 1yjxiXw
j1 n
yi n 1yj n xi Xxj X yj
j1 n j1 sX
11xiX 2 yi1xiX xjXyj
27.14 where we have separated yi from the rest of the yj s, so that all terms in the summation
nsX jinsX are independent. Define aj as follows:
aj 1xiXxjX n sX
Rewriting Eq. 27.14 in terms of aj , we get
varoivar 1aiyiaj yj
j i
1ai2 varyi aj2 varyj j i
27.15

734
Regression Evaluation
2 12ai ai2 aj2, j i
n2 12ai aj2
j1 Consider the term nj1 aj2, we have
since varyivaryj2
27.16
n 1 xi Xxj X2 aj2nsX
n
j1 j1
n 1 2xiXxjXxiX2xjX2 j1 n 2 ns X s X2
1 2xiXn xiX2n
nns xj X s2 xj X2
X j1 X j1 sincenj1xj X0andnj1xj X2 sX,weget
n 1 xiX2 aj2ns
j1 X
Plugging Equations 27.15 and 27.17 into Eq. 27.16, we get
22 2xi X2 1 xi X2varoi1n sX n sX
211xiX 2n sX
27.17
We can now define the standardized residual oi by dividing oi by its standard deviation after replacing2 by its estimated value2 . That is,
oi oi oi 27.18 varoi 1 xi X2
These standardized residuals should follow a standard normal distribution. We can thus plot the standardized residuals against the quantiles of a standard normal distribution, and check if the normality assumption holds. Significant deviations would indicate that our model assumptions may not be correct.
1n sX
Example 27.5 Standardized Residuals. Consider the Iris dataset from Exam ple 27.1, with the predictor variable petal length and response variable petal width, and n150. Figure 27.3a shows the quantilequantile QQ plot. The yaxis is the list of standardized residuals sorted from the smallest to the largest. The xaxis

Multiple Regression
735
2
0 2
2
0
2
246
X b X vs. o
2 0 2 Q
a QQplot
Figure 27.3. Residual plots: a QuantileQuantile scatter plot of normal quantiles versus standardized residuals. b Independent variable versus standardized residuals.
is the list of the quantiles of the standard normal distribution for a sample of size n, defined as
Qq1,q2,,qnT qi F1i0.5
n
where F is the cumulative distribution function CDF, and F 1 is the inverse CDF or quantile function see Eq. 2.2 for the normal distribution. Thus, the Q values are also sorted in increasing order from smallest to largest. If the standardized residuals follow a normal distribution, then the QQ plot should follow a straight line. Figure 27.3a plots this perfect line for comparison. We can observe that the residuals are essentially normally distributed.
The plot of the independent variable X versus the standardized residuals is also instructive. We can see in Figure 27.3b that there is no particular trend or pattern to the residuals, and the residual values are concentrated along the mean value of 0, with the majority of the points being within two standard deviations of the mean, as expected if they were sampled from a normal distribution.
27.2 MULTIPLE REGRESSION
In multiple regression there are multiple independent attributes X1,X2, ,Xd and a single dependent or response attribute Y, and we assume that the true relationship can be modeled as a linear function
Y1 X1 2 X2 d Xd
whereistheinterceptorbiastermandi istheregressioncoefficientforattributeXi. Recallthati denotestheexpectedincreaseinYwithaunitincreaseinthevalueofXi, assuming all other variables are held fixed. We assume thatis a random variable that
o
o

736 Regression Evaluation
is normally distributed with mean 0 and variance2 . Further, we assume that the errors for different observations are all independent of each other, and consequently the observed responses are also independent.
Mean and Variance of Response Variable
Let XX1,X2, ,XdTRd denote the multivariate random variable comprising the independent attributes. Let xx1,x2, ,xdT be some fixed value of X, and let 1,2, ,d T. The expected response value is then given as
dEYXxE1 x1 d xd Ei xi E
i1 11d xd Tx
which follows from the assumption that E0. The variance of the response variable is given as
d d
varYXxvari xivari xi var02 2
i1 i1
which follows from the assumption that all xi are fixed a priori. Thus, we conclude that YalsofollowsanormaldistributionwithmeanEYx di1i xi Txand variance varYx 2 .
Estimated Parameters
The true parameters , 1,2, ,d and 2 are all unknown, and have to be estimated from the training data D comprising n points xi and corresponding response values yi , for i1,2, ,n. We augment the data matrix by adding a new column X0 with all values fixed at 1, that is, X01. Thus, the augmented data D Rnd1 comprises the d1 attributes X0,X1,X2, ,Xd, and each augmented point is given as x i1,xi1,xi2, ,xid T.
Let bw0 denote the estimated bias term, and let wi denote the estimated regression weights. The augmented vector of estimated weights, including the bias term, is
ww0,w1,,wdT
We then make predictions for any given point xi as follows:
y ib1w 1x i 1 w dx i dwT xi
Recall that these estimates are obtained by minimizing the sum of squared errors
SSE, given as
n nd2 S S E y iy i2 y ibw jx i j
i1 i1 j1

Multiple Regression
737
with the least squares estimate Eq. 23.20 given asT 1 T
w D D D Y The estimated variance2 is then given as
SSE 1 n
y iy i2
We divide by nd 1 to get an unbiased estimate, since nd 1 is the number of degrees of freedom for estimating SSE see Figure 27.5. In other words, out of the n training points, we need to estimate d1 parameters,and the i s, with nd1 remaining degrees of freedom.
Estimated Variance is Unbiased We now show that2 is an unbiased estimator of the true but unknown variance 2. Recall from Eq. 23.18 that
YD wD D TD 1D TYHY
where H is the nn hat matrix assuming that DTD1 exists. Note that H is an orthogonal projection matrix, since it is symmetric HTH and idempotent H2H. The hat matrix H is symmetric since
HTDDTD1DTTDTTDTDT1DTH and it is idempotent since
H2 D D TD 1D TD D TD 1D T D D TD 1D T H Furthermore, the trace of the hat matrix is given as
trHtrDDTD1DTtrDTDDTD1trId1d1
where Id 1 is the d1d1 identity matrix, and we used the fact that the trace of a product of matrices is invariant under cyclic permutations.
Finally, note that the matrix IH is also symmetric and idempotent, where I is the nn identity matrix, since
2n d1 nd1
2 7 . 1 9
IHT IT HT IH
IH2 IHIHIHHH2 IH
Now consider the squared error; we have
SSEYY2 YHY2 IHY2
YTIHIHYYTIHY However, note that the response vector Y is given as
YD
27.20
i1

738 Regression Evaluation
where0 , 1 ,, d T is the true augmented vector of parameters of the model, and 1,2, ,nT is the true error vector, which is assumed to be normally distributed with mean E0 and with fixed variance i 2 for each point, so that cov2I. Plugging the expression of Y into Eq. 27.20, we get
S S EY TIHY D TIH D D T IHDIH
0
IHTD TIHDTIHDTIH TIH
0

where we use the observation that
IHD D H D DD DT D1 DTD D D 0
Let us consider the expected value of SSE; we have ESSEETIH
n nn n nn 2 2
E ihijijEi hijEij i1 i1 j1 i1 i1 j1
n
1hiiEi2,sincei areindependent,andthereforeEij0n2 2 2

n hii ntrHnd1
i1
i1
whereweusedthefactthat2 variEi2Ei2 Ei2,sinceEi0.It follows that
2 E SSE1 ESSE 1 nd12 2 27.21 nd 1 nd 1 nd 1
27.2.1 Goodness of Fit
Following the derivation in Section 27.1.2, the decomposition of the total sum of squares, TSS, into the sum of squared errors, SSE, and the residual sum of squares, RSS, holds true for multiple regression as well:
TSSSSERSS
n n n
yi Y2yi yi2yi Y2 i1 i1 i1

Multiple Regression
739
X1
Y

Figure27.4. Multipleregression:sepallengthX1andpetallengthX2withresponseattributepetal width Y. The vertical bars show the residual errors for the points. Points in white are above the plane, whereas points in gray are below the plane.
The coefficient of multiple determination, R2, gives the goodness of fit, measured as the fraction of the variation explained by the linear model:
R2 1 SSETSSSSERSS 27.22 TSS TSS TSS
One of the potential problems with the R2 measure is that it is susceptible to increase as the number of attributes increase, even though the additional attributes may be uninformative. To counter this, we can consider the adjusted coefficient of determination, which takes into account the degrees of freedom in both TSS and SSE
2 SSEnd 1 n1SSE
Ra 1 TSSn1 1nd1TSS 27.23
We can observe that the adjusted R2a measure is always less than R2, since the ratio n11. If there is too much of a difference between R2 and R2 , it might indicate that
nd1 a
there are potentially many, possibly irrelevant, attributes being used to fit the model.
Example 27.6 Multiple Regression: Goodness of Fit. Continuing with multiple regression from Example 23.3, Figure 27.4 shows the multiple regression of sepal length X1and petal length X2on the response attribute petal width Y for the Iris dataset with n150 points. We also add an ext ra attribute X01150, which is a vector of all ones in R150. The augmented dataset DR1503 comprises n150 points, and three attributes X0, X1 and X2.

X2

740 Regression Evaluation
Theuncentered33scattermatrixD TDanditsinversearegivenas
150.0 876.50 563.80 0.793 0.176 0.064 DTD 876.5 5223.85 3484.25 DTD10.176 0.041 0.017
563.8 3484.25 2583.00 0.064 0.017 0.009 The augmented estimated weight vector wis given as
w00.014 ww 1DT D1 DT Y 0 . 0 8 2
w2 0.45 The bias term is therefore bw00.014, and the fitted model is
Y0.0140.082X10.45X2
Figure 27.4 shows the fitted hyperplane. It also shows the residual error for each point. The white colored points have positive residuals i.e., oi0 or y iyi , whereas the gray points have negative residual values i.e., oi0 or yiy.
The SSE value is given as
SSEoi2 yi yi2 6.179
i1 i1
Thus, the estimated variance and standard error of regression are given as
2 SSE

6.179 2
150 150
nd1 S S E
4.20310
4 . 2 0 31 020 . 2 0 5
1 4 7
nd1
The values of total and residual sum of squares are given as
TSS86.78 RSS80.60
We can observe that TSSSSERSS. The fraction of variance explained by the
model, that is the R2 value, is given as
R2RSS80.600.929
TSS 86.78
This indicates a very good fit of the multiple linear regression model. Nevertheless, it
makes sense to also consider the adjusted R2a value
R2a 1 n1SSE 1 1496.179 0.928
nd 1TSS 14786.78 The adjusted value is almost the same as the R2 value.

Multiple Regression
741
x1
Y

X2
o
X1 Y
U1
xn
U2
x2
Figure 27.5. Geometry of multiple regression. The figure shows two centered predictor variables X1 and X2, along with the corresponding orthogonal basis vectors U1 and U2. The subspace 1 is not shown. The dimensionality of the error space, containing the vector o, is nd1, which also specifies the degrees of
freedom for the estimated variance2 .
Geometry of Goodness of Fit
In multiple regression there are d predictor attributes X1,X2, ,Xd. We can center them by subtracting their projection along the vector 1 to obtain the centered predictor vectors Xi . Likewise, we can center the response vector Y and the predicted vector Y. Thus, we have
X iX i X i1 YY Y1 YY Y1
Once Y, Y and Xi s have been centered, they all lie in the n1 dimensional subspace orthogonal to the vector 1. Figure 27.5 shows this n1 dimensional subspace. In this subspace, we first extract an orthogonal basis U1,U2, ,Ud via the GramSchmidt orthogonalization process outlined in Section 23.3.1, and the predicted response vector is the sum of the projections of Y onto each of the new basis vectors see Eq. 23.23.
The centered vectors Y and Y, and the error vector o form a right triangle, and thus, by the Pythagoras theorem, we have
2 2 2 22 Y Y oY YY
27.24
TSSRSSSSE
The correlation between Y and Y is the cosine of the angle betweenY andY, which
is also given as the ratio of the base to the hypotenuse
Y
YY Yc o s

742
Regression Evaluation
Furthermore, by Eq. 27.4, the coefficient of multiple determination is given as
2 R2RSSY 2
TSS2 YY Y
Example 27.7 Geometry of Goodness of Fit. Continuing Example 27.6, the corre
lation between Y and Y is given as

80.60
86.78
The angle between Y and Y is given as
cos10.96415.5
The relatively small angle indicates a good linear fit.

Y RSS
Y Yc o s
Y TSS
0 . 9 6 4
27.2.2 Inferences about Regression Coefficients LetYbetheresponsevectoroverallobservations.Letww0,w1,w2,,wdT bethe
estimated vector of regression coefficients, computed asT 1 T
w D D D Y The expected value of wis given as follows:
EwE DT D1 DT YDT D1 DTEYD TD 1D T ED D TD 1D TD
since E0. Thus, wis an unbiased estimator for the true regression coefficients vector .
Next, we compute the covariance matrix for w, as follows
covwcovDTD1DTY, letting ADTD1DT, we get
covAYA covYAT
A 2IAT
DT D1 DT 2ID DT D1
2 DT D1DT DDT D1 2DT D1
2 7 . 2 5
Here, we made use of the fact that ADTD1DT is a matrix of fixed values, and therefore covAYA covYAT . Also, we have covY 2I, where I is the nn identity matrix. This follows from the fact that the observed responses yis are all independent and have the same variance 2.

Multiple Regression 743 Note that DTD Rd1d1 is the uncentered scatter matrix for the augmented
data. Let C denote the inverse of DTD. That is
DT D1C
Therefore, the covariance matrix for wcan be written as covw 2C
In particular, the diagonal entries 2 cii give the variance for each of the regression coefficient estimates including for bw0 , and their squared roots specify the standard errors. That is
varwi2 cii sewivarwi cii
We can now define the standardized variable Zwi that can be used to derive the
confidence intervals for i as follows
wi Ewi wi i
c 27.26
Zwi sew
i ii
where we have replaced the unknown true variance2 by2 . Each of the variables Zwi follows a t distribution with nd1 degrees of freedom, from which we can obtain the 1001 confidence interval of the true value i as follows:
Pwit2sewii wit2sewi1
Here, t2 is the critical value of the t distribution, with nd1 degrees of freedom,
that encompasses 2 fraction of the probability mass in the right tail, given as
PZt2 orequivalentlyTnd1t21 22
Example 27.8 Confidence Intervals. Continuing with multiple regression from Example 27.6, we have
24 . 2 0 31 02
0.793 0.176 0.064 CDTD10.176 0.041 0.017
0.064 0.017 0.009
Therefore, the covariance matrix of the estimated regression parameters, including
the bias term, is given as
covw 2 C7.379103
3.333102 2.678103
7.379103 1.714103 7.012104
2.678103 7.012104
3.775104

744 Regression Evaluation
The diagonal entries specify the variances and standard errors of each of the estimated parameters
varb3.333102 varw1 1.714103 varw2 3.775104
seb3.3331020.183 sew1 1.7141030.0414 sew2 3.7751040.0194
where bw0.
Using confidence level 10.95 or significance level 0.05, the critical
value of the tdistribution that encompasses 0.025 fraction of the probability 2
mass in the right tail is given as t21.976. Thus, the 95 confidence intervals for the true bias term , and the true regression coefficients 1 and 2, are:
bt2seb0.0140.074, 0.0140.0740.088, 0.06
1w1 t2 sew1 0.0820.0168,0.0820.01680.099, 0.065
2w2 t2 sew2 0.450.00787,0.450.007870.442, 0.458
27.2.3 Hypothesis Testing
Once the parameters have been estimated, it is beneficial to test whether the regression coefficients are close to zero or substantially different. For this we set up the null hypothesis that all the true weights are zero, except for the bias term 0. We contrast the null hypothesis with the alternative hypothesis that at least one of the weights is not zero
H0: 1 0,2 0,,d 0 Ha: i, suchthati 0
The null hypothesis can also be written as H0 : 0, where 1,2, ,d T.
We use the Ftest that compares the ratio of the adjusted RSS value to the
estimated variance2 , defined via the f statistic RSSd RSSd
f 2S S E nd1 2 7 . 2 7
Under the null hypothesis, we have ERSSd2

Multiple Regression 745
To see this, let us examine the regression equations in vector terms, namely Yb1w 1X 1. . .w dX d
YYw1X1 wdXd1w1X1wdXd Y Y1w 1X 1 X 11 . . .w dX d X d1, w h i c h i m p l i e s
d i1
Yw1X1w2X2wdXdLet us consider the RSS value; we have
wiXi
2 2 T RSSYY1 Y YY
d
Td j1
d d
wjXj
wiwjXiTXj wTDTDw

i1
wiXi
where ww1 , w2 ,, wd T is the
without the bias term, and DRnd is the centered data matrix without augmen tation by the X01 attribute. The expected value of RSS is thus given as
ERSSEwTDT Dw
trEwTDT Dw, since, EwTDT Dw is a scalar
EtrwTDT Dw
EtrDT DwwT, since trace is invariant under cyclic permutation
t r DT DEw w T
t r DT D c o vw Ew EwT2 7 . 2 8 trDT Dcovw , since under the null hypothesis Ew0
t r DT D 2DT D12 t rI d d 2
where Id is the dd identity matrix. We also used the fact that
covwEwwTEwEwT, and therefore EwwT covwEwEwT
Notice that from Eq. 27.25, the covariance matrix for the augmented weight vector w, t h a t i n c l u d e s t h e b i a s t e r m , i s g i v e n a s2DT D1 . H o w e v e r , s i n c e w e a r e i g n o r i n g the bias bw0 in the hypothesis test, we are interested only in the lower right dd submatrix of DTD1, which excludes the values related to w0. It can be shown that this submatrix is precisely the inverse of the centered scatter matrix DT D1 for the unaugmented data. We used this fact in the derivation above. Therefore, it follows that
ERSS 1 ERSS1d 2 2 ddd
i1 j1
d dimensional vector of regression coefficients

746 Regression Evaluation
Further, as per Eq. 27.21, the estimated variance is an unbiased estimator, so that E22
Thus, under the null hypothesis the f statistic has a value close to 1, which indicates that there is no relationship between the predictor and response variables. On the other hand, if the alternative hypothesis is true, then ERSSd2, resulting in a larger f value.
The ratio f follows a F distribution with d , nd1 degrees of freedom for the numerator and denominator, respectively. Therefore, we can reject the null hypothesis if the pvalue is less than the chosen significance level, say 0.01.
Noticethat,sinceR21SSE RSS,wehave TSS TSS
SSE1R2TSS RSSR2 TSS Therefore, we can rewrite the f ratio as follows
RSSd nd1 R2
fSSEnd1d1R2
27.29
In other words, the F test compares the adjusted fraction of explained variation to the unexplained variation. If R2 is high, it means the model can fit the data well, and that is more evidence to reject the null hypothesis.
Hypothesis Testing for Individual Parameters
We can also test whether each independent attribute Xi, contributes significantly for the prediction of Y or not, assuming that all the attributes are still retained in the model.
For attribute Xi , we set up the null hypothesis H0 : i0 and contrast it with the alternative hypothesis Ha : i0. Using Eq. 27.26, the standardized variable Zwi under the null hypothesis is given as
wi Ewi wi wi
c 27.30
Zwisew sew
i i ii
since Ewi i0. Next, using a twotailed t test with nd1 degrees of freedom, we compute pvalueZwi . If this probability is smaller than the significance levelsay 0.01, we can reject the null hypothesis. Otherwise, we accept the null hypothesis, which would imply that Xi does not add significant value in predicting the response in light of other attributes already used to fit the model. The ttest can also be used to test whether the bias term is significantly different from 0 or not.
Example 27.9 Hypothesis Testing. We continue with Example 27.8, but now we test for the regression effect. Under the null hypothesis that 120, the expected value of RSS is2 . Thus, we expect the f statistic to be close to 1. Let us check if that

Multiple Regression 747
is the case; we have
fRSSd80.602958.82 4.203102
Using the F distribution with d , nd12, 147 degrees of freedom, we have pvalue958.80
In other words, such a large value of the f statistic is extremely rare, and therefore, we can reject the null hypothesis. We conclude that Y does indeed depend on at least one of the predictor attributes X1 or X2.
We can also test for each of the regression coefficients individually using the t test. For example, for 1 , let the null hypothesis be H0 : 10, so that the alternative hypothesis is Ha : 10. Assuming that the model still has both X1 and X2 as the predictor variables, we can compute the tstatistic using Eq. 27.26:
Zw1w10.0821.98 sew10.0414
Using a twotailed t test with nd1147 degrees of freedom, we find that pvalue1.980.0496
Since the pvalue is only marginally less than a significance level of 0.05 i.e., a 95 confidence level, this means that X1 is only weakly relevant for predicting Y in the presence of X2. In fact, if we use the more stringent significance level 0.01, we would conclude that X1 is not significantly predictive of Y, given X2.
On the other hand, individually for 2, if we test whether H0 : 20 versus Ha :2 0,wehave:
Zw2w20.45 23.2 sew20.0194
Using a twotailed t test with nd1147 degrees of freedom, we find that pvalue23.20
This means, that individually X2 is significantly predictive of Y even in the presence of X1.
Using the ttest, we can also compute the pvalue for the bias term: Zbb0.0140.077
which has a pvalue0.94 for a twotailed test. This means, we accept the null hypothesis that 0, and reject the alternative hypothesis that 0.
seb 0.183
Example 27.10 Centered Scatter Matrix. Here we show that the inverse of the centered scatter matrix DTD1 for the unaugmented data is the lower right

748 Regression Evaluation
submatrix of the inverse of the uncentered scatter matrix DTD1 for the augmented data.
For the augmented data comprising X0, X1 and X2, from Example 27.6, the uncentered33scattermatrixD TDanditsinverse,aregivenas
150.0 876.50 563.80 0.793 0.176 0.064 DTD 876.5 5223.85 3484.25 DTD10.176 0.041 0.017
563.8 3484.25 2583.00 0.064 0.017 0.009 For the unaugmented data comprising only X1 and X2, the centered scatter
matrix and its inverse are given as:
DT D 102.17 189.78 DT D1 0.041 0.017
189.78 463.86 0.017 0.009 WecanobservethatDTD1 isexactlythelowerright22submatrixofD TD 1.
27.2.4 Geometric Approach to Statistical Testing
The geometry of multiple regression provides further insight into the hypothesis testing approach for the regression effect. Let XiXiXi1 denote the centered attributevector,andletXX1,X2,,XdT denotethemultivariatecenteredvector of predictor variables. The ndimensional space over the points is divided into three mutually orthogonal subspaces, namely the 1dimensional mean space Sspan1, the d dimensional centered variable space SXspanX, and the nd1 dimensional error space So, which contains the error vector o. The response vector Y can thus be decomposed into three components see Figures 27.2 and 27.5
Y Y1Yo
Recall that the degrees of freedom of a random vector is defined as the dimensionality of its enclosing subspace. Since the original dimensionality of the point space is n, we have a total of n degrees of freedom. The mean space has dimensionality dimS1, thecenteredvariablespacehasdimSXd,andtheerrorspacehasdimSond 1, so that we have
dimSdimSXdimSo1dnd1n
Population Regression Model
Recall that the regression model posits that for a fixed value xixi1,xi2, ,xid T, the true response yi is given as
y i 1x i 1. . . dx i d i
where the systematic part of the model dj 1 jxij is fixed, and the error term i varies randomly, with the assumption that i follows a normal distribution with mean 0 and variance 2. We also assume that the i values are all independent of each other.

Multiple Regression
749
So
Y
E YX

SX
1
Figure 27.6. Population regression model. The n1 dimensional subspace, comprising SXSo , orthogonal to 1 is shown as a hyperrectangle. The true error vector is a random vector of length ; the n1dimensional hypersphere shows the possible orientations offor a fixed length value.
Plugging Ydj 1 jXj into the equation above, we obtain yi Y 1 xi1 X1d xid Xdi
Y 1 xi1 d xid i
where x ijxijXj is the centered value for attribute Xj . Across all the points, we
can rewrite the above equation in vector form
YY 11 X1 d Xd
We can also center the vector Y, so that we obtain a regression model over the centered response and predictor variables, as illustrated in Figure 27.6:
Y YY 11 X1 2 X2 d Xd EYX
In this equation, di1i Xi is a fixed vector that denotes the expected value EYX andis an ndimensional random vector that is distributed according to a ndimensional multivariate normal vector with mean 0, and a fixed variance2 along all dimensions, so that its covariance matrix is2I. The distribution ofis therefore given as
f1expT11 exp2 2n22nn 22
which follows from the fact that detdet2I2n and 11 I. 2
The density ofis thus a function of its squared length 2, independent of its angle. In other words, the vectoris distributed uniformly over all angles and is equally likely to point in any direction. Figure 27.6 illustrates the population regression model. The fixed vector EYX is shown, and one orientation ofis shown. It is important

750 Regression Evaluation
to note that the n1 dimensional hypersphere denotes the fact that the random vectorcan be in any orientation in this hypersphere of radius . Notice how the population regression model differs from the fitted model. The residual error vector o is orthogonal to the predicted mean response vector Y, which acts as the estimate for EYX. On the other hand, in the population regression model, the random error vectorcan be in any orientation compared to EYX.
Hypothesis Testing
Consider the population regression model
Y Y1 1X 1. . . dX d Y1E YX
To test whether X1,X2, ,Xd are useful in predicting Y, we consider what would happen if all of the regression coefficients were zero, which forms our null hypothesis
H0: 1 0,2 0,,d 0 YY 1YY 1Y
Sinceis normally distributed with mean 0 and covariance matrix2I, under the null hypothesis, the variation in Y for a given value of x will therefore be centered around the origin 0.
On the other hand, under the alternative hypothesis Ha that at least one of the i is nonzero, we have
YE YX
Thus, the variation inY is shifted away from the origin 0 in the direction EYX.
In practice, we obviously do not know the true value of EYX, but we can estimate it by projecting the centered observation vectorY onto the subspaces SX and
So , as follows
Yw 1X 1w 2X 2. . .w dX doYo
Now, under the null hypothesis, the true centered response vector is Y, and therefore, Y and o are simply the projections of the random error vectoronto the subspaces SX and So, as shown in Figure 27.7a. In this case, we also expect the length of o and Y to be roughly comparable. On the other hand, under the alternative hypothesis, we have YEY X, and so Y will be relatively much longer compared to o, as shown in Figure 27.7b.
Given that we expect to see a difference between Y under the null and alternative hypotheses, this suggests a geometric test based on the relative lengths of Y and o since we do not know the true EYX or. However, there is one difficulty; we cannot compare their lengths directly, since Y lies in a d dimensional space, whereas o lies in a nd1 dimensional space. Instead, we can compare their lengths after normalizing
In this case, we have

Multiple Regression
751
So

o
Y 1
SX
So
a Null hypothesis: Y
Y Y
SX
by the number of dimensions. Define the mean squared length of per dimension for
the two vectorsY and o, as follows
2 2Y Y
1
b Alternative hypothesis
Figure 27.7. Geometric test for regression effect: Null versus alternative hypothesis.
MY dimSX d
o 2 o 2
Mo dimSond1
Thus, the geometric test for the regression effect is the ratio of the normalized mean
squared length ofY to o, given as
2
MYYd Mo o2nd1

o

752
Regression Evaluation
2 It is interesting to note that from Eq. 27.24 we have Y
2
RSS and o
YY2SSE, and therefore, the geometric ratio test is identical to the Ftest in Eq. 27.27, since
2 MY Yd RSSd
Moo2nd1SSEnd1 f
The geometric approach, illustrated in Figure 27.7, makes it clear that if f1 then the null hypothesis holds, and we conclude that Y does not depend on the predictor variables X1,X2, ,Xd. On the other hand, if f is large, with a pvalue less than the significance level say, 0.01, then we can reject the null hypothesis and accept the alternative hypothesis that Y depends on at least one predictor variable Xi .
27.3 FURTHER READING
For a geometrical approach to multivariate statistics see Wickens 2014, and Saville and Wood 2012. For an excellent treatment of modern statistical inference in the context of regression see Devore and Berk 2012.
Devore, J. and Berk, K. 2012. Modern Mathematical Statistics with Applications. 2nd ed. New York: Springer ScienceBusiness Media.
Saville, D. J. and Wood, G. R. 2012. Statistical Methods: The Geometric Approach. New York: Springer ScienceBusiness Media.
Wickens, T. D. 2014. The Geometry of Multivariate Statistics. New York: Psychology Press, TaylorFrancis Group.
27.4 EXERCISES
Q1. Showthatforbivariateregression,wehaveYX,whereandarethe
multiple regression.
true model parameters.
Q2. Showthatn yi yiyi Y0.
i1
Q3. ProvethatoisorthogonaltoYandX.Showthisforbivariateregressionandthenfor
Q4. Show that for bivariate regression, the R2 statistic is equivalent to the squared correlation between the independent attribute vector X and response vector Y. That is, show that R2X2 Y.
Q5. ShowthatYYY1.
Q6. Showthato YYYY .
Q7. Inbivariateregression,showthatunderthenullhypothesis,ERSS2.
Q8. Inbivariateregression,showthatE22.
Q9. ShowthatERSSd2.

Exercises 753
Q10. Treating each residual oiyiyi as a random variable, show that varoi21hii
where hii is the ith diagonal of the d1d1 hat matrix H for the augmented data D . Next, using the above expression for varoi, show that for bivariate regression, the variance of the ith residual is given as
varoi211 1 xiX2 n sX
Q11. Given data matrix D, let D the centered data matrix, and Dthe augmented data matrix with the extra column X01. Let DTD be the uncentered scatter matrix for the augmented data, and let DT Dbe the scatter matrix for the centered data. Show that the lower right dd submatrix of DTD1 is DT D1.

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] R C algorithm deep learning math scala database graph statistic network CHAPTER 23 Linear Regression
$25