Econometrics (M.Sc.)
Prof. Dr. Dominik Liebl
2021-02-07

Contents

Preface
  Organization of the Course
  Literature

1 Review: Probability and Statistics
  1.1 Probability Theory
  1.2 Random Variables

2 Review: Simple Linear Regression
  2.1 The Simple Linear Regression Model
  2.2 Ordinary Least Squares Estimation
  2.3 Properties of the OLS Estimator

3 Multiple Linear Regression
  3.1 Assumptions
  3.2 Deriving the Expression of the OLS Estimator
  3.3 Some Quantities of Interest
  3.4 Method of Moments Estimator
  3.5 Unbiasedness of $\hat{\beta}\,|\,X$
  3.6 Variance of $\hat{\beta}\,|\,X$
  3.7 The Gauss-Markov Theorem

4 Small Sample Inference
  4.1 Hypothesis Tests about Multiple Parameters
  4.2 Tests about One Parameter
  4.3 Test Theory
  4.4 Type II Error and Power
  4.5 p-Value
  4.6 Confidence Intervals
  4.7 Practice: Small Sample Inference

5 Large Sample Inference
  5.1 Tools for Asymptotic Statistics
  5.2 Asymptotics under the Classic Regression Model
  5.3 Robust Confidence Intervals
  5.4 Practice: Large Sample Inference

6 Maximum Likelihood
  6.1 Likelihood Principle
  6.2 Properties of Maximum Likelihood Estimators
  6.3 The (Log-)Likelihood Function
  6.4 Optimization: Non-Analytical Solutions
  6.5 OLS-Estimation as ML-Estimation
  6.6 Variance of ML-Estimators and $s^2_{ML}$
  6.7 Consistency of ML and $s^2_{ML}$
  6.8 Asymptotic Theory of Maximum-Likelihood Estimators

7 Instrumental Variables Regression
  7.1 The IV Estimator with a Single Regressor and a Single Instrument
  7.2 The General IV Regression Model
  7.3 Checking Instrument Validity
  7.4 Application to the Demand for Cigarettes

Bibliography

Preface
Organization of the Course
Etherpad: Throughout the course, I will use the following etherpad for collecting important links and further information (Zoom-meeting link etc.): https://etherpad.wikimedia.org/p/8e2yF6X6FCbqep2AxW1b
eWhiteboard: Besides this script, I will use an eWhiteboard during the lecture. This material will be saved as PDF files and provided as accompanying lecture materials.
Lecture Materials: You can find all lecture materials at eCampus or directly at sciebo: https://uni-bonn.sciebo.de/s/iOtkEDajrqWKT8v
Literature
A guide to modern econometrics, by M. Verbeek
Introduction to econometrics, by J. Stock and M.W. Watson
(E-Book: https://bonnus.ulb.uni-bonn.de/SummonRecord/FETCH-bonn_catalog_45089983)
Econometric theory and methods, by R. Davidson and J.G. MacKinnon
A primer in econometric theory, by J. Stachurski
Econometrics, by F. Hayashi

Chapter 1
Review: Probability and Statistics
1.1 Probability Theory
Probability is the mathematical language for quantifying uncertainty. We can apply probability theory to a diverse set of problems, from coin flipping to the analysis of econometric problems. The starting point is to specify the sample space, that is, the set of possible outcomes.
1.1.1 Sample Spaces and Events
The sample space $\Omega$ is the set of possible outcomes of an experiment. Points $\omega \in \Omega$ are called sample outcomes or realizations. Events are subsets of $\Omega$.

Example. If we toss a coin twice then $\Omega = \{HH, HT, TH, TT\}$. The event that the first toss is heads is $A = \{HH, HT\}$.

Example. Let $\omega$ be the outcome of a measurement of some physical quantity, for example, temperature. Then $\Omega = \mathbb{R} = (-\infty, \infty)$. The event that the measurement is larger than 10 but less than or equal to 23 is $A = (10, 23]$.
Example. If we toss a coin forever then the sample space is the infinite set $\Omega = \{\omega = (\omega_1, \omega_2, \omega_3, \dots) \,|\, \omega_i \in \{H, T\}\}$. Let $A$ be the event that the first head appears on the third toss. Then $A = \{(\omega_1, \omega_2, \omega_3, \dots) \,|\, \omega_1 = T, \omega_2 = T, \omega_3 = H, \omega_i \in \{H, T\} \text{ for } i > 3\}$.
Given an event $A$, let $A^c = \{\omega \in \Omega : \omega \notin A\}$ denote the complement of $A$. Informally, $A^c$ can be read as "not $A$". The complement of $\Omega$ is the empty set $\emptyset$. The union of events $A$ and $B$ is defined as
$$A \cup B = \{\omega \in \Omega \,|\, \omega \in A \text{ or } \omega \in B \text{ or } \omega \in \text{both}\},$$
which can be thought of as "$A$ or $B$". If $A_1, A_2, \dots$ is a sequence of sets then
$$\bigcup_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for at least one } i\}.$$
The intersection of $A$ and $B$ is defined as
$$A \cap B = \{\omega \in \Omega : \omega \in A \text{ and } \omega \in B\},$$
which reads as "$A$ and $B$". Sometimes $A \cap B$ is also written shortly as $AB$. If $A_1, A_2, \dots$ is a sequence of sets then
$$\bigcap_{i=1}^{\infty} A_i = \{\omega \in \Omega : \omega \in A_i \text{ for all } i\}.$$
If every element of $A$ is also contained in $B$ we write $A \subset B$ or, equivalently, $B \supset A$. If $A$ is a finite set, let $|A|$ denote the number of elements in $A$. We say that $A_1, A_2, \dots$ are disjoint or mutually exclusive if $A_i \cap A_j = \emptyset$ whenever $i \neq j$. For example, $A_1 = [0,1), A_2 = [1,2), A_3 = [2,3), \dots$ are disjoint. A partition of $\Omega$ is a sequence of disjoint sets $A_1, A_2, \dots$ such that
$$\bigcup_{i=1}^{\infty} A_i = \Omega.$$

Summary: Sample space and events

$\Omega$           sample space
$\omega$           outcome
$A$                event (subset of $\Omega$)
$|A|$              number of points in $A$ (if $A$ is finite)
$A^c$              complement of $A$ (not $A$)
$A \cup B$         union ($A$ or $B$)
$A \cap B$         intersection ($A$ and $B$); short notation: $AB$
$A \subset B$      set inclusion ($A$ is a subset of or equal to $B$)
$\emptyset$        null event (always false)
$\Omega$           true event (always true)
1.1.2 Probability
We want to assign a real number $P(A)$ to every event $A$, called the probability of $A$. We also call $P$ a probability distribution or a probability measure. To qualify as a probability, $P$ has to satisfy three axioms. That is, a function $P$ that assigns a real number $P(A) \in [0,1]$ to each event $A$ is a probability distribution or a probability measure if it satisfies the following three axioms:

Axiom 1: $P(A) \geq 0$ for every $A$.
Axiom 2: $P(\Omega) = 1$.
Axiom 3: If $A_1, A_2, \dots$ are disjoint then
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$

Note: It is not always possible to assign a probability to every event $A$ if the sample space is large, such as, for instance, the whole real line, $\Omega = \mathbb{R}$. In case of $\Omega = \mathbb{R}$ strange things can happen. There are pathological sets that simply break down the mathematics. An example of such pathological sets, also known as non-measurable sets because they literally cannot be measured (i.e. we cannot assign probabilities to them), are the Vitali sets. Therefore, in cases like $\Omega = \mathbb{R}$, we assign probabilities only to a limited class of sets called a $\sigma$-field or $\sigma$-algebra. For $\Omega = \mathbb{R}$, the canonical $\sigma$-algebra is the Borel $\sigma$-algebra. The Borel $\sigma$-algebra on $\mathbb{R}$ is generated by the collection of all open subsets of $\mathbb{R}$.
One can derive many properties of P from the axioms. Here are a few:
$P(\emptyset) = 0$
$A \subset B \Rightarrow P(A) \leq P(B)$
$0 \leq P(A) \leq 1$
$P(A^c) = 1 - P(A)$
$A \cap B = \emptyset \Rightarrow P(A \cup B) = P(A) + P(B)$

A less obvious property is given in the following: For any events $A$ and $B$ we have that
$$P(A \cup B) = P(A) + P(B) - P(AB).$$
Example. Two coin tosses. Let $H_1$ be the event that heads occurs on toss 1 and let $H_2$ be the event that heads occurs on toss 2. If all outcomes are equally likely, that is, $P(\{H_1, H_2\}) = P(\{H_1, T_2\}) = P(\{T_1, H_2\}) = P(\{T_1, T_2\}) = 1/4$, then
$$P(H_1 \cup H_2) = P(H_1) + P(H_2) - P(H_1 H_2) = \frac{1}{2} + \frac{1}{2} - \frac{1}{4} = \frac{3}{4}.$$
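As a quick numerical check (an illustrative sketch added here, not part of the original script), the inclusion-exclusion computation can be verified in R by enumerating the four equally likely outcomes:

## Enumerate the four equally likely outcomes of two coin tosses
outcomes <- c("HH", "HT", "TH", "TT")
p <- rep(1/4, length(outcomes))            # uniform probabilities
H1 <- substr(outcomes, 1, 1) == "H"        # heads on toss 1
H2 <- substr(outcomes, 2, 2) == "H"        # heads on toss 2
sum(p[H1 | H2])                            # P(H1 union H2) = 0.75
sum(p[H1]) + sum(p[H2]) - sum(p[H1 & H2])  # inclusion-exclusion, also 0.75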

Probabilities as frequencies. One can interpret $P(A)$ in terms of frequencies. That is, $P(A)$ is the (infinitely) long run proportion of times that $A$ is true in repetitions. For example, if we say that the probability of heads is 1/2, i.e. $P(H) = 1/2$, we mean that if we flip the coin many times then the proportion of times we get heads tends to 1/2 as the number of tosses increases. An infinitely long, unpredictable sequence of tosses whose limiting proportion tends to a constant is an idealization, much like the idea of a straight line in geometry.

The following R code approximates the probability $P(H) = 1/2$ using 1, 10, and 100,000 (pseudo) random coin flips:

set.seed(869)

## 1 (fair) coin-flip:
results <- sample(x = c("H", "T"), size = 1)
## Relative frequency of "H" in 1 coin-flip
length(results[results == "H"]) / 1
#> [1] 1

## 10 (fair) coin-flips:
results <- sample(x = c("H", "T"), size = 10, replace = TRUE)
## Relative frequency of "H" in 10 coin-flips
length(results[results == "H"]) / 10
#> [1] 0.3

## 100000 (fair) coin-flips:
results <- sample(x = c("H", "T"), size = 100000, replace = TRUE)
## Relative frequency of "H" in 100000 coin-flips
length(results[results == "H"]) / 100000
#> [1] 0.50189

1.1.3 Independent Events
If we flip a fair coin twice, then the probability of two heads is $\frac{1}{2} \cdot \frac{1}{2}$. We multiply the probabilities because we regard the two tosses as independent. Two events $A$ and $B$ are called independent if
$$P(AB) = P(A)P(B).$$
Or more generally, a whole set of events $\{A_i \,|\, i \in I\}$ is independent if
$$P\left(\bigcap_{i \in J} A_i\right) = \prod_{i \in J} P(A_i)$$
for every finite subset $J$ of $I$, where $I$ denotes the (not necessarily finite) index set (e.g. $I = \{1, 2, \dots\}$).
Independence can arise in two distinct ways. Sometimes, we explicitly assume that two events are independent. For example, in tossing a coin twice, we usually assume the tosses are independent which reflects the fact that the coin has no memory of the first toss.
In other instances, we derive independence by verifying that the definition of independence, $P(AB) = P(A)P(B)$, holds. For example, in tossing a fair die once, let $A = \{2, 4, 6\}$ be the event of observing an even number and let $B = \{1, 2, 3, 4\}$ be the event of observing no 5 and no 6. Then $A \cap B = \{2, 4\}$ is the event of observing either a 2 or a 4. Are the events $A$ and $B$ independent?
$$P(AB) = \frac{2}{6} = \frac{1}{3} = P(A)P(B) = \frac{1}{2} \cdot \frac{2}{3}$$
and so $A$ and $B$ are independent. In this case, we didn't assume that $A$ and $B$ are independent; it just turned out that they were.
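A short R sketch (added here for illustration, not part of the original script) that checks this by enumerating the six equally likely faces of the die:

faces <- 1:6
p <- rep(1/6, 6)              # fair die
A <- faces %in% c(2, 4, 6)    # even number
B <- faces %in% 1:4           # no 5 and no 6
sum(p[A & B])                 # P(AB)    = 1/3
sum(p[A]) * sum(p[B])         # P(A)P(B) = 1/2 * 2/3 = 1/3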

Cautionary Notes. Suppose that $A$ and $B$ are disjoint events (i.e. $AB = \emptyset$), each with positive probability (i.e. $P(A) > 0$ and $P(B) > 0$).
Can they be independent? No! This follows since $P(AB) = P(\emptyset) = 0 \neq P(A)P(B) > 0$.
Except in this special case, there is no way to judge (in-)dependence by looking at the sets in a Venn diagram.
Summary: Independence
1. A and B are independent if P (AB) = P (A)P (B).
2. Independence is sometimes assumed and sometimes derived.
3. Disjoint events with positive probability are not independent.
1.1.4 Conditional Probability
If $P(B) > 0$ then the conditional probability of $A$ given $B$ is
$$P(A \,|\, B) = \frac{P(AB)}{P(B)}.$$
Think of $P(A \,|\, B)$ as the fraction of times $A$ occurs among those in which $B$ occurs. Here are some facts about conditional probabilities:

The rules of probability apply to events on the left of the bar $|$. That is, for any fixed $B$ such that $P(B) > 0$, $P(\cdot \,|\, B)$ is a probability, i.e. it satisfies the three axioms of probability: $P(A \,|\, B) \geq 0$, $P(\Omega \,|\, B) = 1$, and if $A_1, A_2, \dots$ are disjoint then $P\left(\bigcup_{i=1}^{\infty} A_i \,|\, B\right) = \sum_{i=1}^{\infty} P(A_i \,|\, B)$.

But it is generally not true that $P(A \,|\, B \cup C) = P(A \,|\, B) + P(A \,|\, C)$.

In general it is also not the case that $P(A \,|\, B) = P(B \,|\, A)$. People get this confused all the time. For example, the probability of spots given you have measles is 1, but the probability that you have measles given that you have spots is not 1. In this case, the difference between $P(A \,|\, B)$ and $P(B \,|\, A)$ is obvious, but there are cases where it is less obvious. This mistake is made often enough in legal cases that it is sometimes called the prosecutor's fallacy.
Example. A medical test for a disease $D$ has outcomes $+$ and $-$. The probabilities are:

          $D$       $D^c$     Total
$+$       .0081     .0900     .0981
$-$       .0009     .9010     .9019
Total     .0090     .9910     1

From the definition of conditional probability, we have that:

Sensitivity of the test:
$$P(+ \,|\, D) = P(+ \cap D)/P(D) = 0.0081/(0.0081 + 0.0009) = 0.9$$

Specificity of the test:
$$P(- \,|\, D^c) = P(- \cap D^c)/P(D^c) = 0.9010/(0.9010 + 0.0900) \approx 0.9$$

Apparently, the test is fairly accurate. Sick people yield a positive test result 90 percent of the time and healthy people yield a negative test result about 90 percent of the time. Suppose you go for a test and get a positive result. What is the probability you have the disease? Most people answer 0.90 = 90%. The correct answer is $P(D \,|\, +) = P(+ \cap D)/P(+) = 0.0081/(0.0081 + 0.0900) \approx 0.08$. The lesson here is that you need to compute the answer numerically. Don't trust your intuition.
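A small R sketch (added for illustration; the numbers are simply those from the table above) that reproduces these conditional probabilities:

## Joint probabilities from the table above
p_pos_D  <- 0.0081   # P(+ and D)
p_pos_Dc <- 0.0900   # P(+ and Dc)
p_neg_D  <- 0.0009   # P(- and D)
p_neg_Dc <- 0.9010   # P(- and Dc)

## Sensitivity P(+ | D) and specificity P(- | Dc)
p_pos_D / (p_pos_D + p_neg_D)     # 0.9
p_neg_Dc / (p_neg_Dc + p_pos_Dc)  # about 0.91

## Probability of disease given a positive test, P(D | +)
p_pos_D / (p_pos_D + p_pos_Dc)    # about 0.08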

If A and B are independent events then
$$P(A \,|\, B) = \frac{P(AB)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).$$
So another interpretation of independence is that knowing $B$ doesn't change the probability of $A$.
From the definition of conditional probability we can write
P(AB) = P(A | B)P(B) and also P(AB) = P(B | A)P(A). Often, these formulas give us a convenient way to compute P(AB) when A
and B are not independent.
Note, sometimes P(AB) is written as P(A,B).
Example. Draw two cards from a deck, without replacement. Let $A$ be the event that the first draw is the Ace of Clubs and let $B$ be the event that the second draw is the Queen of Diamonds. Then $P(A, B) = P(A)P(B \,|\, A) = (1/52) \cdot (1/51)$.
Summary: Conditional Probability
1. If P(B) > 0 then P(A | B) = P(AB)/P(B)
2. $P(\cdot \,|\, B)$ satisfies the axioms of probability, for fixed $B$. In general, $P(A \,|\, \cdot)$ does not satisfy the axioms of probability, for fixed $A$.
3. In general, $P(A \,|\, B) \neq P(B \,|\, A)$.
4. A and B are independent if and only if P(A | B) = P(A).

1.2 Random Variables
Statistics and econometrics are concerned with data. How do we link sample spaces, events and probabilities to data? The link is provided by the concept of a random variable. A real-valued random variable is a mapping $X : \Omega \to \mathbb{R}$ that assigns a real number $X(\omega) \in \mathbb{R}$ to each outcome $\omega \in \Omega$.
At a certain point in most statistics/econometrics courses, the sample space, $\Omega$, is rarely mentioned and we work directly with random variables. But you should keep in mind that the sample space is really there, lurking in the background.
Example. Flip a coin ten times. Let $X(\omega)$ be the number of heads in the sequence $\omega$. For example, if $\omega = HHTHHTHHTT$ then $X(\omega) = 6$.

Example. Let $\Omega = \{(x, y) \,|\, x^2 + y^2 \leq 1\}$ be the unit disc. Consider drawing a point at random from $\Omega$. A typical outcome is then of the form $\omega = (x, y)$. Some examples of random variables are $X(\omega) = x$, $Y(\omega) = y$, $Z(\omega) = x + y$, $W(\omega) = \sqrt{x^2 + y^2}$.

Given a real-valued random variable $X \in \mathbb{R}$ and a subset $A$ of the real line ($A \subset \mathbb{R}$), define $X^{-1}(A) = \{\omega \in \Omega \,|\, X(\omega) \in A\}$. This allows us to link the probabilities on the random variable $X$, i.e. the probabilities we are usually working with, to the underlying probabilities on the events, i.e. the probabilities lurking in the background.
Example. Flip a coin twice and let $X$ be the number of heads. Then $P_X(X = 0) = P(\{TT\}) = 1/4$, $P_X(X = 1) = P(\{HT, TH\}) = 1/2$ and $P_X(X = 2) = P(\{HH\}) = 1/4$. Thus, the events and their associated probability distribution, $P$, and the random variable $X$ and its distribution, $P_X$, can be summarized as follows:

$\omega$    $P(\{\omega\})$    $X(\omega)$
TT          1/4                0
TH          1/4                1
HT          1/4                1
HH          1/4                2

$x$    $P_X(X = x)$
0      1/4
1      1/2
2      1/4

Here, $P_X$ is not the same probability function as $P$ because $P$ maps from the sample space events, $\omega$, to $[0, 1]$, while $P_X$ maps from the random-variable events, $X(\omega)$, to $[0, 1]$. We will typically forget about the sample space and just think of the random variable as an experiment with real-valued (possibly multivariate) outcomes. We will therefore write $P(X = x_k)$ instead of $P_X(X = x_k)$ to simplify the notation.
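The distribution of $X$ can also be approximated by simulation. The following R sketch (added for illustration, not part of the original script) draws many realizations of two coin flips and tabulates the relative frequencies of $X$:

set.seed(1)
R <- 100000
## Number of heads in two fair coin flips, repeated R times
X <- replicate(R, sum(sample(c("H", "T"), size = 2, replace = TRUE) == "H"))
## Relative frequencies; close to the probabilities 1/4, 1/2, 1/4
table(X) / R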
1.2.1 Univariate Distribution and Probability Functions
1.2.1.1 Cumulative Distribution Function

The cumulative distribution function (cdf)
$$F_X : \mathbb{R} \to [0, 1]$$
of a real-valued random variable $X \in \mathbb{R}$ is defined by
$$F_X(x) = P(X \leq x).$$
You might wonder why we bother to define the cdf. The reason is that it effectively contains all the information about the random variable. Indeed, let $X \in \mathbb{R}$ have cdf $F$ and let $Y \in \mathbb{R}$ have cdf $G$. If $F(x) = G(x)$ for all $x \in \mathbb{R}$ then $P(X \in A) = P(Y \in A)$ for all $A \subset \mathbb{R}$. In order to denote that two random variables, here $X$ and $Y$, have the same distribution, one can write shortly $X \stackrel{d}{=} Y$.

Caution: Equality in distribution, $X \stackrel{d}{=} Y$, does generally not mean equality in realizations, that is, $X \stackrel{d}{=} Y \not\Rightarrow X(\omega) = Y(\omega)$ for all $\omega \in \Omega$.

The defining properties of a cdf. A function $F$ mapping the real line to $[0, 1]$, short $F : \mathbb{R} \to [0, 1]$, is called a cdf for some probability measure $P$ if and only if it satisfies the following three properties:

1. $F$ is non-decreasing, i.e. $x_1 < x_2$ implies that $F(x_1) \leq F(x_2)$.
2. $F$ is normalized: $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$.
3. $F$ is right-continuous, i.e. $F(x) = F(x^+)$ for all $x$, where $F(x^+) = \lim_{y \downarrow x} F(y)$.
As an alternative to cumulative distribution functions, one can use probability (mass) functions to describe the probability law of discrete random variables, and density functions to describe the probability law of continuous random variables.

1.2.1.2 Probability Functions for Discrete Random Variables.

A random variable $X$ is discrete if it takes only countably many values $X \in \{x_1, x_2, \dots\}$.
For instance, $X \in \{1, 2, 3\}$ or $X \in \{2, 4, 6, \dots\}$ or $X \in \mathbb{Z}$ or $X \in \mathbb{Q}$.

We define the probability function or probability mass function (pmf) for $X$ by
$$f_X(x) = P(X = x) \quad \text{for all } x \in \{x_1, x_2, \dots\}.$$
1.2.1.3 Density Functions for Continuous Random Variables.
A random variable $X$ is continuous if there exists a function $f_X$ such that

1. $f_X(x) \geq 0$ for all $x$,
2. $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$, and
3. $P(a < X \leq b) = \int_a^b f_X(x)\,dx$.

[...]

The $k$th, $k > 0$, moment is given by
$$\mu_k = E\left[X^k\right] = \int_{-\infty}^{+\infty} x^k f_X(x)\,dx.$$

The $k$th, $k > 1$, central moment is given by
$$c_k = E\left[(X - E[X])^k\right] = \int_{-\infty}^{+\infty} (x - \mu)^k f_X(x)\,dx,$$
where $\mu = E(X)$.

Note. Moments determine the tail of a distribution (but not much else); see Lindsay and Basak (2000). Roughly: the more moments a distribution has, the faster its tails converge to zero. Distributions with compact support (e.g. the uniform distribution $U[a,b]$) have infinitely many moments. The Normal distribution also has infinitely many moments, even though it does not have compact support, since $\phi(x) > 0$ for all $x \in \mathbb{R}$.
1.2.6.1 Law of Total Expectation
As long as we do not fix the values of the conditioning variables, $X_2, \dots, X_d$, they are random variables. Consequently, the conditional mean is generally itself a random variable
$$E(X_1 \,|\, X_2, \dots, X_d) = \int x_1 f(x_1 \,|\, X_2, \dots, X_d)\,dx_1.$$
Note that $f(x_1 \,|\, X_2, \dots, X_d)$ is just a transformation of the random variables $X_2, \dots, X_d$. So we can easily compute the unconditional mean $E(X_1)$ by taking the mean of $E(X_1 \,|\, X_2, \dots, X_d)$ as follows:
$$
\begin{aligned}
E\big(E(X_1 \,|\, X_2, \dots, X_d)\big)
&= \int \left(\int x_1 f(x_1 \,|\, x_2, \dots, x_d)\,dx_1\right) f_{X_2, \dots, X_d}(x_2, \dots, x_d)\,dx_2 \cdots dx_d \\
&= \int x_1 \left(\int \cdots \int f(x_1, x_2, \dots, x_d)\,dx_2 \cdots dx_d\right) dx_1 \\
&= \int x_1 f_{X_1}(x_1)\,dx_1 = E(X_1).
\end{aligned}
$$
The result that $E\big(E(X_1 \,|\, X_2, \dots, X_d)\big) = E(X_1)$ is called the law of total expectation or law of iterated expectations.
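A small simulation sketch in R (illustrative, not from the original script) that checks the law of total expectation for a simple bivariate example where, by construction, $E(X_1 \,|\, X_2) = 2 X_2$:

set.seed(42)
n  <- 1e6
X2 <- runif(n)            # conditioning variable
X1 <- 2 * X2 + rnorm(n)   # so that E(X1 | X2) = 2 * X2
mean(X1)                  # unconditional mean E(X1), about 1
mean(2 * X2)              # mean of the conditional mean E(E(X1|X2)), also about 1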
1.2.7 Independent Random Variables
Random variables $X_1, \dots, X_d$ are mutually independent if for all $x = (x_1, \dots, x_d)'$ it is true that
$$F(x_1, \dots, x_d) = F_1(x_1) \cdot F_2(x_2) \cdots F_d(x_d)$$
$$f(x_1, \dots, x_d) = f_1(x_1) \cdot f_2(x_2) \cdots f_d(x_d)$$
The following holds true:

Two real-valued random variables $X$ and $Y$ are independent from each other if and only if the marginal density of $X$ equals the conditional density of $X$ given $Y = y$ for all $y \in \mathbb{R}$,
$$f_X(x) = f_{X|Y}(x \,|\, y) \quad \text{for all } y \in \mathbb{R}.$$
Of course, the same statement applies to the marginal density of $Y$ given $X = x$ for all $x \in \mathbb{R}$. That is, $X$ and $Y$ are two independent real-valued random variables if and only if $f_Y(y) = f_{Y|X}(y \,|\, x)$ for all $x \in \mathbb{R}$.

If a real-valued random variable $X$ is independent from a real-valued random variable $Y$, then the conditional mean of $X$ given $Y = y$ equals the unconditional mean of $X$ for all $y \in \mathbb{R}$ (i.e. the regression function becomes a constant)
$$E(X \,|\, Y = y) = E(X) \quad \text{for all } y \in \mathbb{R}.$$
Of course, the same statement applies to the conditional mean of $Y$ given $X = x$ for all $x \in \mathbb{R}$; i.e., if $X$ and $Y$ are two independent random variables, then $E(Y \,|\, X = x) = E(Y)$ for all $x \in \mathbb{R}$.

Cautionary note. The properties that $E(X \,|\, Y = y) = E(X)$ for all $y \in \mathbb{R}$, or that $E(Y \,|\, X = x) = E(Y)$ for all $x \in \mathbb{R}$, do not imply that $Y$ and $X$ are independent.
1.2.8 I.I.D. Samples
Tradition dictates that the sample size is denoted by the natural number $n \in \{1, 2, \dots\}$. A random sample is a collection $X = (X_1, \dots, X_n)$ of random variables $X_1, \dots, X_n$. If $X_1, \dots, X_n$ are all independent from each other and if each random variable has the same marginal distribution, we say that the random sample $X = (X_1, \dots, X_n)$ is i.i.d. (independent and identically distributed).
1.2.9 Some Important Discrete Random Variables

1.2.9.1 The Discrete Uniform Distribution

Let $k > 1$ be a given integer. Suppose that $X$ has probability mass function given by
$$f(x) = \begin{cases} 1/k & \text{for } x = 1, \dots, k \\ 0 & \text{otherwise.} \end{cases}$$
We say that $X$ has a uniform distribution on $\{1, \dots, k\}$.

set.seed(51)
## Set the parameter k
k <- 10
## Draw one realization from the discrete uniform distribution
sample(x = 1:k, size = 1, replace = TRUE)
#> [1] 7

1.2.9.2 The Bernoulli Distribution
Let $X$ represent a possibly unfair coin flip. Then $P(X = 1) = p$ and $P(X = 0) = 1 - p$ for some $p \in [0, 1]$. We say that $X$ has a Bernoulli distribution, written $X \sim \text{Bernoulli}(p)$. The probability function is $f(x) = p^x (1 - p)^{1 - x}$ for $x \in \{0, 1\}$.

set.seed(51)
## Set the parameter p
p <- 0.25
## Draw n realizations from the Bernoulli distribution
n <- 5
sample(x = c(0, 1), size = n, prob = c(1 - p, p), replace = TRUE)
#> [1] 1 0 0 1 0

## Alternatively:
## (Bernoulli(p) equals Binomial(1, p))
rbinom(n = n, size = 1, prob = p)
#> [1] 1 1 0 1 0
1.2.9.3 The Binomial Distribution
Suppose we have a coin which falls heads with probability $p$ for some $p \in [0, 1]$. Flip the coin $n$ times and let $X$ be the number of heads (or successes). Assume that the tosses are independent. Let $f(x) = P(X = x)$ be the mass function. It can be shown that
$$f(x) = \begin{cases} \binom{n}{x} p^x (1 - p)^{n - x} & \text{for } x = 0, \dots, n \\ 0 & \text{otherwise.} \end{cases}$$
A random variable with this mass function is called a binomial random variable and we write $X \sim \text{Binomial}(n, p)$. If $X_1 \sim \text{Binomial}(n_1, p)$ and $X_2 \sim \text{Binomial}(n_2, p)$ and if $X_1$ and $X_2$ are independent, then $X_1 + X_2 \sim \text{Binomial}(n_1 + n_2, p)$.

set.seed(51)
## Set the parameters n and p
size <- 10    # number of trials
p    <- 0.25  # prob of success
## Draw n realizations from the binomial distribution:
n <- 5
rbinom(n = n, size = size, prob = p)
#> [1] 4 1 2 6 1
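The additivity property stated above can be checked by simulation. The following R sketch (added for illustration; the parameter values are arbitrary) compares the simulated distribution of $X_1 + X_2$ with the exact Binomial$(n_1 + n_2, p)$ probabilities:

set.seed(123)
R  <- 100000
p  <- 0.3
n1 <- 4
n2 <- 6
X1 <- rbinom(R, size = n1, prob = p)
X2 <- rbinom(R, size = n2, prob = p)
## Simulated vs. exact Binomial(n1 + n2, p) probabilities
S <- factor(X1 + X2, levels = 0:(n1 + n2))
round(cbind(simulated = as.numeric(table(S)) / R,
            exact     = dbinom(0:(n1 + n2), size = n1 + n2, prob = p)), 3)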
1.2.10 Some Important Continuous Random Variables
1.2.10.1 The Uniform Distribution
$X$ has a Uniform$(a, b)$ distribution, written $X \sim \text{Uniform}(a, b)$, if
$$f(x) = \begin{cases} \frac{1}{b - a} & \text{for } x \in [a, b] \\ 0 & \text{otherwise,} \end{cases}$$
where $a < b$. The distribution function is
$$F(x) = \begin{cases} 0 & x < a \\ \frac{x - a}{b - a} & x \in [a, b] \\ 1 & x > b. \end{cases}$$

## Drawing from the uniform distribution:
n <- 10
a <- 0
b <- 1
runif(n = n, min = a, max = b)
#> [1] 0.83442365 0.75138318 0.40601047 0.97101998 0.11233151
#> [6] 0.50750617 0.69714201 0.17104008 0.25448233 0.01813812
1.2.10.2 The Normal (or Gaussian) Distribution
$X$ has a Normal (or Gaussian) distribution with parameters $\mu$ and $\sigma$, denoted by $X \sim N(\mu, \sigma^2)$, if
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{1}{2\sigma^2}(x - \mu)^2\right\}, \quad x \in \mathbb{R},$$
where $\mu \in \mathbb{R}$ and $\sigma > 0$. Later we shall see that $\mu$ is the center (or mean) of the distribution and $\sigma$ is the spread (or standard deviation) of the distribution. The Normal plays an important role in probability and statistics. Many phenomena in nature have approximately Normal distributions. The Central Limit Theorem gives a special role to the Normal distribution by stating that the distribution of averages of random variables can be approximated by a Normal distribution.

We say that $X$ has a standard Normal distribution if $\mu = 0$ and $\sigma = 1$. Tradition dictates that a standard Normal random variable is denoted by $Z$. The PDF and CDF of a standard Normal are denoted by $\phi(z)$ and $\Phi(z)$. There is no closed-form expression for $\Phi$. Here are some useful facts:

(i) If $X \sim N(\mu, \sigma^2)$ then $Z = (X - \mu)/\sigma \sim N(0, 1)$.
(ii) If $Z \sim N(0, 1)$ then $X = \mu + \sigma Z \sim N(\mu, \sigma^2)$.
(iii) If $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, \dots, n$, are independent, then
$$\sum_{i=1}^{n} X_i \sim N\left(\sum_{i=1}^{n} \mu_i,\; \sum_{i=1}^{n} \sigma_i^2\right).$$
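A brief simulation sketch in R (illustrative; the parameter values are arbitrary) for fact (i), checking that standardizing a Normal variable yields approximately zero mean and unit variance:

set.seed(7)
mu    <- 3
sigma <- 2
X <- rnorm(100000, mean = mu, sd = sigma)
Z <- (X - mu) / sigma   # standardization as in fact (i)
c(mean(Z), var(Z))      # approximately 0 and 1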

The following R code plots the standard Normal density function:

# draw a plot of the N(0,1) PDF
curve(dnorm(x),
      xlim = c(-3.5, 3.5),
      ylab = "Density",
      main = "Standard Normal Density Function")

[Figure: Standard Normal Density Function]

This is how you can draw realizations from pseudo-random Normal variables:

## Drawing from the normal distribution:
n     <- 12
mu    <- 0
sigma <- 1
rnorm(n = n, mean = mu, sd = sigma)
#> [1]  0.085602504 -0.695791615 -1.364410561 -0.183503290
#> [5] -1.675347076  0.007303551  0.346965187  0.037914318
#> [9]  0.881345676 -0.882815597 -0.883560071 -0.795629557
An extension of the normal distribution in a univariate setting is the multivariate normal distribution. Let $X = (X_1, \dots, X_k)'$ be a $k$-dimensional normal variable, short $X \sim N_k(\mu, \Sigma)$, with mean vector $E(X) = \mu \in \mathbb{R}^k$ and covariance matrix $\text{Cov}(X) = \Sigma$. The joint density function or probability density function (pdf) of the $k$-dimensional multivariate normal distribution is
$$f_X(x_1, \dots, x_k) = \frac{\exp\left(-\frac{1}{2}(x - \mu)'\Sigma^{-1}(x - \mu)\right)}{\sqrt{(2\pi)^k |\Sigma|}},$$
where $|\Sigma|$ denotes the determinant of $\Sigma$. For $k = 2$ we have the bivariate pdf of two random normal variables, $X$ and $Y$ say,
$$g_{X,Y}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho_{XY}^2}} \exp\left\{-\frac{1}{2(1 - \rho_{XY}^2)}\left[\left(\frac{x - \mu_X}{\sigma_X}\right)^2 - 2\rho_{XY}\left(\frac{x - \mu_X}{\sigma_X}\right)\left(\frac{y - \mu_Y}{\sigma_Y}\right) + \left(\frac{y - \mu_Y}{\sigma_Y}\right)^2\right]\right\}.$$
Let's consider the special case where $X$ and $Y$ are independent standard normal random variables with densities $f_X(x)$ and $f_Y(y)$. We then have the parameters $\sigma_X = \sigma_Y = 1$, $\mu_X = \mu_Y = 0$ (due to marginal standard normality) and correlation $\rho_{XY} = 0$ (due to independence). The joint density of $X$ and $Y$ then becomes
$$g_{X,Y}(x, y) = f_X(x) f_Y(y) = \frac{1}{2\pi}\exp\left\{-\frac{1}{2}\left(x^2 + y^2\right)\right\}.$$
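As a numerical illustration (a sketch added here, not part of the original script; the function name dbivnorm and the evaluation points are chosen for this example), we can code the bivariate normal density from above in R and check that, with $\rho_{XY} = 0$ and standard normal marginals, it coincides with the product of the marginal densities:

## Bivariate normal pdf as given above (general parameters)
dbivnorm <- function(x, y, mu_x = 0, mu_y = 0, s_x = 1, s_y = 1, rho = 0) {
  zx <- (x - mu_x) / s_x
  zy <- (y - mu_y) / s_y
  1 / (2 * pi * s_x * s_y * sqrt(1 - rho^2)) *
    exp(-(zx^2 - 2 * rho * zx * zy + zy^2) / (2 * (1 - rho^2)))
}
## With rho = 0 and standard normal marginals, the joint density factorizes:
dbivnorm(x = 0.5, y = -1)   # joint density
dnorm(0.5) * dnorm(-1)      # product of marginals, same value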
1.2.10.3 The Chi-Squared Distribution
The chi-squared distribution is another distribution relevant in econometrics. It is often needed when testing special types of hypotheses frequently encountered when dealing with regression models.

The sum of $M$ squared independent standard normal distributed random variables, $Z_1, \dots, Z_M$, follows a chi-squared distribution with $M$ degrees of freedom:
$$Z_1^2 + \dots + Z_M^2 = \sum_{m=1}^{M} Z_m^2 \sim \chi_M^2.$$
A $\chi^2$ distributed random variable with $M$ degrees of freedom has expectation $M$, mode at $M - 2$ for $M \geq 2$, and variance $2M$.
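A short simulation sketch in R (added for illustration; M = 5 is an arbitrary choice) checking the stated expectation and variance:

set.seed(99)
M <- 5
W <- rchisq(100000, df = M)
c(mean = mean(W), variance = var(W))   # approximately M = 5 and 2*M = 10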
Using the code below, we can display the pdf and the distribution function or cumulative distribution function (cdf) of a $\chi_3^2$ random variable in a single plot. This is achieved by setting the argument add = TRUE in the second call of curve(). Further, we adjust the limits of both axes using xlim and ylim and choose different colors to make both functions better distinguishable. The plot is completed by adding a legend with the help of legend().
# plot the PDF
curve(dchisq(x, df = 3),
      xlim = c(0, 10),
      ylim = c(0, 1),
      col = "blue",
      ylab = "",
      main = "pdf and cdf of Chi-Squared Distribution, M = 3")

# add the CDF to the plot
curve(pchisq(x, df = 3),
      xlim = c(0, 10),
      add = TRUE,
      col = "red")

# add a legend to the plot
legend("topleft",
       c("PDF", "CDF"),
       col = c("blue", "red"),
       lty = c(1, 1))

[Figure: pdf and cdf of Chi-Squared Distribution, M = 3]
Since the outcomes of a $\chi_M^2$ distributed random variable are always positive, the support of the related PDF and CDF is $\mathbb{R}_{\geq 0}$.

As expectation and variance depend (solely!) on the degrees of freedom, the distribution's shape changes drastically if we vary the number of squared standard normals that are summed up. This relation is often depicted by overlaying densities for different $M$, see the Wikipedia article.

We reproduce this here by plotting the density of the $\chi_1^2$ distribution on the interval $[0, 15]$ with curve(). In the next step, we loop over degrees of freedom $M = 2, \dots, 7$ and add a density curve for each $M$ to the plot. We also adjust the line color for each iteration of the loop by setting col = M. Lastly, we add a legend that displays the degrees of freedom and the associated colors.

# plot the density for M=1
curve(dchisq(x, df = 1),
      xlim = c(0, 15),
      xlab = "x",
      ylab = "Density",
      main = "Chi-Square Distributed Random Variables")

# add densities for M=2,...,7 to the plot using a for() loop
for (M in 2:7) {
  curve(dchisq(x, df = M),
        xlim = c(0, 15),
        add = TRUE,
        col = M)
}

# add a legend
legend("topright", as.character(1:7),
       col = 1:7, lty = 1,
       title = "D.F.")

[Figure: Chi-Square Distributed Random Variables, densities for M = 1, ..., 7]
Increasing the degrees of freedom shifts the distribution to the right (the mode becomes larger) and increases the dispersion (the distribution's variance grows).
1.2.10.4 The Student t Distribution
Let $Z$ be a standard normal random variable, $W$ a $\chi_M^2$ random variable, and further assume that $Z$ and $W$ are independent. Then it holds that
$$\frac{Z}{\sqrt{W/M}} =: X \sim t_M$$
and $X$ follows a Student $t$ distribution (or simply $t$ distribution) with $M$ degrees of freedom.

The shape of a $t$ distribution depends on $M$. $t$ distributions are symmetric, bell-shaped and look similar to a normal distribution, especially when $M$ is large. This is not a coincidence: for a sufficiently large $M$, the $t$ distribution can be approximated by the standard normal distribution. This approximation works reasonably well for $M \geq 30$.

A $t$ distributed random variable $X$ has an expectation if $M > 1$ and it has a variance if $M > 2$:
$$E(X) = 0, \quad M > 1$$
$$\text{Var}(X) = \frac{M}{M - 2}, \quad M > 2$$
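A quick simulation sketch in R (illustrative; M = 5 chosen arbitrarily) comparing the sample variance of $t$-distributed draws with the formula $M/(M-2)$:

set.seed(5)
M <- 5
x <- rt(1000000, df = M)
c(simulated = var(x), formula = M / (M - 2))   # both approximately 1.67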
Let us plot some $t$ distributions with different degrees of freedom and compare them to the standard normal distribution.

# plot the standard normal density
curve(dnorm(x),
      xlim = c(-4, 4),
      xlab = "x",
      lty = 2,
      ylab = "Density",
      main = "Densities of t Distributions")

# plot the t density for M=2
curve(dt(x, df = 2),
      xlim = c(-4, 4),
      col = 2,
      add = TRUE)

# plot the t density for M=4
curve(dt(x, df = 4),
      xlim = c(-4, 4),
      col = 3,
      add = TRUE)

# plot the t density for M=25
curve(dt(x, df = 25),
      xlim = c(-4, 4),
      col = 4,
      add = TRUE)

# add a legend
legend("topright",
       c("N(0, 1)", "M=2", "M=4", "M=25"),
       col = 1:4,
       lty = c(2, 1, 1, 1))
[Figure: Densities of t Distributions for M = 2, 4, 25 and the N(0, 1) density]

The plot illustrates that as the degrees of freedom increase, the shape of the $t$ distribution comes closer to that of a standard normal bell curve. Already for $M = 25$ we find little difference to the standard normal density.

If $M$ is small, we find the distribution to have heavier tails than a standard normal.
1.2.10.5 Cauchy Distribution
The Cauchy distribution is a special case of the $t$ distribution corresponding to $M = 1$. The density is
$$f(x) = \frac{1}{\pi(1 + x^2)}.$$
For the Cauchy distribution, the expectation does not exist, that is, it has no mean. Let's try to compute the mean of a Cauchy distribution and see what goes wrong. Its mean should be
$$\mu = E(X) = \int_{-\infty}^{\infty} \frac{x}{\pi(1 + x^2)}\,dx.$$
In order for this improper integral to exist, we need both integrals $\int_{-\infty}^{0}$ and $\int_{0}^{\infty}$ to be finite. Let's look at the second integral.
$$\int_{0}^{\infty} \frac{x}{\pi(1 + x^2)}\,dx = \frac{1}{2\pi}\log\left(1 + x^2\right)\Big|_{0}^{\infty} = \infty$$
Similarly, the other integral, $\int_{-\infty}^{0}$, is $-\infty$. Since they're not both finite, the integral doesn't exist. In other words, $\infty - \infty$ is not a number. Thus, the Cauchy distribution has no mean.

What this means in practice is that if you take a sample $x_1, x_2, \dots, x_n$ from the Cauchy distribution, then the average $\bar{x}$ does not tend to a particular number. Instead, every so often you will get such a huge number, either positive or negative, that the average is overwhelmed by it.
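This behavior is easy to visualize by simulation. The following R sketch (added for illustration, not part of the original script) plots the running mean of Cauchy draws, which keeps jumping, against the running mean of standard normal draws, which settles down:

set.seed(123)
n <- 10000
x_cauchy <- rt(n, df = 1)   # Cauchy = t distribution with 1 degree of freedom
x_normal <- rnorm(n)
running_mean_cauchy <- cumsum(x_cauchy) / (1:n)
running_mean_normal <- cumsum(x_normal) / (1:n)
plot(running_mean_cauchy, type = "l", col = "darkorange",
     xlab = "n", ylab = "Running mean")
lines(running_mean_normal, col = "darkblue")
legend("topright", c("Cauchy", "N(0,1)"),
       col = c("darkorange", "darkblue"), lty = 1)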

1.2.10.6 The F Distribution
Another ratio of random variables important to econometricians is the ratio of two independent $\chi^2$ distributed random variables that are divided by their degrees of freedom $M$ and $n$. The quantity
$$\frac{W/M}{V/n} \sim F_{M,n} \quad \text{with} \quad W \sim \chi_M^2, \; V \sim \chi_n^2$$
follows an $F$ distribution with numerator degrees of freedom $M$ and denominator degrees of freedom $n$, denoted $F_{M,n}$. The distribution was first derived by George Snedecor but was named in honor of Sir Ronald Fisher.

By definition, the support of both PDF and CDF of an $F_{M,n}$ distributed random variable is $\mathbb{R}_{\geq 0}$.

Say we have an $F$ distributed random variable $Y$ with numerator degrees of freedom 3 and denominator degrees of freedom 14 and we are interested in $P(Y \geq 2)$. This can be computed with the help of the function pf(). By setting the argument lower.tail to FALSE we ensure that R computes $1 - P(Y \leq 2)$, i.e., the probability mass in the tail right of 2.

We can visualize this probability by drawing a line plot of the related density and adding a color shading with polygon().

pf(2, df1 = 3, df2 = 14, lower.tail = F)
#> [1] 0.8396462
# define coordinate vectors for vertices of the polygon
x <- c(2, seq(2, 10, 0.01), 10)
y <- c(0, df(seq(2, 10, 0.01), 3, 14), 0)
# draw density of F_{3, 14}
curve(df(x, 3, 14),
      ylim = c(0, 0.8),
      xlim = c(0, 10),
      ylab = "Density",
      main = "Density Function")
# draw the polygon
polygon(x, y, col = "orange")

[Figure: Density Function of the F_{3,14} distribution with the right tail beyond 2 shaded]

The $F$ distribution is related to many other distributions. An important special case encountered in econometrics arises if the denominator degrees of freedom are large, such that the $F_{M,n}$ distribution can be approximated by the $F_{M,\infty}$ distribution, which turns out to be simply the distribution of a $\chi_M^2$ random variable divided by its degrees of freedom $M$, i.e.
$$W/M \sim F_{M,\infty} \quad \text{with} \quad W \sim \chi_M^2.$$

Chapter 2

Review: Simple Linear Regression

2.1 The Simple Linear Regression Model

The Data-Generating Process

We assume some outcome, $Y_i \in \mathbb{R}$, and the explanatory variable, $X_i \in \mathbb{R}$, are related to each other via the following linear model:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad (2.1)$$
where $i = 1, \dots, n$ indexes the individual observations and $n$ denotes the sample size.

Equation (2.1) describes the data generating process that generates the individual observations that we observe in the data.

The parameters $\beta_0 \in \mathbb{R}$ and $\beta_1 \in \mathbb{R}$ are fixed (deterministic) parameters and their unknown values are the objects of our inquiry.

The explanatory variables, $X_i$, may be either fixed (deterministic) or random (fixed/deterministic or random design). While deterministic $X_i$ remain fixed in repeated samples, random $X_i$ have different realizations in repeated samples. In case of random designs, we assume that each $X_i$ is drawn from the same distribution and that the individual $X_i$'s are independent from each other. In this case we say that the $X_i$'s are independent and identically distributed (i.i.d.).

Examples:
Fixed Design (FD): $X_i$ = amount of money paid out to individual $i$ in an economic laboratory experiment.
Random Design (RD): $X_i$ = years of education of the randomly sampled individual $i$.

The error terms $\varepsilon_1, \dots, \varepsilon_n$ are random variables representing the non-systematic and/or unobserved influences on $Y_i$.

The dependent variable $Y_i$ is always random since the error terms $\varepsilon_i$ are assumed random, for all $i = 1, \dots, n$.

2.1.1 Assumptions About the Error Term

2.1.1.1 The Distribution of $\varepsilon_i$

A typical distributional assumption on the error terms is the i.i.d. assumption. That is, each $\varepsilon_i$ is drawn from the same (but unknown) distribution and two error terms $\varepsilon_i$ and $\varepsilon_j$ are independent of each other for all $i \neq j$. This is the assumption typically adopted when randomly sampling data from a large population.

Sometimes we need to be even more specific, for instance, by assuming that the error terms $\varepsilon_1, \dots, \varepsilon_n$ are i.i.d. normal, i.e.
$$\varepsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2) \quad \text{for all } i = 1, \dots, n.$$

Note. The above i.i.d. assumptions are not as restrictive as they may seem at first sight. They allow that the conditional variances are heteroscedastic, i.e. $\text{Var}(\varepsilon_i \,|\, X_i) = \sigma_i^2$ with different variance terms $\sigma_i^2$ across $i = 1, \dots, n$.

The literature often distinguishes between the following general scenarios for the conditional variances of the error terms:

Homoscedastic error terms: The conditional variances $\text{Var}(\varepsilon_i \,|\, X_i = x_i) = \sigma^2$ are equal to some constant $\sigma^2 > 0$ for every possible realization $x_i$ of $X_i$. (Due to the i.i.d. setting we then additionally have that $\text{Var}(\varepsilon_i \,|\, X_i = x_i) = \sigma^2$ for all $i = 1, \dots, n$.)
Heteroscedastic error terms: The conditional variances $\text{Var}(\varepsilon_i \,|\, X_i = x_i) = \sigma^2(x_i)$ are equal to a non-constant variance function $\sigma^2(x_i)$ which is a function of the realization $x_i$ of $X_i$. Example: $\text{Var}(\varepsilon_i \,|\, X_i) = \frac{1}{12}(X_i)^2$.
2.1.1.2 The Exogeneity Assumption $E[\varepsilon_i \,|\, X_i] = 0$

Since the explanatory variables $X_i$ are random variables under the random design, we need to think about possible dependencies between $X_1, \dots, X_n$ and the error terms
$$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i, \quad i = 1, \dots, n.$$
An assumption of central importance is that the conditional mean of $\varepsilon_i$ given $X_i$ is zero, i.e.
$$E[\varepsilon_i \,|\, X_i] = 0, \quad \text{for all } i = 1, \dots, n.$$
In other words, on average, the non-systematic determinants of $Y_i$ are zero no matter the value of $X_i$. That is, this assumption is a statement about the relationship between $\varepsilon_i$ and $X_i$. It turns out that the assumption $E[\varepsilon_i \,|\, X_i] = 0$ is one of the most important assumptions we make in econometrics. When it is violated it causes fundamental econometric problems. Unfortunately, it is generally impossible to check this assumption since we do not observe realizations of the error terms $\varepsilon_1, \dots, \varepsilon_n$.

Terminology: The exogeneity assumption $E[\varepsilon_i \,|\, X_i] = 0$, for all $i = 1, \dots, n$, is also known as the orthogonality assumption.

Implications of the exogeneity assumption:

The assumption that $E[\varepsilon_i \,|\, X_i] = 0$, for all $i = 1, \dots, n$, implies that the correlation between $X_i$ and $\varepsilon_i$ is zero for all $i = 1, \dots, n$, although the converse is not true.

The assumption that $E[\varepsilon_i \,|\, X_i] = 0$, for all $i = 1, \dots, n$, also implies that $E[\varepsilon_i] = 0$ for all $i = 1, \dots, n$.

Note: Under a fixed design, we simply assume that $E[\varepsilon_i] = 0$ for all $i = 1, \dots, n$.
2.1.2 The Population Regression Line
We now have settled all assumptions regarding the simple linear regression model (see Chapter 3 for a concise overview). This allows us to think about the relationship between $Y_i$ and $X_i$ under the regime of the model assumptions. For instance, note that we can write
$$E[Y_i \,|\, X_i] = E[\beta_0 + \beta_1 X_i + \varepsilon_i \,|\, X_i] = \beta_0 + \beta_1 X_i + E[\varepsilon_i \,|\, X_i].$$
Then, by our exogeneity assumption, $E[\varepsilon_i \,|\, X_i] = 0$, we have that
$$E[Y_i \,|\, X_i] = \beta_0 + \beta_1 X_i.$$
This conditional expectation of $Y_i$ given $X_i$ defines the population regression line; we do not observe this object, but want to estimate it.
2.1.3 Terminology: Estimates versus Estimators
Our goal is to devise some way to estimate the unknown parameters $\beta_0 \in \mathbb{R}$ and $\beta_1 \in \mathbb{R}$ which define the population regression line. We call our method or formula for estimating these parameters an estimator. An estimator is a function of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$ and, therefore, is itself a random variable that (generally) has different realizations in repeated samples. The result of applying our estimator to specific data, i.e., to a given observed realization $((Y_1^{obs}, X_1^{obs}), \dots, (Y_n^{obs}, X_n^{obs}))$ of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$, is called an estimate.

Caution: We will usually not use different notations for random variables/estimators and their realizations/estimates since often both points of view tell a necessary story.
2.2 Ordinary Least Squares Estimation
There are many ways to estimate the parameters $\beta_0$ and $\beta_1$ of the simple linear regression model in (2.1). In the following, we will derive the Ordinary Least Squares (OLS) estimator using a mechanical approach. In the next chapter, we will use an alternative approach (methods of moments) which provides a perspective that is particularly interesting/useful in econometrics. You will see that the properties of the estimators for $\beta_0$ and $\beta_1$ crucially depend on our assumptions regarding the error terms $\varepsilon_i$ and the explanatory variables $X_i$.

2.2.1 Terminology: Sample Regression Line, Prediction and Residuals
Let us define some necessary terms. Call $\hat{\beta}_0$ our estimate of $\beta_0$ and $\hat{\beta}_1$ our estimate of $\beta_1$. Now, define the predicted value, $\hat{Y}_i$, of the dependent variable, $Y_i$, to be
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i. \quad (2.2)$$
This is just the prediction of the dependent variable, $Y_i$, given the value of $X_i$ and the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. Equation (2.2) defines the sample regression line.

Define the residual, $\hat{\varepsilon}_i$, as the difference between the observed value, $Y_i$, and the predicted value, $\hat{Y}_i$:
$$\hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i$$
The residual, $\hat{\varepsilon}_i$, is the vertical distance between the observed value, $Y_i$, and the sample regression line, i.e., the prediction, $\hat{Y}_i$, of $Y_i$.

We must make an important distinction between the residuals, $\hat{\varepsilon}_i$, and the errors $\varepsilon_i$:
$$\hat{\varepsilon}_i = Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \quad \text{(computable)}$$
$$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i \quad \text{(unobservable)}$$
Because $\beta_0$ and $\beta_1$ are unknown, we can never know the value of the error terms $\varepsilon_i$. However, because we actually come up with the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, and because we observe $X_i$, we can calculate the residual $\hat{\varepsilon}_i$ for each observation. This distinction is important to keep in mind.

Note that we can write
$$\varepsilon_i = Y_i - (\beta_0 + \beta_1 X_i) = Y_i - E[Y_i \,|\, X_i].$$
But for this to make sense, we needed to impose the exogeneity assumption that $E[\varepsilon_i \,|\, X_i] = 0$, since only then can we identify the population regression line $\beta_0 + \beta_1 X_i$ using the conditional mean of $Y_i$ given $X_i$.
2.2.2 Deriving the Expression of the OLS Estimator
The method of ordinary least squares (OLS) estimation has a long history and was first described by Legendre in 1805 although Karl Friedrich Gauss claimed to use OLS since 1795. In 1809 Gauss published his work on OLS which extended the work of Legendre.
The idea of OLS is to choose parameter values that minimize the sum of squared residuals for a given data set. These minimizing parameters are then the estimates of the unknown population parameters. It turns
out that the OLS estimator is equipped with several desirable properties. In a sense, OLS is a purely mechanical method. We will see that it is equivalent to an alternative estimation method called methods of moments which have a more profound econometric motivation, given a certain set of
(moment-)assumptions.
Deriving the OLS estimate algebraically. Our objective is to find the parameter values $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize the sum of squared residuals, $S_n(b_0, b_1)$, for a given sample (i.e., for given data) $((Y_1, X_1), \dots, (Y_n, X_n))$ of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$, where
$$
\begin{aligned}
S_n(b_0, b_1) &= \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \\
&= \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)^2 \\
&= \sum_{i=1}^{n} \left(Y_i^2 - 2 b_0 Y_i - 2 b_1 Y_i X_i + b_0^2 + 2 b_0 b_1 X_i + b_1^2 X_i^2\right).
\end{aligned}
$$
Now partially differentiate the last line with respect to $b_0$ and $b_1$, respectively:
$$\frac{\partial S_n(b_0, b_1)}{\partial b_0} = \sum_{i=1}^{n} \left(-2 Y_i + 2 b_0 + 2 b_1 X_i\right)$$
$$\frac{\partial S_n(b_0, b_1)}{\partial b_1} = \sum_{i=1}^{n} \left(-2 Y_i X_i + 2 b_0 X_i + 2 b_1 X_i^2\right)$$
Next, we want to find the minimizing arguments
$$(\hat{\beta}_0, \hat{\beta}_1)' = \underset{(b_0, b_1) \in \mathbb{R}^2}{\arg\min}\; S_n(b_0, b_1).$$
For this we set the two partial derivatives equal to zero, which gives us two equations that fully determine the values $\hat{\beta}_0$ and $\hat{\beta}_1$:
$$n\hat{\beta}_0 - \sum_{i=1}^{n} Y_i + \hat{\beta}_1 \sum_{i=1}^{n} X_i = 0$$
$$\sum_{i=1}^{n} \left(-Y_i X_i + \hat{\beta}_0 X_i + \hat{\beta}_1 X_i^2\right) = 0$$
The two latter equations are known as the least squares normal equations.
It is easy to see from the first normal equation that the OLS estimator of $\beta_0$ is
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}. \quad (2.3)$$
Substituting $\hat{\beta}_0$ into the second normal equation gives
$$
\begin{aligned}
0 &= \sum_{i=1}^{n} \left(-Y_i X_i + (\bar{Y} - \hat{\beta}_1 \bar{X}) X_i + \hat{\beta}_1 X_i^2\right) \\
&= \sum_{i=1}^{n} \left(-X_i (Y_i - \bar{Y}) + \hat{\beta}_1 X_i (X_i - \bar{X})\right).
\end{aligned}
$$
Solving for $\hat{\beta}_1$ gives
$$
\begin{aligned}
\hat{\beta}_1 &= \left(\sum_{i=1}^{n} X_i (Y_i - \bar{Y})\right) \Big/ \left(\sum_{i=1}^{n} X_i (X_i - \bar{X})\right) \\
&= \frac{\sum_{i=1}^{n} (Y_i - \bar{Y}) X_i}{\sum_{i=1}^{n} (X_i - \bar{X}) X_i} \\
&= \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})} \\
&= \frac{\sum_{i=1}^{n} (X_i - \bar{X}) Y_i}{\sum_{i=1}^{n} (X_i - \bar{X})^2}.
\end{aligned}
$$
The last two lines follow from the useful result that will be discussed in the exercises of this chapter.
Here is some R code for computing $\hat{\beta}_0$ and $\hat{\beta}_1$ for a given realization of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$, i.e. for a given (well, here simulated) data set $((Y_1, X_1), \dots, (Y_n, X_n))$.

n <- 25  # sample size
## simulate data
set.seed(3)
X     <- runif(n, min = 1, max = 10)
error <- rnorm(n, mean = 0, sd = 5)
beta0 <- 1
beta1 <- 2
Y <- beta0 + beta1 * X + error
## save simulated data as data frame
data_sim <- data.frame("Y" = Y, "X" = X)
## OLS fit
lm_obj <- lm(Y ~ X, data = data_sim)
##
## Plot
par(family = "serif")
plot(x = data_sim$X, y = data_sim$Y, main = "", axes = FALSE,
     pch = 16, cex = 0.8, xlab = "X", ylab = "Y")
axis(1, tick = FALSE)
axis(2, tick = FALSE, las = 2)
abline(lm_obj, lty = 2, lwd = 1.3, col = "darkorange")
abline(a = beta0, b = beta1, lwd = 1.3, col = "darkblue")
legend("topleft", col = c("darkorange", "darkblue"),
       legend = c("Sample Regression Line", "Population Regression Line"),
       lwd = 1.3, lty = c(2, 1), bty = "n")

[Figure: Scatter plot of the simulated data with the sample regression line (dashed) and the population regression line (solid)]

## Estimates
coef(lm_obj)
#> (Intercept)           X
#>   -1.561339    2.512011
Interpretation of the Results. The coefficients have the usual intercept and slope interpretation. That is, for the unknown parameters $\beta_0$ and $\beta_1$ we have that
$$\frac{\partial E[Y \,|\, X]}{\partial X} = \beta_1 \quad \text{with} \quad E[Y \,|\, X] = \beta_0 + \beta_1 X.$$
That is, $\beta_1$ is the true (unknown) marginal effect of a one unit change in $X$ on $Y$. Therefore, $\hat{\beta}_1$ is the estimated marginal effect of a one unit change in $X$ on $Y$:
$$\frac{\partial \hat{E}[Y \,|\, X]}{\partial X} = \hat{\beta}_1 \quad \text{with} \quad \hat{E}[Y \,|\, X] = \hat{\beta}_0 + \hat{\beta}_1 X.$$

2.2.3 Behavior of the OLS Estimates for Resampled Data (conditionally on Xi)
Usually, we only observe the estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ computed for a given data set. However, in order to understand the statistical properties of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ we need to view them as random variables which yield different realizations in repeated samples generated from (2.1) conditionally on $X_1, \dots, X_n$. This allows us then to think about questions like:

Is the estimator able to estimate the unknown parameter value correctly on average (conditionally on a given set of $X_1, \dots, X_n$)?

Are the estimation results more precise if we have more data?

A first idea about the statistical properties of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ can be gained using Monte Carlo simulations as follows.
## Sample sizes
n_small <- 10   # small sample size
n_large <- 100  # large sample size
## True parameter values
beta0 <- 1
beta1 <- 1
## Generate explanatory variables (random design)
X_n_small <- runif(n_small, min = 1, max = 10)
X_n_large <- runif(n_large, min = 1, max = 10)
## Monte-Carlo (MC) Simulation
## 1. Generate data
## 2. Compute and store estimates
## Repeat steps 1. and 2. many times
set.seed(3)
## Number of Monte Carlo repetitions
## How many samples to draw from the models
rep <- 1000
## Containers to store the lm-results
n_small_list <- vector(mode = "list", length = rep)
n_large_list <- vector(mode = "list", length = rep)
for(r in 1:rep){
  ## Sampling from the model conditionally on X_n_small
  error_n_small <- rnorm(n_small, mean = 0, sd = 5)
  Y_n_small <- beta0 + beta1 * X_n_small + error_n_small
  n_small_list[[r]] <- lm(Y_n_small ~ X_n_small)
  ## Sampling from the model conditionally on X_n_large
  error_n_large <- rnorm(n_large, mean = 0, sd = 5)
  Y_n_large <- beta0 + beta1 * X_n_large + error_n_large
  n_large_list[[r]] <- lm(Y_n_large ~ X_n_large)
}
## Reading out the parameter estimates
beta0_estimates_n_small <- rep(NA, rep)
beta1_estimates_n_small <- rep(NA, rep)
beta0_estimates_n_large <- rep(NA, rep)
beta1_estimates_n_large <- rep(NA, rep)
for(r in 1:rep){
  beta0_estimates_n_small[r] <- n_small_list[[r]]$coefficients[1]
  beta1_estimates_n_small[r] <- n_small_list[[r]]$coefficients[2]
  beta0_estimates_n_large[r] <- n_large_list[[r]]$coefficients[1]
  beta1_estimates_n_large[r] <- n_large_list[[r]]$coefficients[2]
}

Now, we have produced realizations of the estimators $\hat{\beta}_0 \,|\, X$ and $\hat{\beta}_1 \,|\, X$ conditionally on
$$X = \begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}$$
and we have saved these realizations in beta0_estimates_n_small, beta1_estimates_n_small, beta0_estimates_n_large, and beta1_estimates_n_large. This allows us to visualize the behavior of the OLS estimates for the repeatedly sampled data (conditionally on $X_i$).

## Plotting the results
library("scales")  # alpha() produces transparent colors
## Define a common y-axis range
y_range <- range(beta0_estimates_n_small, beta1_estimates_n_small) * 1.1
## Generate the plot
par(family = "serif")  # Serif fonts
## Layout of plotting area
layout(matrix(c(1:6), 2, 3, byrow = TRUE), widths = c(3, 1, 1))
## Plot 1
plot(x = 0, y = 0, axes = FALSE, xlab = "X", ylab = "Y", type = "n",
     xlim = c(1, 10), ylim = c(-5, 35), main = "Small Sample (n=10)")
axis(1, tick = FALSE); axis(2, tick = FALSE, las = 2)
for(r in 1:rep){
  abline(n_small_list[[r]], lty = 2, lwd = 1.3, col = "darkorange")
}
abline(a = beta0, b = beta1, lwd = 1.3, col = "darkblue")
legend("topleft", col = c("darkorange", "darkblue"),
       legend = c("Sample regression lines from repeated samples (cond. on X)",
                  "Population regression line"),
       lwd = 1.3, lty = c(2, 1), bty = "n")
## Plot 2
plot(x = rep(0, rep), y = beta0_estimates_n_small, axes = FALSE,
     xlab = "", ylab = "", pch = 19, cex = 1.2, ylim = y_range,
     main = expression(hat(beta)[0]~"|"~X), col = alpha("red", 0.2))
points(x = 0, y = beta0, pch = "-", cex = 1.2, col = "black")
text(x = 0, y = beta0, labels = expression(beta[0]), pos = 4)
## Plot 3
plot(x = rep(0, rep), y = beta1_estimates_n_small, axes = FALSE,
     xlab = "", ylab = "", pch = 19, cex = 1.2, ylim = y_range,
     main = expression(hat(beta)[1]~"|"~X), col = alpha("red", 0.2))
points(x = 0, y = beta1, pch = "-", cex = 1.2, col = "black")
text(x = 0, y = beta1, labels = expression(beta[1]), pos = 4)
## Plot 4
plot(x = 0, y = 0, axes = FALSE, xlab = "X", ylab = "Y", type = "n",
     xlim = c(1, 10), ylim = c(-5, 35), main = "Large Sample (n=100)")
axis(1, tick = FALSE); axis(2, tick = FALSE, las = 2)
for(r in 1:rep){
  abline(n_large_list[[r]], lty = 2, lwd = 1.3, col = "darkorange")
}
abline(a = beta0, b = beta1, lwd = 1.3, col = "darkblue")
## Plot 5
plot(x = rep(0, rep), y = beta0_estimates_n_large, axes = FALSE,
     xlab = "", ylab = "", pch = 19, cex = 1.2, ylim = y_range,
     main = expression(hat(beta)[0]~"|"~X), col = alpha("red", 0.2))
points(x = 0, y = beta0, pch = "-", cex = 1.2, col = "black")
text(x = 0, y = beta0, labels = expression(beta[0]), pos = 4)
## Plot 6
plot(x = rep(0, rep), y = beta1_estimates_n_large, axes = FALSE,
     xlab = "", ylab = "", pch = 19, cex = 1.2, ylim = y_range,
     main = expression(hat(beta)[1]~"|"~X), col = alpha("red", 0.2))
points(x = 0, y = beta1, pch = "-", cex = 1.2, col = "black")
text(x = 0, y = beta1, labels = expression(beta[1]), pos = 4)

[Figure: Small Sample (n=10) and Large Sample (n=100) panels showing the sample regression lines from repeated samples (cond. on X), the population regression line, and the realizations of the estimators around the true values of beta0 and beta1]

These are promising plots:

The realizations of $\hat{\beta}_0 \,|\, X$ and $\hat{\beta}_1 \,|\, X$ are scattered around the true (unknown) parameter values $\beta_0$ and $\beta_1$ for both small and large samples.

The realizations of $\hat{\beta}_0 \,|\, X$ and $\hat{\beta}_1 \,|\, X$ concentrate more and more around the true (unknown) parameter values $\beta_0$ and $\beta_1$ as the sample size increases.

However, this was only a simulation for one specific data generating process. Such a Monte Carlo simulation does not allow us to generalize these properties. Next we use theoretical arguments to show that these properties also hold in general.

2.3 Properties of the OLS Estimator

Note that we derived the OLS estimator for a given realization $((y_1, x_1), \dots, (y_n, x_n))$ of the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$. So, why should OLS be a good estimator in general? Does it behave well when we consider all possible realizations of the OLS estimator for conditional resamplings, given $X_1, \dots, X_n$, from all possible data generating processes described by (2.1)? In the following we consider the two questions:

Are the conditional means of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$, given $X$, equal to the true parameter values $\beta_0$ and $\beta_1$?

Do the conditional variances of the estimators $\hat{\beta}_0$ and $\hat{\beta}_1$, given $X$, become small as the sample size $n$ becomes large?

2.3.1 Mean and Bias of the OLS Estimator

Because the estimator
$$\hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{pmatrix} = \begin{pmatrix} \bar{Y} - \hat{\beta}_1 \bar{X} \\[4pt] \dfrac{\sum_{i=1}^{n}(X_i - \bar{X}) Y_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \end{pmatrix} \quad (2.4)$$
is a function of the random variables in the random sample $((Y_1, X_1), \dots, (Y_n, X_n))$, it is itself a random variable. Therefore, $\hat{\beta}$ has a probability distribution, and we can talk sensibly about its expectation $E(\hat{\beta})$.

Note that the estimator $\hat{\beta}$ in (2.4) contains both the $Y_i$'s and the $X_i$'s. Therefore, under random designs, we need to be specific about which mean (or variance) of the estimator we want to consider.
For the random design, we consider conditional means (or variances) given the design matrix
$$X = \begin{pmatrix} 1 & X_1 \\ \vdots & \vdots \\ 1 & X_n \end{pmatrix}.$$
Using the conditional mean, $E[\hat{\beta} \,|\, X]$, we can define the conditional bias:
$$\text{Bias}(\hat{\beta} \,|\, X) = E[\hat{\beta} \,|\, X] - \beta.$$
For fixed designs, conditioning on $X$ is not false but superfluous since $X$ remains anyway fixed in repeated samples and therefore $\text{Bias}(\hat{\beta} \,|\, X) = \text{Bias}(\hat{\beta}) = E[\hat{\beta}] - \beta$.

Unbiasedness of OLS. The OLS estimator is called unbiased if its conditional expectation is equal to the true parameter, i.e. if
$$E[\hat{\beta} \,|\, X] = \beta \quad \Leftrightarrow \quad \text{Bias}(\hat{\beta} \,|\, X) = 0.$$
That is, the conditional mean of the random estimator equals the true parameter value which we want to estimate. Put another way, unbiasedness means that the conditional distribution of the estimator is centered on the true value of the parameter. Of course, the probability of a single estimate being exactly equal to the truth in any particular sample is zero if the error terms $\varepsilon_i$ (and therefore the $Y_i$) are continuous random variables.

To show that $\hat{\beta}_1$ is conditionally unbiased, we need to show that its conditional expectation is equal to $\beta_1$:
$$E[\hat{\beta}_1 \,|\, X] = E\left[\frac{\sum_{i=1}^{n}(X_i - \bar{X}) Y_i}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \,\Big|\, X\right] = E\left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})(\beta_0 + \beta_1 X_i + \varepsilon_i)}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \,\Big|\, X\right]$$
In case of a deterministic design, $X$ is fixed in repeated samples which makes the conditioning on $X$ superfluous since, for instance, $\sum_{i=1}^{n}(X_i - \bar{X})^2$ is then a constant which allows us to bring it outside of the expectation; the same applies to all other terms that involve only deterministic quantities. So, the conditioning on $X$ is actually only necessary under the random design, where it allows us to bring all terms containing only $X$ variables (or deterministic quantities) outside of the expectation. However, interpretations then must consider the conditioning on $X$.

So, we really only need to concern ourselves with the numerator, which we can write as:
$$\beta_0 \sum_{i=1}^{n}(X_i - \bar{X}) + \beta_1 \sum_{i=1}^{n}(X_i - \bar{X}) X_i + E\left[\sum_{i=1}^{n}(X_i - \bar{X})\varepsilon_i \,\Big|\, X\right]$$
For the first two terms, note that
$$\beta_0 \sum_{i=1}^{n}(X_i - \bar{X}) = \beta_0\left[\left(\sum_{i=1}^{n} X_i\right) - n\bar{X}\right] = 0$$
$$\beta_1 \sum_{i=1}^{n}(X_i - \bar{X}) X_i = \beta_1 \sum_{i=1}^{n}(X_i - \bar{X})^2,$$
where the latter result follows again from the useful result that will be discussed in the exercises of this chapter. So we are left with
$$E[\hat{\beta}_1 \,|\, X] = \beta_1 + \frac{1}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\, E\left[\sum_{i=1}^{n}(X_i - \bar{X})\varepsilon_i \,\Big|\, X\right]$$
Clearly, then, we want the last term here to be zero, so that $\hat{\beta}_1$ is conditionally unbiased. The last term is zero if the factor $E\left[\sum_{i=1}^{n}(X_i - \bar{X})\varepsilon_i \,|\, X\right] = 0$. For fixed/random designs we have the following expressions:

Fixed design: $E\left[\sum_{i=1}^{n}(X_i - \bar{X})\varepsilon_i \,|\, X\right] = \sum_{i=1}^{n}(X_i - \bar{X}) E[\varepsilon_i]$

Random design: $E\left[\sum_{i=1}^{n}(X_i - \bar{X})\varepsilon_i \,|\, X\right] = \sum_{i=1}^{n}(X_i - \bar{X}) E[\varepsilon_i \,|\, X_i]$

That is, the OLS estimator, $\hat{\beta}_1$, is (conditionally) unbiased if the exogeneity assumption $E[\varepsilon_i \,|\, X_i] = 0$ holds (or if $E[\varepsilon_i] = 0$ in case of the fixed design). We return, therefore, to the importance of the exogeneity assumption $E[\varepsilon_i \,|\, X_i] = 0$, since if this assumption does not hold, the estimator is biased and remains biased even asymptotically as $n \to \infty$.

The unbiasedness of $\hat{\beta}_1$ implies that $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$ is also an unbiased estimator, since
$$
\begin{aligned}
E[\hat{\beta}_0 \,|\, X] &= E\left[\bar{Y} - \hat{\beta}_1 \bar{X} \,|\, X\right] \\
&= E\left[\frac{1}{n}\sum_{i=1}^{n}(\beta_0 + \beta_1 X_i + \varepsilon_i) - \hat{\beta}_1 \bar{X} \,\Big|\, X\right] \\
&= \beta_0 + \beta_1 \bar{X} + \frac{1}{n}\sum_{i=1}^{n} E[\varepsilon_i \,|\, X_i] - E[\hat{\beta}_1 \,|\, X]\,\bar{X} = \beta_0,
\end{aligned}
$$
where the last line follows from the exogeneity assumption $E[\varepsilon_i \,|\, X] = 0$ and the unbiasedness of $\hat{\beta}_1$.

Note that the exogeneity assumption, $E[\varepsilon_i \,|\, X] = 0$, is the only distributional assumption on the error terms (and the $X_i$'s) that is needed to show that the OLS estimator is unbiased. In fact, OLS remains unbiased if the error terms are heteroscedastic and/or autocorrelated.
However, under the more restrictive assumption that the error terms are homoscedastic and not autocorrelated, one can show that the OLS estimator is the most efficient (lowest variance) estimator within the family of all unbiased estimators for the regression parameters $\beta_0$ and $\beta_1$. This result is known as the Gauss-Markov Theorem and we will consider this theorem in the context of the multiple regression model in more detail (see Chapter 3).

2.3.2 Variance of the OLS Estimator

Remember: An unbiased estimator can still be a very imprecise estimator if it has a large variance. In order to derive the variance of the OLS estimator one needs more than only the exogeneity assumption. One gets different variance expressions for different distributional assumptions on the error terms $\varepsilon_1, \dots, \varepsilon_n$. As we consider this in more detail for the multiple regression model, we consider in the following only the most simple case, namely, conditionally homoscedastic i.i.d. error terms. Under this simple scenario it can be shown that
$$\text{Var}(\hat{\beta}_0 \,|\, X) = \frac{\sigma^2}{n}\,\frac{n^{-1}\sum_{i=1}^{n} X_i^2}{n^{-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2} \quad (2.5)$$
$$\text{Var}(\hat{\beta}_1 \,|\, X) = \frac{\sigma^2}{n}\,\frac{1}{n^{-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}, \quad (2.6)$$
assuming that $\text{Var}(\varepsilon_i \,|\, X) = \sigma^2$. That is, equations (2.5) and (2.6) are only sensible for the case of homoscedastic error terms. From equations (2.5) and (2.6) it follows that $\text{Var}(\hat{\beta}_0 \,|\, X)$ and $\text{Var}(\hat{\beta}_1 \,|\, X)$ eventually become small as $n \to \infty$, which is good since this means that the estimators become more precise as the sample size $n$ increases.

Note that the variance expressions in (2.5) and (2.6) are not really useful in practice since usually we do not know the variance of the error terms $\sigma^2$. Under the i.i.d. assumption we can replace the unknown $\sigma^2$ by plugging in reasonable estimates such as
$$s_{ML}^2 = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2 \quad \text{or} \quad s_{UB}^2 = \frac{1}{n-2}\sum_{i=1}^{n}\hat{\varepsilon}_i^2,$$
where $s_{UB}^2$ is the unbiased estimator and $s_{ML}^2$ the maximum likelihood estimator of $\sigma^2$. This leads to the practically relevant variance estimators
$$\widehat{\text{Var}}(\hat{\beta}_0 \,|\, X) = \frac{s^2}{n}\,\frac{n^{-1}\sum_{i=1}^{n} X_i^2}{n^{-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$
$$\widehat{\text{Var}}(\hat{\beta}_1 \,|\, X) = \frac{s^2}{n}\,\frac{1}{n^{-1}\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2},$$
where $s^2 = s_{ML}^2$ or $s^2 = s_{UB}^2$.

2.3.3 Consistency of the OLS Estimator

Now, we have everything together in order to show that the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are consistent estimators of the unknown parameter values $\beta_0$ and $\beta_1$. Since the OLS estimator is unbiased, we do not need to bother with a possible bias problem, provided the underlying assumptions of the simple linear regression model are true. Moreover, since the variances of the estimators $\hat{\beta}_{0n}$ and $\hat{\beta}_{1n}$ converge to zero as $n \to \infty$, we have that the Mean Squared Error (MSE) converges to zero, i.e.
$$\text{MSE}(\hat{\beta}_{jn} \,|\, X) = E\left[(\hat{\beta}_{jn} - \beta_j)^2 \,|\, X\right] = \underbrace{\big(\text{Bias}(\hat{\beta}_{jn} \,|\, X)\big)^2}_{=0} + \text{Var}(\hat{\beta}_{jn} \,|\, X) \to 0 \quad \text{as } n \to \infty \text{ for } j = 0, 1.$$
This means that the estimators $\hat{\beta}_{0n}$ and $\hat{\beta}_{1n}$ converge in mean-square as $n \to \infty$. Fortunately, mean-square convergence implies convergence in probability, or short $\hat{\beta}_{jn} \to_p \beta_j$ as $n \to \infty$ for $j = 0, 1$.[1] The latter result is exactly what we mean by saying that an estimator is (weakly) consistent.

Decomposition of the MSE. Let $\hat{\theta}$ be some estimator of a univariate parameter $\theta$; then we can decompose the MSE of $\hat{\theta}$ as follows:
$$
\begin{aligned}
\text{MSE}(\hat{\theta}) &= E\left[(\hat{\theta} - \theta)^2\right] \\
&= E\left[\Big(\big(\hat{\theta} - E(\hat{\theta})\big) + \underbrace{\big(E(\hat{\theta}) - \theta\big)}_{\text{deterministic}}\Big)^2\right] \\
&= E\left[(\hat{\theta} - E(\hat{\theta}))^2 + 2(\hat{\theta} - E(\hat{\theta}))(E(\hat{\theta}) - \theta) + (E(\hat{\theta}) - \theta)^2\right] \\
&= \underbrace{E\left[(\hat{\theta} - E(\hat{\theta}))^2\right]}_{=\text{Var}(\hat{\theta})} + \underbrace{E\left[2(\hat{\theta} - E(\hat{\theta}))(E(\hat{\theta}) - \theta)\right]}_{=0, \text{ since } E(\hat{\theta}) - E(\hat{\theta}) = 0} + \underbrace{(E(\hat{\theta}) - \theta)^2}_{=(\text{Bias}(\hat{\theta}))^2} \\
&= \text{Var}(\hat{\theta}) + \big(\text{Bias}(\hat{\theta})\big)^2
\end{aligned}
$$

[1] See, for instance, https://www.statlect.com/asymptotic-theory/relations-among-modes-of-convergence

Chapter 3

Multiple Linear Regression

Preamble. In the following we focus on the case of random designs $X$, since this is the more relevant case in econometrics. This point of view requires us to consider conditional means and variances given $X$.
That is, if we were able to resample from the model, we would do so by fixing the in-principle random explanatory variables $X$.

3.1 Assumptions

The multiple linear regression model is defined by the following assumptions:

Assumption 1: The Linear Model Assumption (Data Generating Process)
$$Y_i=\sum_{k=1}^K\beta_kX_{ik}+\varepsilon_i,\qquad i=1,\dots,n.\qquad(3.1)$$
Usually, a constant (intercept) is included; in this case $X_{i1}=1$ for all $i$. In the following we will always assume that $X_{i1}=1$ for all $i$, unless otherwise stated. It is convenient to write equation (3.1) using matrix notation
$$Y_i=\underset{(1\times K)}{X_i^\top}\,\underset{(K\times1)}{\beta}+\varepsilon_i,\qquad i=1,\dots,n,$$
where $X_i=(X_{i1},\dots,X_{iK})^\top$ and $\beta=(\beta_1,\dots,\beta_K)^\top$. Stacking all individual rows $i$ leads to
$$\underset{(n\times1)}{Y}=\underset{(n\times K)}{X}\,\underset{(K\times1)}{\beta}+\underset{(n\times1)}{\varepsilon},\qquad(3.2)$$
where
$$Y=\begin{pmatrix}Y_1\\ \vdots\\ Y_n\end{pmatrix},\qquad
X=\begin{pmatrix}X_{11}&\dots&X_{1K}\\ \vdots&\ddots&\vdots\\ X_{n1}&\dots&X_{nK}\end{pmatrix},\qquad
\varepsilon=\begin{pmatrix}\varepsilon_1\\ \vdots\\ \varepsilon_n\end{pmatrix}.$$
As already mentioned above, we focus on the random design case. Moreover, we adopt the classical sampling assumption for analyzing data that was randomly sampled from a large population: we assume that the random variables $(Y_i,X_{i1},\dots,X_{iK})$ are i.i.d. across observations $i=1,\dots,n$.

Assumption 2: Exogeneity
$$E(\varepsilon_i\,|\,X_i)=0,\qquad i=1,\dots,n.$$
This assumption demands that the mean of the random error term $\varepsilon_i$ is zero irrespective of the realizations of $X_i$. Note that together with the random sampling assumption (in Assumption 1) this assumption even implies strict exogeneity, $E(\varepsilon_i\,|\,X)=0$, since we have independence across $i=1,\dots,n$. Under strict exogeneity, the mean of $\varepsilon_i$ is zero irrespective of the realizations of $X_1,\dots,X_n$.

Assumption 3: Rank Condition (no perfect multicollinearity)
$$\operatorname{rank}(X)=K\quad\text{a.s.}$$
This assumption demands that the event of one explanatory variable being linearly dependent on the others occurs with a probability equal to zero. (This is the literal translation of the almost surely (a.s.) concept.) The assumption implies that $n\geq K$.

This assumption is a bit dicey and its violation is one of the classic problems in applied econometrics (keywords: multicollinearity, dummy variable trap, variance inflation). The violation of this assumption harms any economic interpretation, as we cannot disentangle the explanatory variables' individual effects on $Y$. Therefore, this assumption is also often called an identification assumption.

Assumption 4: Error distribution. Depending on the context (i.e., parameter estimation vs. hypothesis testing and small $n$ vs. large $n$) there are different, more or less restrictive assumptions. Some of the most common ones are the following:

i.i.d. Normal: $\varepsilon_i\overset{i.i.d.}{\sim}N(0,\sigma^2)$ for all $i=1,\dots,n$.
i.i.d.: The error terms $\varepsilon_1,\dots,\varepsilon_n$ are i.i.d.
Spherical errors (Gauss-Markov assumptions): The error terms $\varepsilon_1,\dots,\varepsilon_n$ might have different distributions, but their variances and covariances are such that
$$E(\varepsilon_i^2\,|\,X_i)=\sigma^2>0\ \text{ for all }i=1,\dots,n\qquad\text{and}\qquad E(\varepsilon_i\varepsilon_j\,|\,X)=0,\quad i\neq j.$$
Thus, here one assumes that, for a given realization of $X$, the error process is uncorrelated (i.e. $\operatorname{Cov}(\varepsilon_i,\varepsilon_j\,|\,X)=E(\varepsilon_i\varepsilon_j\,|\,X)=0$ for all $i\neq j$) and homoscedastic (i.e. $\operatorname{Var}(\varepsilon_i\,|\,X)=\sigma^2$ for all $i$).
Remember. The above i.i.d. assumptions are not as restrictive as they may seem at first sight. They allow the conditional variances to be heteroscedastic, i.e. $\operatorname{Var}(\varepsilon_i\,|\,X_i)=\sigma_i^2$ with different variance terms $\sigma_i^2$ across $i=1,\dots,n$, where $\sigma_i^2$ may be a function of $X_i$ and/or deterministic quantities.
Technical Note. When we write that $E(\varepsilon_i^2\,|\,X)=\sigma^2$, we implicitly assume that the second moment of $\varepsilon_i$ exists and that it is finite.
3.1.1 Some Implications of the Exogeneity Assumption
(a) If $E(\varepsilon_i\,|\,X_i)=0$ for all $i=1,\dots,n$, then the unconditional mean of the error term is zero, i.e.
$$E(\varepsilon_i)=0,\qquad i=1,\dots,n.$$
Proof: Using the Law of Total Expectations (i.e., $E[E(Z\,|\,X)]=E(Z)$) we can rewrite $E(\varepsilon_i)$ as
$E(\varepsilon_i)=E[E(\varepsilon_i\,|\,X_i)]$ for all $i=1,\dots,n$. But the exogeneity assumption yields
$E[E(\varepsilon_i\,|\,X_i)]=E[0]=0$ for all $i=1,\dots,n$, which completes the proof.
(b) Exogeneity is sometimes also called orthogonality; the reason is the following. Generally, two random variables $X$ and $Y$ are said to be orthogonal if their cross moment is zero, i.e. if $E(XY)=0$.
Under exogeneity, i.e. if $E(\varepsilon_i\,|\,X_i)=0$, the regressors and the error term are orthogonal to each other, i.e.,
$$E(X_{ik}\varepsilon_i)=0\quad\text{for all }i=1,\dots,n\text{ and }k=1,\dots,K.$$
Proof:
$$E(X_{ik}\varepsilon_i)=E\big(E(X_{ik}\varepsilon_i\,|\,X_{ik})\big)\quad\text{(by the Law of Total Expectations)}$$
$$=E\big(X_{ik}\,E(\varepsilon_i\,|\,X_{ik})\big)\quad\text{(by the linearity of conditional expectations)}$$
Now, to show that $E(X_{ik}\varepsilon_i)=0$, we need to show that $E(\varepsilon_i\,|\,X_{ik})=0$, which is done in the following:
Since $X_{ik}$ is an element of $X_i$, a slightly more sophisticated use of the Law of Total Expectations (i.e., $E(Y\,|\,X)=E(E(Y\,|\,X,Z)\,|\,X)$) implies that
$E(\varepsilon_i\,|\,X_{ik})=E\big(E(\varepsilon_i\,|\,X_i)\,|\,X_{ik}\big)$. So, the exogeneity assumption $E(\varepsilon_i\,|\,X_i)=0$ yields
$$E(\varepsilon_i\,|\,X_{ik})=E\big(\underbrace{E(\varepsilon_i\,|\,X_i)}_{=0}\,\big|\,X_{ik}\big)=E(0\,|\,X_{ik})=0.$$
I.e., we have that $E(\varepsilon_i\,|\,X_{ik})=0$, which allows us to conclude that $E(X_{ik}\varepsilon_i)=E\big(X_{ik}E(\varepsilon_i\,|\,X_{ik})\big)=E(X_{ik}\cdot0)=0$.
(c) Because the mean of the error term is zero ($E(\varepsilon_i)=0$ for all $i$; see point (a)), it follows that the orthogonality property ($E(X_{ik}\varepsilon_i)=0$) is equivalent to a zero-correlation property. I.e., if $E(\varepsilon_i\,|\,X_i)=0$, then
$$\operatorname{Cov}(\varepsilon_i,X_{ik})=0\quad\text{for all }i=1,\dots,n\text{ and }k=1,\dots,K.$$
Proof:
$$\operatorname{Cov}(\varepsilon_i,X_{ik})=E(X_{ik}\varepsilon_i)-E(X_{ik})E(\varepsilon_i)\quad\text{(Def. of Cov)}$$
$$=E(X_{ik}\varepsilon_i)\quad\text{(by point (a): }E(\varepsilon_i)=0)$$
$$=0\quad\text{(by the orthogonality result in point (b))}$$
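The sample analogues of these population moments can be inspected directly. The following is a minimal simulation sketch (a hypothetical data generating process, with variable names X2 and eps chosen here for illustration only): under exogeneity, the sample versions of $E(X_{ik}\varepsilon_i)$ and $\operatorname{Cov}(\varepsilon_i,X_{ik})$ should be close to zero.

set.seed(1)
n   <- 10000
X2  <- runif(n, 0, 10)            # a regressor (the intercept regressor X_i1 = 1 is omitted here)
eps <- rnorm(n, mean = 0, sd = 2) # error term with E(eps_i | X_i) = 0
mean(X2 * eps)                    # sample analogue of E(X_i2 * eps_i); close to 0
cov(X2, eps)                      # sample analogue of Cov(eps_i, X_i2); close to 0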
3.2 Deriving the Expression of the OLS Estimator
Similar to Section 2.2.2, we can derive the expression for the OLS estimator $\hat\beta=(\hat\beta_1,\dots,\hat\beta_K)^\top\in\mathbb{R}^K$ as the vector-valued minimizing argument of the sum of squared residuals, $S_n(b)$ with $b\in\mathbb{R}^K$, for a given sample $((Y_1,X_1),\dots,(Y_n,X_n))$. In matrix terms this is
$$S_n(b)=(Y-Xb)^\top(Y-Xb)=Y^\top Y-2Y^\top Xb+b^\top X^\top Xb.$$
To find the minimizing argument
$$\hat\beta=\arg\min_{b\in\mathbb{R}^K}S_n(b)$$
we compute all partial derivatives
$$\underset{(K\times1)}{\frac{\partial S_n(b)}{\partial b}}=-2\,\big(X^\top Y-X^\top Xb\big)$$
and set them equal to zero, which leads to $K$ linear equations (the normal equations) in $K$ unknowns. This system of equations defines the OLS estimates $\hat\beta$ for a given data set:
$$\underset{(K\times1)}{-2\,\big(X^\top Y-X^\top X\hat\beta\big)}=0.$$
From our rank assumption (Assumption 3) it follows that $X^\top X$ is an invertible matrix, which allows us to solve the equation system by
$$\underset{(K\times1)}{\hat\beta}=(X^\top X)^{-1}X^\top Y.$$
The following code computes the estimate $\hat\beta$ for a given realization $(Y,X)$ of the random sample $(Y,X)$.
# Some given data
X_1 <- c(1.9,0.8,1.1,0.1,-0.1,4.4,4.6,1.6,5.5,3.4)
X_2 <- c(66, 62, 64, 61, 63, 70, 68, 62, 68, 66)
Y   <- c(0.7,-1.0,-0.2,-1.2,-0.1,3.4,0.0,0.8,3.7,2.0)
dataset <- cbind.data.frame(X_1,X_2,Y)
## Compute the OLS estimation
my.lm <- lm(Y ~ X_1 + X_2, data = dataset)
## Plot sample regression surface
library("scatterplot3d") # library for 3d plots
plot3d <- scatterplot3d(x = X_1, y = X_2, z = Y,
                        angle=33, scale.y=0.8, pch=16,
                        color ="red", main ="OLS Regression Surface")
plot3d$plane3d(my.lm, lty.box = "solid", col=gray(.5), draw_polygon=TRUE)

[Figure: "OLS Regression Surface", a 3D scatterplot of Y against X_1 and X_2 with the fitted regression plane.]

3.3 Some Quantities of Interest

Predicted values and residuals.
The (OLS) predicted values: $\hat Y_i=X_i^\top\hat\beta$. In matrix notation: $\hat Y=X(X^\top X)^{-1}X^\top Y=P_XY$.
The (OLS) residuals: $\hat\varepsilon_i=Y_i-\hat Y_i$. In matrix notation: $\hat\varepsilon=Y-\hat Y=\big(I_n-X(X^\top X)^{-1}X^\top\big)Y=M_XY$.

Projection matrices. The matrix $P_X=X(X^\top X)^{-1}X^\top$ is the $(n\times n)$ projection matrix that projects any vector (from $\mathbb{R}^n$) into the column space spanned by the column vectors of $X$, and $M_X=I_n-X(X^\top X)^{-1}X^\top=I_n-P_X$ is the associated $(n\times n)$ orthogonal projection matrix that projects any vector (from $\mathbb{R}^n$) into the vector space that is orthogonal to that spanned by $X$. The projection matrices $P_X$ and $M_X$ have some nice properties:
(a) $P_X$ and $M_X$ are symmetric, i.e. $P_X=P_X^\top$ and $M_X=M_X^\top$.
(b) $P_X$ and $M_X$ are idempotent, i.e. $P_XP_X=P_X$ and $M_XM_X=M_X$.
(c) Moreover, $X^\top P_X=X^\top$, $P_XX=X$, $X^\top M_X=0$, $M_XX=0$, and $P_XM_X=0$.
All of these properties follow directly from the definitions of $P_X$ and $M_X$ (check it out). Using these properties one can show that the residual vector $\hat\varepsilon=(\hat\varepsilon_1,\dots,\hat\varepsilon_n)^\top$ is orthogonal to each of the column vectors in $X$, i.e.
$$X^\top\hat\varepsilon=X^\top M_XY\ \text{(by def. of }M_X)\quad\Rightarrow\quad\underset{(K\times n)(n\times1)}{X^\top\hat\varepsilon}=0\cdot Y\ \text{(since }X^\top M_X=0)\quad\Rightarrow\quad\underset{(K\times1)}{X^\top\hat\varepsilon}=0.\qquad(3.3)$$
Note that, in the case with intercept, (3.3) implies that $\sum_{i=1}^n\hat\varepsilon_i=0$. Moreover, equation (3.3) also implies that the residual vector $\hat\varepsilon$ is orthogonal to the predicted values vector, since
$$X^\top\hat\varepsilon=0\quad\Rightarrow\quad\hat\beta^\top X^\top\hat\varepsilon=\hat\beta^\top0\quad\Rightarrow\quad\hat Y^\top\hat\varepsilon=0.$$
Another insight from equation (3.3) is that the vector $\hat\varepsilon$ has to satisfy $K$ linear restrictions ($X^\top\hat\varepsilon=0$), which means it loses $K$ degrees of freedom. Consequently, the vector of residuals $\hat\varepsilon$ has only $n-K$ so-called degrees of freedom. This loss of $K$ degrees of freedom also appears in the definition of the unbiased variance estimator
$$s^2_{UB}=\frac{1}{n-K}\sum_{i=1}^n\hat\varepsilon_i^2.\qquad(3.4)$$

Variance decomposition. A further useful result that can be shown using the properties of $P_X$ and $M_X$ is that $Y^\top Y=\hat Y^\top\hat Y+\hat\varepsilon^\top\hat\varepsilon$, i.e.
$$Y^\top Y=(\hat Y+\hat\varepsilon)^\top(\hat Y+\hat\varepsilon)=(P_XY+M_XY)^\top(P_XY+M_XY)=Y^\top P_X^\top P_XY+Y^\top M_X^\top M_XY+0=\hat Y^\top\hat Y+\hat\varepsilon^\top\hat\varepsilon.\qquad(3.5)$$
Decomposition (3.5) is the basis for the well-known variance decomposition result for OLS regressions. For the OLS regression of the linear model (3.1) with intercept, the total sample variance of the dependent variable $Y_1,\dots,Y_n$ can be decomposed as follows:
$$\underbrace{\frac{1}{n}\sum_{i=1}^n\big(Y_i-\bar Y\big)^2}_{\text{total sample variance}}=\underbrace{\frac{1}{n}\sum_{i=1}^n\big(\hat Y_i-\bar{\hat Y}\big)^2}_{\text{explained sample variance}}+\underbrace{\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^2}_{\text{unexplained sample variance}},\qquad(3.6)$$
where $\bar Y=n^{-1}\sum_{i=1}^nY_i$ and likewise for $\bar{\hat Y}$.

Proof of (3.6): As a consequence of (3.3) we have for regressions with intercept that $\sum_{i=1}^n\hat\varepsilon_i=0$. Hence, from $Y_i=\hat Y_i+\hat\varepsilon_i$ it follows that
$$\frac{1}{n}\sum_{i=1}^nY_i=\frac{1}{n}\sum_{i=1}^n\hat Y_i+\frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i\quad\Rightarrow\quad\bar Y=\bar{\hat Y}+0.\qquad(3.7)$$
From (3.5) we know that $Y^\top Y=\hat Y^\top\hat Y+\hat\varepsilon^\top\hat\varepsilon$, i.e.
$$Y^\top Y-n\bar Y^2=\hat Y^\top\hat Y-n\bar Y^2+\hat\varepsilon^\top\hat\varepsilon=\hat Y^\top\hat Y-n\bar{\hat Y}^2+\hat\varepsilon^\top\hat\varepsilon\quad(\text{by (3.7): }\bar Y=\bar{\hat Y}),$$
$$\sum_{i=1}^n(Y_i-\bar Y)^2=\sum_{i=1}^n(\hat Y_i-\bar{\hat Y})^2+\sum_{i=1}^n\hat\varepsilon_i^2.$$

Coefficient of determination $R^2$.
The larger the proportion of the explained variance, the better is the fit of the model. This motivates the definition of the so-called $R^2$ coefficient of determination:
$$R^2=\frac{\sum_{i=1}^n\big(\hat Y_i-\bar{\hat Y}\big)^2}{\sum_{i=1}^n\big(Y_i-\bar Y\big)^2}=1-\frac{\sum_{i=1}^n\hat\varepsilon_i^2}{\sum_{i=1}^n\big(Y_i-\bar Y\big)^2}.$$
Obviously, we have that $0\leq R^2\leq1$. The closer $R^2$ lies to 1, the better is the fit of the model to the observed data. However, a high/low $R^2$ does not mean a validation/falsification of the estimated model. Any relation (i.e., model assumption) needs a plausible explanation from relevant economic theory. The most often criticized disadvantage of the $R^2$ is that additional regressors (relevant or not) will always increase the $R^2$. Here is an example of the problem.

set.seed(123)
n     <- 100                 # Sample size
X     <- runif(n, 0, 10)     # Relevant X variable
X_ir  <- runif(n, 5, 20)     # Irrelevant X variable
error <- rt(n, df = 10)*10   # True error
Y     <- 1 + 5*X + error     # Y variable
lm1 <- summary(lm(Y~X))      # Correct OLS regression
lm2 <- summary(lm(Y~X+X_ir)) # OLS regression with X_ir
lm1$r.squared < lm2$r.squared
#> [1] TRUE
So, $R^2$ increases here even though X_ir is a completely irrelevant explanatory variable. Because of this, the $R^2$ cannot be used as a criterion for model selection. Possible solutions are given by penalized criteria such as the so-called adjusted $R^2$, defined as
$$\bar R^2=1-\frac{\frac{1}{n-K}\sum_{i=1}^n\hat\varepsilon_i^2}{\frac{1}{n-1}\sum_{i=1}^n\big(Y_i-\bar Y\big)^2}.$$
The adjustment is in terms of the degrees of freedom $n-K$.
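As a small sketch (reusing the simulated objects Y, X and X_ir from the chunk above; K = 3 counts the intercept and the two regressors), the adjusted $R^2$ can be computed by hand and compared with the value reported by summary():

fit <- lm(Y ~ X + X_ir)
n   <- length(Y)
K   <- 3                              # intercept + X + X_ir
RSS <- sum(resid(fit)^2)
TSS <- sum((Y - mean(Y))^2)
adjR2_manual <- 1 - (RSS/(n - K)) / (TSS/(n - 1))
all.equal(adjR2_manual, summary(fit)$adj.r.squared)
#> [1] TRUE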
3.4 Method of Moments Estimator
The method of moments estimator exploits the exogeneity assumption that $E(\varepsilon_i\,|\,X_i)=0$ for all $i=1,\dots,n$ (Assumption 2). Remember that $E(\varepsilon_i\,|\,X_i)=0$ implies that $E(X_{ik}\varepsilon_i)=0$ for all $i=1,\dots,n$ and all $k=1,\dots,K$. The fundamental idea behind method of moments estimation is to use the sample analogues of these population moment restrictions $E(X_{ik}\varepsilon_i)=0$,
$k=1,\dots,K$, for deriving the estimator:
$$\text{K population moment restrictions:}\quad
\left.\begin{aligned}E(\varepsilon_i)&=0\\E(X_{i2}\varepsilon_i)&=0\\&\ \vdots\\E(X_{iK}\varepsilon_i)&=0\end{aligned}\right\}\ \underset{(K\times1)}{E(X_i\varepsilon_i)}=0
\qquad
\text{K sample moment restrictions:}\quad
\left.\begin{aligned}\tfrac{1}{n}\sum_{i=1}^n\hat\varepsilon_i&=0\\\tfrac{1}{n}\sum_{i=1}^nX_{i2}\hat\varepsilon_i&=0\\&\ \vdots\\\tfrac{1}{n}\sum_{i=1}^nX_{iK}\hat\varepsilon_i&=0\end{aligned}\right\}\ \underset{(K\times1)}{\frac{1}{n}\sum_{i=1}^nX_i\hat\varepsilon_i}=0$$
Under our set of assumptions (Assumptions 1-4), the sample means $n^{-1}\sum_{i=1}^nX_i\varepsilon_i$ are consistent estimators of the population means $E(X_i\varepsilon_i)$. The idea is now to find values $\hat\beta_1,\dots,\hat\beta_K$ which lead to residuals $\hat\varepsilon_i=Y_i-\sum_{k=1}^K\hat\beta_kX_{ik}$ that fulfill the above sample moment restrictions. This should in principle be possible since we have a linear system of $K$ equations, $\frac{1}{n}\sum_{i=1}^nX_i\hat\varepsilon_i=0$, in $K$ unknowns $\hat\beta=(\hat\beta_1,\dots,\hat\beta_K)^\top$. Solving the equation system yields
$$\frac{1}{n}\sum_{i=1}^nX_i\hat\varepsilon_i=0
\;\Leftrightarrow\;\frac{1}{n}\sum_{i=1}^nX_i\big(Y_i-X_i^\top\hat\beta\big)=0
\;\Leftrightarrow\;\frac{1}{n}\sum_{i=1}^nX_iY_i-\frac{1}{n}\sum_{i=1}^nX_iX_i^\top\hat\beta=0$$
$$\Leftrightarrow\;\frac{1}{n}\sum_{i=1}^nX_iX_i^\top\hat\beta=\frac{1}{n}\sum_{i=1}^nX_iY_i
\;\Leftrightarrow\;\hat\beta=\Big(\frac{1}{n}\sum_{i=1}^nX_iX_i^\top\Big)^{-1}\frac{1}{n}\sum_{i=1}^nX_iY_i=(X^\top X)^{-1}X^\top Y,$$
which equals the OLS estimator of $\beta$, although we have now used a different approach to derive the estimator.
Once again we see the importance of the exogeneity assumption $E(\varepsilon_i\,|\,X_i)=0$, which we used here as the starting point for the derivation of the method of moments estimator. However, unlike with deriving the OLS estimator as the estimator that minimizes the sum of squared residuals, here we derived the estimator from the exogeneity assumption. The method of moments is a very general method, which usually has good properties. We will return to the method of moments several times throughout the semester.
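As a brief numerical sketch (reusing the objects dataset and my.lm from the code chunk in Section 3.2), solving the sample moment restrictions by hand gives exactly the lm() coefficients:

X_mat <- cbind(1, dataset$X_1, dataset$X_2)     # design matrix with intercept column
y_vec <- dataset$Y
beta_hat_mm <- solve(t(X_mat) %*% X_mat, t(X_mat) %*% y_vec)
cbind(beta_hat_mm, coef(my.lm))                 # both columns agree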
3.5 Unbiasedness of $\hat\beta\,|\,X$

Once again, but now using matrix algebra, we can show that the OLS (or likewise the method of moments) estimator $\hat\beta=(X^\top X)^{-1}X^\top Y$ is unbiased:
$$E[\hat\beta\,|\,X]=E\big[(X^\top X)^{-1}X^\top Y\,|\,X\big]
=E\big[(X^\top X)^{-1}X^\top(X\beta+\varepsilon)\,|\,X\big]
=E\big[(X^\top X)^{-1}X^\top X\beta+(X^\top X)^{-1}X^\top\varepsilon\,|\,X\big]$$
$$=\beta+E\big[(X^\top X)^{-1}X^\top\varepsilon\,|\,X\big]
=\beta+(X^\top X)^{-1}X^\top\underbrace{E[\varepsilon\,|\,X]}_{=0}=\beta,$$
i.e. $\operatorname{Bias}[\hat\beta\,|\,X]=0$.

Note. This result only requires the strict exogeneity assumption $E(\varepsilon\,|\,X)=0$, which follows from our Assumption 2 (i.e. $E(\varepsilon_i\,|\,X_i)=0$ for all $i$) together with Assumption 1 (i.e. $(Y_i,X_i)$ is i.i.d. across $i=1,\dots,n$). In particular, we did not need to assume homoscedasticity (the result also holds for autocorrelated error terms).
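A minimal Monte Carlo sketch of this result (a hypothetical data generating process; the design X_mat is held fixed across replications so the averages approximate the conditional expectation):

set.seed(42)
n     <- 50
X_mat <- cbind(1, runif(n, 0, 10))   # fixed design across replications
beta  <- c(1, 2)
R     <- 2000                        # Monte Carlo replications
est <- replicate(R, {
  eps <- rnorm(n, sd = 3)            # errors with E(eps | X) = 0
  y   <- X_mat %*% beta + eps
  drop(solve(t(X_mat) %*% X_mat, t(X_mat) %*% y))
})
rowMeans(est)                        # close to c(1, 2): no systematic bias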

3.6 Variance of $\hat\beta\,|\,X$

The conditional variance of $\hat\beta$ given $X$ is given by
$$\operatorname{Var}(\hat\beta\,|\,X)=E\big[(\hat\beta-\beta)(\hat\beta-\beta)^\top\,|\,X\big]
=E\Big[\big((X^\top X)^{-1}X^\top\varepsilon\big)\big((X^\top X)^{-1}X^\top\varepsilon\big)^\top\,\Big|\,X\Big]$$
$$=E\big[(X^\top X)^{-1}X^\top\varepsilon\varepsilon^\top X(X^\top X)^{-1}\,|\,X\big]
=(X^\top X)^{-1}X^\top\underset{(n\times n)}{E[\varepsilon\varepsilon^\top\,|\,X]}\,X(X^\top X)^{-1}
=(X^\top X)^{-1}X^\top\operatorname{Var}(\varepsilon\,|\,X)\,X(X^\top X)^{-1},$$
which is $(K\times K)$-dimensional. In the above derivations we used, first, that
$$\hat\beta-\beta=(X^\top X)^{-1}X^\top Y-\beta=(X^\top X)^{-1}X^\top(X\beta+\varepsilon)-\beta=\beta+(X^\top X)^{-1}X^\top\varepsilon-\beta=(X^\top X)^{-1}X^\top\varepsilon$$
and, second, that $E[\varepsilon\varepsilon^\top\,|\,X]=\operatorname{Var}(\varepsilon\,|\,X)$, since the conditional mean $E(\varepsilon\,|\,X)=0$ under our assumptions. The above expression is the general version of $\operatorname{Var}(\hat\beta\,|\,X)$, which can be further simplified using specific assumptions on the distribution of the error term $\varepsilon$:

In case of spherical errors (Gauss-Markov assumptions), i.e. no heteroscedasticity and no autocorrelation, we have that $\operatorname{Var}(\varepsilon\,|\,X)=\sigma^2I_n$, such that
$$\underset{(K\times K)}{\operatorname{Var}(\hat\beta\,|\,X)}=\sigma^2(X^\top X)^{-1}.$$

In case of conditional heteroscedasticity, we have that
$$\operatorname{Var}(\varepsilon\,|\,X)=\begin{pmatrix}\sigma_1^2&0&\dots&0\\0&\sigma_2^2&&\vdots\\\vdots&&\ddots&0\\0&\dots&0&\sigma_n^2\end{pmatrix}=\operatorname{diag}(\sigma_1^2,\dots,\sigma_n^2),$$
where the variances $\sigma_i^2$ may be functions of $X_i$, i.e. $\sigma_i^2=\sigma_i^2(X_i)$. So, under conditional heteroscedasticity we have the following sandwich form
$$\operatorname{Var}(\hat\beta\,|\,X)=(X^\top X)^{-1}X^\top\operatorname{diag}(\sigma_1^2,\dots,\sigma_n^2)\,X(X^\top X)^{-1}.$$
Practical estimation of the standard errors. The diagonal elements of the $(K\times K)$ matrix $\operatorname{Var}(\hat\beta\,|\,X)$ are the variance expressions for the single estimators $\hat\beta_k$, $k=1,\dots,K$,
$$\operatorname{Var}(\hat\beta_k\,|\,X)=\big[(X^\top X)^{-1}X^\top\operatorname{Var}(\varepsilon\,|\,X)X(X^\top X)^{-1}\big]_{kk}\quad\text{(generally)}$$
$$=\sigma^2\big[(X^\top X)^{-1}\big]_{kk},\quad\text{(spherical errors)}$$
where $[A]_{kl}$ denotes the element of $A$ that is in the $k$th row and $l$th column of $A$. Taking square roots leads to the standard errors of the estimators $\hat\beta_k\,|\,X$,
$$\operatorname{SE}(\hat\beta_k\,|\,X)=\Big(\big[(X^\top X)^{-1}X^\top\operatorname{Var}(\varepsilon\,|\,X)X(X^\top X)^{-1}\big]_{kk}\Big)^{1/2}\quad\text{(generally)}$$
$$=\Big(\sigma^2\big[(X^\top X)^{-1}\big]_{kk}\Big)^{1/2}.\quad\text{(spherical errors)}$$
Of course, the above expressions for $\operatorname{Var}(\hat\beta_k\,|\,X)$ and $\operatorname{SE}(\hat\beta_k\,|\,X)$ are generally useless in practice since we typically do not know the $(n\times n)$ variance matrix

$\operatorname{Var}(\varepsilon\,|\,X)$ or $\sigma^2$, but need to estimate them from the data. So, we typically need to work with
$$\widehat{\operatorname{Var}}(\hat\beta_k\,|\,X)=\big[(X^\top X)^{-1}X^\top\widehat{\operatorname{Var}}(\varepsilon\,|\,X)X(X^\top X)^{-1}\big]_{kk}\quad\text{(generally)}$$
$$=\hat\sigma^2\big[(X^\top X)^{-1}\big]_{kk}\quad\text{(spherical errors)}$$
and
$$\widehat{\operatorname{SE}}(\hat\beta_k\,|\,X)=\Big(\big[(X^\top X)^{-1}X^\top\widehat{\operatorname{Var}}(\varepsilon\,|\,X)X(X^\top X)^{-1}\big]_{kk}\Big)^{1/2}\quad\text{(generally)}$$
$$=\Big(\hat\sigma^2\big[(X^\top X)^{-1}\big]_{kk}\Big)^{1/2}.\quad\text{(spherical errors)}$$
For the case of spherical errors, we already know a possible estimator, namely $\hat\sigma^2=s^2_{UB}=(n-K)^{-1}\sum_{i=1}^n\hat\varepsilon_i^2$. However, finding a reasonable estimator $\widehat{\operatorname{Var}}(\varepsilon\,|\,X)=\operatorname{diag}(v_1,\dots,v_n)$ for the general heteroscedastic case is a little more tricky. The econometric literature knows the following heteroscedasticity-consistent (HC) robust approaches:
HC0: $v_i=\hat\varepsilon_i^2$
HC1: $v_i=\dfrac{n}{n-K}\,\hat\varepsilon_i^2$
HC2: $v_i=\dfrac{\hat\varepsilon_i^2}{1-h_i}$
HC3: $v_i=\dfrac{\hat\varepsilon_i^2}{(1-h_i)^2}$  (most often used)
HC4: $v_i=\dfrac{\hat\varepsilon_i^2}{(1-h_i)^{\delta_i}}$
where $h_i=X_i^\top(X^\top X)^{-1}X_i=[P_X]_{ii}$ is the $i$th diagonal element of the projection matrix $P_X$, $\bar h=n^{-1}\sum_{i=1}^nh_i$, and $\delta_i=\min\{4,h_i/\bar h\}$.
Side Note. The statistic $1/n\leq h_i\leq1$ is called the leverage of $X_i$, where (by construction) the average leverage is $n^{-1}\sum_{i=1}^nh_i=K/n$. Observations $X_i$ with leverage statistics $h_i$ that greatly exceed the average leverage value $K/n$ are referred to as high-leverage observations. High-leverage observations potentially have a large influence on the estimation result. Typically, high-leverage observations $X_i$ distort the estimation results if the corresponding value of the error term $\varepsilon_i$ is unusually large (outlier).
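In R, the leverage values $h_i$ of a fitted lm object can be obtained with hatvalues(); the following is a minimal sketch reusing my.lm from Section 3.2 (where n = 10 and K = 3), with the "greater than twice the average leverage" cut-off used here only as an illustrative rule of thumb:

h <- hatvalues(my.lm)      # leverage values h_i = [P_X]_ii
range(h)                   # each h_i lies between 1/n and 1
mean(h)                    # average leverage = K/n = 3/10
which(h > 2 * mean(h))     # simple rule-of-thumb flag for high-leverage observations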
The estimator HC0 was suggested in the econometrics literature by White (1980) and is justified by asymptotic arguments. The estimators HC1, HC2 and HC3 were suggested by MacKinnon and White (1985) to improve the performance in small samples. A more extensive study of small sample behavior was carried out by Long and Ervin (2000), who arrive at the conclusion that HC3 provides the best performance in small samples, as it inflates the squared residuals $\hat\varepsilon_i^2$, which is thought to adjust for the over-influence of observations with large leverage values $h_i$. Cribari-Neto (2004) suggested the estimator HC4 to further improve small sample performance, especially in
the presence of influential observations (large hi values).
The following R code shows how to compute HC robust variance/standard
error estimators:
set.seed(2)
n <- 100
K <- 3
X <- matrix(runif(n*(K-1), 2, 10), n, K-1)
X <- cbind(1,X)
beta <- c(1,5,5)
# heteroscedastic errors:
sigma <- abs(X[,2] + X[,3])^1.5
error <- rnorm(n, mean = 0, sd=sigma)
Y <- beta[1]*X[,1] + beta[2]*X[,2] + beta[3]*X[,3] + error
##
lm_fit <- lm(Y ~ X - 1)
## Caution! By default R computes the standard errors
## assuming homoscedastic errors. This can lead to
## false inferences under heteroscedastic errors.
summary(lm_fit)$coefficients
#>     Estimate Std. Error    t value    Pr(>|t|)
#> X1 -3.818392  17.797787 -0.2145431 0.830573969
#> X2  5.474779   2.084915  2.6259006 0.010043024
#> X3  5.566453   2.011848  2.7668360 0.006778811

library(sandwich) # HC robust variance estimation
library(lmtest)
## Robust estimation of the variance of hat{beta}:
Var_beta_hat_robust <- sandwich::vcovHC(lm_fit, type="HC3")
Var_beta_hat_robust
#>           X1         X2         X3
#> X1 389.83293 -39.198061 -38.026861
#> X2 -39.19806   5.763624   2.260672
#> X3 -38.02686   2.260672   5.437287
## Corresponding regression-output:
lmtest::coeftest(lm_fit, vcov = Var_beta_hat_robust)
#>
#> t test of coefficients:
#>
#>    Estimate Std. Error t value Pr(>|t|)
#> X1  -3.8184    19.7442 -0.1934  0.84706
#> X2   5.4748     2.4008  2.2804  0.02477 *
#> X3   5.5665     2.3318  2.3872  0.01892 *
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observe that the HC robust variance estimation leads to larger variances than the classic variance estimation for homoscedastic errors. This is typically, but not always, the case.
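As a short sanity check, the HC3 sandwich expression stated above can be computed by hand and compared against sandwich::vcovHC(); a minimal sketch reusing lm_fit and X from the chunk above (the two results agree up to numerical tolerance):

e       <- resid(lm_fit)
h       <- hatvalues(lm_fit)
XtX_inv <- solve(t(X) %*% X)
V_HC3   <- XtX_inv %*% (t(X) %*% diag(e^2 / (1 - h)^2) %*% X) %*% XtX_inv
all.equal(unname(V_HC3),
          unname(sandwich::vcovHC(lm_fit, type = "HC3")))
#> [1] TRUE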
3.7 The Gauss-Markov Theorem
The Gauss-Markov Theorem. Let's assume Assumptions 1-4 hold with spherical errors, i.e., with $E(\varepsilon\varepsilon^\top\,|\,X)=\sigma^2I_n$. Then the Gauss-Markov theorem states that of all linear and unbiased estimators, the OLS (or method of moments) estimator $\hat\beta=(X^\top X)^{-1}X^\top Y$ will have the smallest variance, in a matrix sense. That is, for any alternative linear and unbiased estimator $\tilde\beta$ we have that
$$\operatorname{Var}(\tilde\beta\,|\,X)\geq\operatorname{Var}(\hat\beta\,|\,X)\quad\text{(in the matrix sense)},\qquad\text{i.e.}\qquad\operatorname{Var}(\tilde\beta\,|\,X)-\operatorname{Var}(\hat\beta\,|\,X)=D,$$
where $D$ is a positive semidefinite $(K\times K)$ matrix, i.e., $a^\top Da\geq0$ for any $K$-dimensional vector $a\in\mathbb{R}^K$. Observe that this implies that $\operatorname{Var}(\tilde\beta_k\,|\,X)\geq\operatorname{Var}(\hat\beta_k\,|\,X)$ for any $k=1,\dots,K$.
Proof of the Gauss-Markov Theorem. Since $\tilde\beta$ is assumed to be linear in $Y$, we can write
$$\tilde\beta=CY,$$
where $C$ is some $(K\times n)$ matrix, which is a function of $X$ and/or nonrandom components. Adding a $(K\times n)$ zero matrix yields
$$\tilde\beta=\Big(C\underbrace{-(X^\top X)^{-1}X^\top+(X^\top X)^{-1}X^\top}_{=0}\Big)Y.$$
Let now $D=C-(X^\top X)^{-1}X^\top$; then
$$\tilde\beta=\big(D+(X^\top X)^{-1}X^\top\big)Y=DY+(X^\top X)^{-1}X^\top Y=D(X\beta+\varepsilon)+(X^\top X)^{-1}X^\top Y=DX\beta+D\varepsilon+\hat\beta,\qquad(3.8)$$
$$E(\tilde\beta\,|\,X)=E(DX\beta\,|\,X)+E(D\varepsilon\,|\,X)+E(\hat\beta\,|\,X)=DX\beta+0+\beta.\qquad(3.9)$$
Since $\tilde\beta$ is (by assumption) unbiased, we have that $E(\tilde\beta\,|\,X)=\beta$. Therefore, (3.9) implies that $DX=0_{(K\times K)}$, since we must have that $E(\tilde\beta\,|\,X)=DX\beta+0+\beta=\beta$. Plugging $DX=0$ into (3.8) yields
$$\tilde\beta=D\varepsilon+\hat\beta\quad\Rightarrow\quad\tilde\beta-\beta=D\varepsilon+(\hat\beta-\beta)=D\varepsilon+(X^\top X)^{-1}X^\top\varepsilon=\big(D+(X^\top X)^{-1}X^\top\big)\varepsilon,\qquad(3.10)$$
where we used that
$$\hat\beta-\beta=(X^\top X)^{-1}X^\top Y-\beta=(X^\top X)^{-1}X^\top(X\beta+\varepsilon)-\beta=(X^\top X)^{-1}X^\top\varepsilon.$$

So,
$$\operatorname{Var}(\tilde\beta\,|\,X)=\operatorname{Var}(\tilde\beta-\beta\,|\,X)\quad(\text{since }\beta\text{ is not random})$$
$$=\operatorname{Var}\big((D+(X^\top X)^{-1}X^\top)\varepsilon\,|\,X\big)\quad(\text{from equation (3.10)})$$
$$=(D+(X^\top X)^{-1}X^\top)\operatorname{Var}(\varepsilon\,|\,X)(D^\top+X(X^\top X)^{-1})$$
$$=\sigma^2(D+(X^\top X)^{-1}X^\top)I_n(D^\top+X(X^\top X)^{-1})$$
$$=\sigma^2\big(DD^\top+(X^\top X)^{-1}\big)\quad(\text{using that }DX=0)$$
$$\geq\sigma^2(X^\top X)^{-1}\quad(\text{since }DD^\top\text{ is pos. semidef.})$$
$$=\operatorname{Var}(\hat\beta\,|\,X).$$
Showing that $DD^\top$ is really positive semidefinite: $a^\top DD^\top a=(D^\top a)^\top(D^\top a)=\tilde a^\top\tilde a\geq0$, where $a$ is a $K$-dimensional column vector and $\tilde a=D^\top a$.
Note. To reiterate: The unbiasedness of $\hat\beta$ did not depend on any assumptions about the distribution of $\varepsilon$, except that $E(\varepsilon\,|\,X)=0$, which follows from our Assumption 2 together with the i.i.d. assumption in Assumption 1. Once we additionally impose the assumption of spherical errors, $E(\varepsilon\varepsilon^\top\,|\,X)=\sigma^2I_n$, we can show that $\hat\beta$ has the smallest variance of all linear unbiased estimators.
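A minimal simulation sketch of the theorem (a hypothetical data generating process): an alternative linear unbiased estimator, here simply OLS computed from only the first half of the sample, is also unbiased but has a larger variance than OLS based on the full sample.

set.seed(3)
n     <- 100
X_mat <- cbind(1, runif(n, 0, 10))   # design held fixed across replications
beta  <- c(1, 2)
R     <- 2000
est <- replicate(R, {
  y <- X_mat %*% beta + rnorm(n, sd = 3)   # spherical errors
  c(ols  = drop(solve(t(X_mat) %*% X_mat, t(X_mat) %*% y))[2],
    half = drop(solve(t(X_mat[1:50, ]) %*% X_mat[1:50, ],
                      t(X_mat[1:50, ]) %*% y[1:50]))[2])
})
rowMeans(est)       # both estimators are approximately unbiased for beta_2 = 2
apply(est, 1, var)  # but the half-sample estimator has the larger variance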

Chapter 4
Small Sample Inference
Note on small sample sizes. Sometimes cases with $n<30$ are referred to as small sample cases, since one often hopes that the central limit theorem is helping out for cases with $n\geq30$. However, this is a very dangerous rule of thumb! A sufficiently large $n$ that allows us to rely on the central limit theorem depends on many different distributional aspects. In this chapter, we consider the case where we cannot rely on the central limit theorem.

Exact inference. This chapter considers exact inference using the multiple linear regression model. By exact we mean correct distributions for each sample size $n$. That is, no asymptotic (large $n\to\infty$) arguments will be used.

Assumptions. Recall that we, in general, did not impose a complete distributional assumption on $\varepsilon$ in Assumption 4; the i.i.d. normal case in Assumption 4 was only one possible option. However, to do exact inference, the normality assumption on the error terms is not a mere option, but a necessity. So for this chapter we assume that Assumptions 1-3 from Chapter 3 hold and that additionally the following assumption holds:

Assumption 4': Error distribution. For small sample cases, we assume that the error terms are i.i.d. normal conditionally on $X_i$, i.e., $\varepsilon_i\,|\,X_i\overset{i.i.d.}{\sim}N(0,\sigma^2)$ for all $i=1,\dots,n$, with spherical errors conditionally on $X$. That is,
$$\varepsilon\,|\,X\sim N\big(0,\sigma^2I_n\big),$$
where $\varepsilon=(\varepsilon_1,\dots,\varepsilon_n)^\top$.

Normality of $\hat\beta$. Under Assumptions 1-4' it can be shown that
$$\hat\beta_n\,|\,X\sim N\Big(\beta,\ \operatorname{Var}(\hat\beta_n\,|\,X)\Big),\qquad(4.1)$$
where $\operatorname{Var}(\hat\beta_n\,|\,X)=\sigma^2(X^\top X)^{-1}$.
This result follows from noting that $\hat\beta_n=(X^\top X)^{-1}X^\top Y=\beta+(X^\top X)^{-1}X^\top\varepsilon$, and because $(X^\top X)^{-1}X^\top\varepsilon$ is just a linear combination of the normally distributed error terms $\varepsilon$, it is again normally distributed, conditionally on $X$. Note that (4.1) is the exact small sample distribution of $\hat\beta_n\,|\,X$ and fully relies on the conditional normality assumption, Assumption 4'.

Note. The subscript $n$ in $\hat\beta_n$ is here only to emphasize that the distribution of $\hat\beta_n$ depends on $n$; we will, however, often simply write $\hat\beta$.

4.1 Hypothesis Tests about Multiple Parameters

Let us consider the following system of $q$-many null hypotheses:
$$H_0:\ \underset{(q\times K)}{R}\,\underset{(K\times1)}{\beta}-\underset{(q\times1)}{r}=\underset{(q\times1)}{0},$$
where the $(q\times K)$ matrix $R$ and the $q$-vector $r=(r_1,\dots,r_q)^\top$ are chosen by the statistician to specify her/his null hypothesis about the unknown true parameter vector $\beta$. To make sure that there are no redundant equations, it is required that $\operatorname{rank}(R)=q$.

We must also specify the alternative against which we are testing the null hypothesis, for instance
$$H_A:\ R\beta-r\neq0.$$
The above multiple parameter hypotheses also cover the special case of single parameter hypotheses; for instance, by setting $R=(0,1,0,\dots,0)$ and $r=0$ one gets
$$H_0:\ \beta_2=0\qquad H_A:\ \beta_2\neq0.$$
Under our assumptions (Assumptions 1 to 4'), we have that
$$(R\hat\beta_n-r)\,|\,X\sim N\Big(R\beta-r,\ R\operatorname{Var}(\hat\beta_n\,|\,X)R^\top\Big).$$
So, the realizations of $(R\hat\beta_n-r)\,|\,X$ will scatter around the unknown $(R\beta-r)$ in a non-systematic, Gaussian fashion. Therefore, if the null hypothesis is correct (i.e., $R\beta-r=0$), the realizations of $(R\hat\beta_n-r)\,|\,X$ scatter around the $q$-vector $0$. If, however, the alternative hypothesis is correct (i.e., $R\beta-r=a\neq0$), the realizations of $(R\hat\beta_n-r)\,|\,X$ scatter around the $q$-vector $a\neq0$. So, under the alternative hypothesis, there will be a systematic location shift of the $q$-dimensional random variable $R\hat\beta_n\,|\,X$ away from $r$, which we try to detect using statistical hypothesis testing.

4.1.1 The Test Statistic and its Null Distribution

The fact that $(R\hat\beta_n-r)\in\mathbb{R}^q$ is a $q$-dimensional random variable makes it a little bothersome to use as a test statistic.
Fortunately, we can turn $(R\hat\beta_n-r)$ into a scalar-valued test statistic using the following quadratic form:
$$W=\underset{(1\times q)}{(R\hat\beta_n-r)^\top}\,\underset{(q\times q)}{\big[R\operatorname{Var}(\hat\beta_n\,|\,X)R^\top\big]^{-1}}\,\underset{(q\times1)}{(R\hat\beta_n-r)}.$$
Note that the test statistic $W$ is simply measuring the distance (it is a weighted $L^2$-distance) between the two $q$-vectors $R\hat\beta_n$ and $r$. Moreover, under the null hypothesis (i.e., if the null hypothesis is true), $W$ is just a sum of $q$-many independent squared standard normal random variables. Therefore, under the null hypothesis, $W$ is $\chi^2(q)$ distributed with $q$ degrees of freedom (see Section 1.2.10.3),
$$W=(R\hat\beta_n-r)^\top\big[R\operatorname{Var}(\hat\beta_n\,|\,X)R^\top\big]^{-1}(R\hat\beta_n-r)\ \overset{H_0}{\sim}\ \chi^2(q).$$
Usually, we do not know $\operatorname{Var}(\hat\beta_n\,|\,X)$ and, therefore, have to estimate this quantity. Unfortunately, the general heteroscedasticity-consistent robust estimators $\widehat{\operatorname{Var}}_{HC0}(\hat\beta_n\,|\,X),\dots,\widehat{\operatorname{Var}}_{HC4}(\hat\beta_n\,|\,X)$ from Chapter 3 are only asymptotically justified and, therefore, inappropriate for exact small sample inference.

For truly exact finite sample inference, we need a variance estimator for which we can derive the exact small sample distribution. Therefore, we assume in Assumption 4' spherical errors (i.e., $\operatorname{Var}(\varepsilon\,|\,X)=\sigma^2I_n$), which yields that $\operatorname{Var}(\hat\beta_n\,|\,X)=\sigma^2(X^\top X)^{-1}$, and where $\sigma^2$ can be estimated by $s^2_{UB}=(n-K)^{-1}\sum_{i=1}^n\hat\varepsilon_i^2$. From the normality assumption in Assumption 4' it then follows that $\sigma^{-2}(n-K)s^2_{UB}\sim\chi^2(n-K)$. This then leads to the following exact null distribution of
$$F=(R\hat\beta_n-r)^\top\big[R\big(s^2_{UB}(X^\top X)^{-1}\big)R^\top\big]^{-1}(R\hat\beta_n-r)/q\ \overset{H_0}{\sim}\ F(q,n-K),\qquad(4.2)$$
where $F(q,n-K)$ denotes the F-distribution with $q$ numerator and $n-K$ denominator degrees of freedom (see, for instance, Hayashi, 2000, Ch. 1). By contrast to $W$, $F$ is a practically useful test statistic, and we can use observed values $F_{obs}$ to measure the distance of our estimate $R\hat\beta_n$ from the value $r$. Observed values $F_{obs}$ that are unusually large under the null hypothesis lead to a rejection of the null hypothesis. The null distribution $F(q,n-K)$ of $F$ is used to judge what is unusually large under the null hypothesis.

The F distribution. The F distribution is a ratio of two $\chi^2$ distributions. It has two parameters: the numerator degrees of freedom and the denominator degrees of freedom. Each combination of the parameters yields a different F distribution. See Section 1.2.10.6 for more information on the F statistic.

[Figure: densities of the F distributions F(3,5), F(3,15), F(3,500) (left panel) and F(1,30), F(3,30), F(15,30) (right panel).]

4.2 Tests about One Parameter

For testing a hypothesis about only one parameter $\beta_k$, with $k=1,\dots,K$,
$$H_0:\ \beta_k=r\qquad H_A:\ \beta_k\neq r,$$
the $(q\times K)=(1\times K)$ matrix $R$ equals a row vector of zeros but with a one as the $k$th element (e.g., for $k=2$, $R=(0,1,0,\dots,0)$), such that $F$ in (4.2) simplifies to
$$\left(\frac{\hat\beta_k-r}{\sqrt{\widehat{\operatorname{Var}}(\hat\beta_k\,|\,X)}}\right)^2\ \overset{H_0}{\sim}\ F(1,n-K),$$
where $\widehat{\operatorname{Var}}(\hat\beta_k\,|\,X)=R\big(s^2_{UB}(X^\top X)^{-1}\big)R^\top$. Taking square roots yields
$$t=\frac{\hat\beta_k-r}{\widehat{\operatorname{SE}}(\hat\beta_k\,|\,X)}\ \overset{H_0}{\sim}\ t_{(n-K)}.$$
Thus the t-distribution with $n-K$ degrees of freedom is the appropriate reference metric for judging how far away our estimates are from the hypothetical parameter value $r$ under the null hypothesis.

Note. All commonly used statistical software packages report t-tests testing the null hypothesis $H_0:\beta_k=0$, i.e., with $r=0$. This means to test the null hypothesis that $X_k$ has no (linear) effect on $Y$.

The following plot illustrates that as the degrees of freedom increase, the shape of the t distribution comes closer to that of a standard normal bell curve. Already for 25 degrees of freedom we find little difference to the standard normal density.
In case of small degrees of freedom values, we find the t distribution to have heavier tails than a standard normal.

[Figure: densities of the standard normal distribution N(0,1) and of t distributions with small and moderate degrees of freedom (e.g., t_2 and t_25).]

4.3 Test Theory

4.3.1 Significance Level

To actually test the null hypothesis (e.g., $H_0:R\beta-r=0$ or $H_0:\beta_k=0$), we need to have a decision rule on when we will reject and not reject the null hypothesis. This amounts to deciding on a probability with which we are comfortable rejecting the null hypothesis when it is in fact true (Type I error or $\alpha$ error). The probability of such a Type I error shall be bounded from above by a (small) significance level $\alpha$, that is
$$P(\text{reject }H_0\,|\,H_0\text{ is true})=P(\text{Type I Error})\leq\alpha.$$
For a given significance level $\alpha$ (e.g., $\alpha=0.05$) and a given alternative hypothesis, we can divide the range of all possible values of the test statistic (i.e., $\mathbb{R}$, since both $t\in\mathbb{R}$ and $F\in\mathbb{R}$) into a rejection region and a non-rejection region by using certain quantiles, called critical values, of the test statistic's distribution under the null. We can do this because the test statistics $t$ and $F$ have known distributions under the null hypothesis ($t\overset{H_0}{\sim}t_{n-K}$ and $F\overset{H_0}{\sim}F(q,n-K)$); indeed, under Assumption 4', we know the exact null distributions for every sample size $n$. Having decided on the rejection and non-rejection regions, it is a simple matter of seeing where the observed (obs) sample values $t_{obs}$ or $F_{obs}$ of the statistics $t$ or $F$ are: either in the rejection or in the non-rejection region.

Non-conservative versus conservative tests. Since the test statistics $F$ and $t$ are continuous random variables of which we know the exact distributions (under Assumptions 1-4'), we can find critical values such that
$$P(\text{Type I Error})=\alpha.$$
We call such tests non-conservative, since the probability of a Type I error equals the significance level $\alpha$. Test statistics with $P(\text{Type I Error})<\alpha$ are called conservative test statistics; they lead to valid inferences, but will detect a violation of the null hypothesis less often than a non-conservative test. A test statistic with $P(\text{Type I Error})>\alpha$ leads to invalid inferences!
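A quick numerical check of the non-conservative property in R (the degrees of freedom below are hypothetical and chosen only for illustration): choosing the critical value as the $(1-\alpha)$ quantile of the null distribution makes the rejection probability under $H_0$ equal to $\alpha$ exactly.

alpha <- 0.05
df1 <- 3; df2 <- 30                        # hypothetical degrees of freedom
c_alpha <- qf(1 - alpha, df1 = df1, df2 = df2)
1 - pf(c_alpha, df1 = df1, df2 = df2)      # P_H0(F > c_alpha); equals alpha
#> [1] 0.05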
4.3.2 Critical Value for the F-Test
The critical value $c_\alpha>0$ defines the rejection region, $]c_\alpha,\infty[$, and non-rejection region, $]0,c_\alpha]$, which divide the test-statistic space (here $\mathbb{R}_+$, since $F\in\mathbb{R}_+$) for a given significance level $\alpha$, such that
$$P(\text{Type I Error})=P_{H_0}\big(F\in\,]c_\alpha,\infty[\,\big)=\alpha,$$
where $c_\alpha$ is the $(1-\alpha)$ quantile of the F-distribution with $(q,n-K)$ degrees of freedom, and where $P_{H_0}$ means that we compute the probability under the assumption that $H_0$ is true.

[Figure: density of the F(9,120) distribution; the non-rejection region covers the lower $(1-\alpha)=95\%$ of the distribution and the rejection region the upper $\alpha=5\%$, separated by the critical value $c_\alpha=1.9588$.]
The rejection region. The rejection region describes a range of values of the test statistic $F$ which we rarely see if the null hypothesis is true (only in at most $\alpha\cdot100\%$ of cases). If the observed value of the test statistic, $F_{obs}$, falls in this region, we will reject the null hypothesis, and hereby accept Type I errors in at most $\alpha\cdot100\%$ of cases.

The non-rejection region. The non-rejection region describes a range of values of the test statistic $F$ which we expect to see (in $(1-\alpha)\cdot100\%$ of cases) if the null hypothesis is true. If the observed value of the test statistic, $F_{obs}$, falls in this region, we will not reject the null hypothesis.

Caution: Not rejecting the null hypothesis does not mean that we can conclude that the null hypothesis is true. We only had no sufficiently strong evidence against the null hypothesis. A violation of the null hypothesis, for instance $R\beta-r=a\neq0$, may simply be too small (too small a value $\lVert a\rVert$) to stand out from the estimation errors (measured by the standard error) in $R\hat\beta$.
Reading the F-Table. Fortunately, you do not need to read old-school distribution tables to find the critical value $c_\alpha$; you can simply use R:

df1   <- 9    # numerator df
df2   <- 120  # denominator df
alpha <- 0.05 # significance level
## Critical value c_alpha (= (1-alpha) quantile):
c_alpha <- qf(p = 1-alpha, df1 = df1, df2 = df2)
c_alpha
#> [1] 1.958763

Changing the significance level from $\alpha=0.05$ to $\alpha=0.01$ makes the critical value $c_\alpha$ larger and, therefore, the rejection region smaller (fewer Type I errors):

alpha <- 0.01
## Critical value c_alpha (= (1-alpha) quantile):
c_alpha <- qf(p = 1-alpha, df1 = df1, df2 = df2)
c_alpha
#> [1] 2.558574
4.3.3 Critical Value(s) for the t-Test
In case of the t-test, we need to differentiate between two-sided and one-sided testing.

Two-Sided t-Test. Two-sided hypothesis:
$$H_0:\ \beta_k=r\qquad H_A:\ \beta_k\neq r$$
In case of a two-sided t-test, we reject the null hypothesis if the observed realization of the t-statistic, $t_{obs}$, is far away from zero, i.e. if $\hat\beta_k$ is either sufficiently smaller or sufficiently greater than $r$. The corresponding two-sided critical values are denoted by $-c_{\alpha/2}<0$ and $c_{\alpha/2}>0$, where $c_{\alpha/2}$ is the $(1-\alpha/2)$ quantile of the t-distribution with $n-K$ degrees of freedom. These critical values define the following rejection and non-rejection regions:
rejection region: $\ ]-\infty,-c_{\alpha/2}[\ \cup\ ]c_{\alpha/2},\infty[$
non-rejection region: $[-c_{\alpha/2},\,c_{\alpha/2}]$.
For this rejection region it holds true that
$$P(\text{Type I Error})=P_{H_0}\big(t\in\,]-\infty,-c_{\alpha/2}[\ \cup\ ]c_{\alpha/2},\infty[\,\big)=\alpha.$$

[Figure: density of the t_12 distribution; the central 95% forms the non-rejection region, and the two tails beyond the critical values $-c_{\alpha/2}=-2.18$ and $c_{\alpha/2}=2.18$ (together 5%) form the rejection region.]
One-Sided t-Test. One-sided hypothesis:
$$H_0:\ \beta_k=r\qquad H_A:\ \beta_k>r\quad(\text{or }H_A:\ \beta_k<r)$$
In case of a one-sided t-test, we will reject the null if $t_{obs}$ is sufficiently far away from zero in the relevant direction of $H_A$. The corresponding critical value is either $-c_\alpha$ (for $H_A:\beta_k<r$) or $c_\alpha$ (for $H_A:\beta_k>r$), where $c_\alpha$ is the $(1-\alpha)$ quantile of the t-distribution with $n-K$ degrees of freedom. The critical value $c_\alpha$ defines the following rejection and non-rejection regions:

For $H_0:\beta_k=0$ versus $H_A:\beta_k<0$:
rejection region: $\ ]-\infty,-c_\alpha[$
non-rejection region: $[-c_\alpha,\infty[$
such that $P(\text{Type I Error})=P_{H_0}\big(t\in\,]-\infty,-c_\alpha[\,\big)=\alpha$.

For $H_0:\beta_k=0$ versus $H_A:\beta_k>0$:
rejection region: $\ ]c_\alpha,\infty[$
non-rejection region: $]-\infty,c_\alpha]$
such that $P(\text{Type I Error})=P_{H_0}\big(t\in\,]c_\alpha,\infty[\,\big)=\alpha$.

[Figure: densities of the t_12 distribution with the one-sided 5% rejection region in the left tail (critical value $-c_\alpha=-1.78$) and, respectively, in the right tail (critical value $c_\alpha=1.78$).]
Reading the t-Table. Fortunately, you do not need to read old-school distribution tables to find the critical values $c_{\alpha/2}$ or $c_\alpha$; you can simply use R:

df    <- 16   # degrees of freedom
alpha <- 0.05 # significance level
## One-sided critical value (= (1-alpha) quantile):
c_oneSided <- qt(p = 1-alpha, df = df)
c_oneSided
#> [1] 1.745884
## Two-sided critical value (= (1-alpha/2) quantile):
c_twoSided <- qt(p = 1-alpha/2, df = df)
## lower critical value
-c_twoSided
#> [1] -2.119905
## upper critical value
c_twoSided
#> [1] 2.119905
4.4 Type II Error and Power
A Type II error is the mistake of not rejecting the null hypothesis when in fact it should have been rejected. The probability of making a Type II error equals one minus the probability of correctly rejecting the null hypothesis (the power). For instance, in the case of using the t-test to test the null hypothesis $H_0:\beta_k=0$ versus the one-sided alternative hypothesis $H_A:\beta_k>0$, we have that
$$P(\text{Type II Error})=P_{H_A}\Big(t\in\underbrace{\,]-\infty,c_\alpha]\,}_{\text{non-rejection region}}\Big)
=1-\underbrace{P_{H_A}\Big(t\in\underbrace{\,]c_\alpha,\infty[\,}_{\text{rejection region}}\Big)}_{\text{Power}},$$

where $P_{H_A}$ means that we compute the probability under the assumption that $H_A$ is true.

There is a trade-off between the probability of making a Type I error and the probability of making a Type II error: a lower significance level $\alpha$ decreases $P(\text{Type I Error})$, but necessarily increases $P(\text{Type II Error})$, and vice versa. Ideally, we would have some sense of the costs of making each of these errors and would choose our significance level to minimize these total costs. However, the costs are often difficult to know. Moreover, the probability of making a Type II error is usually impossible to compute, since we usually do not know the true distribution of $\hat\beta_k$ under the alternative hypothesis.
For illustration purposes, however, consider the case of a t-test for a one-sided hypothesis
$$H_0:\ \beta_k=0\qquad H_A:\ \beta_k>0,$$
where the true (usually unknown) parameter value is $\beta_k=3$ and where the true (usually also unknown) standard error is $\operatorname{SE}(\hat\beta_k\,|\,X)=\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}=1.5$. The advantage here is that we can derive the distribution of the t-test statistic even under the alternative hypothesis. Note that the distribution of the t-test statistic becomes here a standard normal distribution, since we assume $\operatorname{SE}(\hat\beta_k\,|\,X)=\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}=1.5$ to be a known (deterministic) quantity. (This completely unrealistic assumption is only used for illustrative purposes!)

Distribution under the null hypothesis (i.e., if $\beta_k=0$ were true):
$$t=\frac{\hat\beta_k}{\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}}\ \overset{H_0}{\sim}\ N(0,1)$$
Distribution under the alternative hypothesis for the true parameter value (i.e., for the actual $\beta_k=3$):
$$t=\frac{\hat\beta_k-3+3}{\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}}
=\underbrace{\frac{\hat\beta_k-3}{\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}}}_{\sim N(0,1)}+\underbrace{\frac{3}{\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}}}_{=\delta\ \text{(mean shift)}}
\ \overset{H_A}{\sim}\ N(\delta,1)$$
So, the mean shift $\delta$ depends on the value of $\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}$ and the difference between the actual parameter value ($\beta_k=3$) and the hypothetical parameter value under the null hypothesis (here 0). The following graphic illustrates the probability of a Type II error and the power for the case where $\sqrt{\sigma^2[(X^\top X)^{-1}]_{kk}}=1.5$, such that $\delta=3/1.5=2$.
[Figure: null distribution N(0,1) and alternative distribution N(2,1); the critical value $c_\alpha=1.64$ separates $P(\text{Type II Error})$ (area under N(2,1) to the left of $c_\alpha$) from the power (area under N(2,1) to the right of $c_\alpha$).]
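A small numerical sketch of the graphic above: with the (here assumed known) mean shift $\delta=2$ and the one-sided critical value $c_\alpha=$ qnorm(0.95), power and Type II error probability follow directly from the N(2,1) distribution.

delta   <- 3 / 1.5            # mean shift under H_A (see above)
c_alpha <- qnorm(0.95)        # one-sided 5% critical value, approx. 1.645
power   <- 1 - pnorm(c_alpha, mean = delta, sd = 1)
power                         # approx. 0.64
1 - power                     # P(Type II Error), approx. 0.36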

4.5 p-Value
The p-value of a test statistic is the significance level we would obtain if we took the sample value of the observed test statistic, $F_{obs}$ or $t_{obs}$, as the border between the rejection and non-rejection regions.

F-test: $p=P_{H_0}(F\geq F_{obs})$
t-test ($H_A:\beta_k\neq r$): $p=2\min\{P_{H_0}(t\leq t_{obs}),\,P_{H_0}(t\geq t_{obs})\}$
t-test ($H_A:\beta_k>r$): $p=P_{H_0}(t\geq t_{obs})$
t-test ($H_A:\beta_k<r$): $p=P_{H_0}(t\leq t_{obs})$

4.7 Practice: Small Sample Inference

#> 1 99.87
#> 2 96.31
#> 3 83.14
round(head(test_data_new, 3), 2) # New Y, conditionally on X
#>        Y X_1  X_2   X_3
#> 1  97.98   1 6.94 18.49
#> 2 100.03   1 6.55 19.06
#> 3  75.38   1 3.23 16.51
## compare
round(head(test_data, 3), 2)
#>   Y X_1  X_2   X_3
#> 1     1 6.94 18.49
#> 2     1 6.55 19.06
#> 3     1 3.23 16.51
4.7.1 Normally Distributed $\hat\beta\,|\,X$

The above data generating process fulfills our regularity assumptions, Assumptions 1-4'. So, by theory, the estimators $\hat\beta_k\,|\,X$ should be normally distributed conditionally on $X$,
$$\hat\beta_k\,|\,X\sim N\big(\beta_k,\ \sigma^2[(X^\top X)^{-1}]_{kk}\big),$$
where $[(X^\top X)^{-1}]_{kk}$ denotes the element in the $k$th row and $k$th column of the matrix $(X^\top X)^{-1}$. Let's check the distribution by means of a Monte Carlo simulation for the case of $\hat\beta_2\,|\,X$ with a small sample size of $n=10$.
set.seed(123)
n         <- 10       # a small sample size
beta_true <- c(2,3,4) # true beta vector
sigma     <- 3        # true sd of the error term
## Lets generate a data set from our data generating process
mydata <- myDataGenerator(n = n, beta=beta_true)
X_cond <- cbind(mydata$X_1, mydata$X_2, mydata$X_3)
## True mean and variance of the true normal distribution
## of beta_hat_2|X:
# true mean
beta_true_2 <- beta_true[2]
# true variance
var_true_beta_2 <- sigma^2 * diag(solve(t(X_cond) %*% X_cond))[2]
## Lets generate 5000 realizations from beta_hat_2
## conditionally on X and check whether their
## distribution is close to the true normal distribution
rep <- 5000 # MC replications
beta_hat_2 <- rep(NA, times=rep)
##
for(r in 1:rep){
  MC_data <- myDataGenerator(n    = n,
                             beta = beta_true,
                             X    = X_cond)
  lm_obj        <- lm(Y ~ X_2 + X_3, data = MC_data)
  beta_hat_2[r] <- coef(lm_obj)[2]
}
## Compare
## True beta_2 versus average of beta_hat_2 estimates
beta_true_2
#> [1] 3
round(mean(beta_hat_2), 4)
#> [1] 3.0091
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates
round(var_true_beta_2, 4)
#> [1] 0.416
round(var(beta_hat_2), 4)
#> [1] 0.4235
## True normal distribution of beta_hat_2 versus
## empirical density of beta_hat_2 estimates
library(scales)
curve(expr = dnorm(x, mean = beta_true_2, sd=sqrt(var_true_beta_2)),
      xlab="", ylab="", col=gray(.2), lwd=3, lty=1,
      xlim=range(beta_hat_2), ylim=c(0,1.1))
lines(density(beta_hat_2, bw = bw.SJ(beta_hat_2)),
      col=alpha("blue",.5), lwd=3)
legend("topleft", lty=c(1,1), lwd=c(3,3),
       col=c(gray(.2), alpha("blue",.5)), bty="n", legend=
       c(expression("Theoretical Gaussian Density of"~hat(beta)[2]~"|"~X),
         expression("Empirical Density Estimation based on MC realizations from"~hat(beta)[2]~"|"~X)))

[Figure: theoretical Gaussian density of beta_hat_2 | X together with the empirical density estimate based on the Monte Carlo realizations of beta_hat_2 | X; the two densities are very close.]
Great! The nonparametric density estimation (estimated via density()) computed from the simulated realizations of $\hat\beta_2\,|\,X$ indicates that $\hat\beta_2\,|\,X$ really is normally distributed, as described by our theoretical result in Equation (4.1).

However, what would happen if we sampled unconditionally on $X$? How would the distribution of $\hat\beta_2$ then look?
set.seed(123)
## Lets generate 5000 realizations from beta_hat_2
## WITHOUT conditioning on X
beta_hat_2_uncond <- rep(NA, times=rep)
##
for(r in 1:rep){
  MC_data <- myDataGenerator(n    = n,
                             beta = beta_true)
  lm_obj               <- lm(Y ~ X_2 + X_3, data = MC_data)
  beta_hat_2_uncond[r] <- coef(lm_obj)[2]
}
## Compare
## True beta_2 versus average of beta_hat_2 estimates
beta_true_2
#> [1] 3
round(mean(beta_hat_2_uncond), 4)
#> [1] 2.9973
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates
round(var_true_beta_2, 4)
#> [1] 0.416
round(var(beta_hat_2_uncond), 4)
#> [1] 0.2521

## Plot
curve(expr = dnorm(x, mean = beta_true_2, sd=sqrt(var_true_beta_2)),
      xlab="", ylab="", col=gray(.2), lwd=3, lty=1,
      xlim=range(beta_hat_2_uncond), ylim=c(0,1.1))
lines(density(beta_hat_2_uncond, bw=bw.SJ(beta_hat_2_uncond)),
      col=alpha("blue",.5), lwd=3)
legend("topleft", lty=c(1,1), lwd=c(3,3),
       col=c(gray(.2), alpha("blue",.5)), bty="n", legend=
       c(expression("Theoretical Gaussian Density of"~hat(beta)[2]~"|"~X),
         expression("Empirical Density Estimation based on MC realizations from"~hat(beta)[2])))

[Figure: theoretical Gaussian density of beta_hat_2 | X versus the empirical density of the unconditional Monte Carlo realizations of beta_hat_2; the two densities clearly differ.]
Not good. Since we do not condition on $X$, the realizations of $X$ affect the distribution of $\hat\beta_2$, and our theoretical Gaussian distribution result in Equation (4.1) does not apply anymore.
4.7.2 Testing Multiple Parameters
In the following, we do inference about multiple parameters. We test
$$H_0:\ \beta_2=3\ \text{and}\ \beta_3=4$$
versus
$$H_A:\ \beta_2\neq3\ \text{and/or}\ \beta_3\neq4.$$
Or equivalently
$$H_0:\ R\beta-r=0\qquad\text{versus}\qquad H_A:\ R\beta-r\neq0,$$
where
$$R=\begin{pmatrix}0&1&0\\0&0&1\end{pmatrix}\qquad\text{and}\qquad r=\begin{pmatrix}3\\4\end{pmatrix}.$$
The following R code can be used to test this hypothesis:
suppressMessages(library(car)) # for linearHypothesis()
# ?linearHypothesis
## Estimate the linear regression model parameters
lm_obj <- lm(Y ~ X_2 + X_3, data = mydata)
## Option 1:
car::linearHypothesis(model = lm_obj,
                      hypothesis.matrix = c("X_2=3", "X_3=4"))
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
#> X_3 = 4
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#>   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
#> 1      9 87.285
#> 2      7 37.599  2    49.686 4.6252 0.05246 .
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## Option 2:
R <- rbind(c(0,1,0), c(0,0,1))
car::linearHypothesis(model = lm_obj,
                      hypothesis.matrix = R,
                      rhs = c(3,4))
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
#> X_3 = 4
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#>   Res.Df    RSS Df Sum of Sq      F  Pr(>F)
#> 1      9 87.285
#> 2      7 37.599  2    49.686 4.6252 0.05246 .
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Not surprisingly, we cannot reject the null hypothesis at a significance level of, for instance, $\alpha=0.05$, since we actually test the true null hypothesis. However, in repeated samples we should nevertheless observe $\alpha\cdot100\%$ Type I errors (false rejections of $H_0$). Let's check this using the following Monte Carlo simulation:
## Lets generate 5000 F-test decisions and check
## whether the empirical rate of type I errors is
## close to the theoretical significance level.
rep <- 5000 # MC replications
F_test_pvalues <- rep(NA, times=rep)
##
for(r in 1:rep){
  ## generate new MC_data conditionally on X_cond
  MC_data <- myDataGenerator(n    = n,
                             beta = beta_true,
                             X    = X_cond)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  ## save the p-value
  p <- linearHypothesis(lm_obj, c("X_2=3", "X_3=4"))$`Pr(>F)`[2]
  F_test_pvalues[r] <- p
}
##
signif_level <- 0.05
rejections <- F_test_pvalues[F_test_pvalues < signif_level]
round(length(rejections)/rep, 3)
#> [1] 0.05
##
signif_level <- 0.01
rejections <- F_test_pvalues[F_test_pvalues < signif_level]
round(length(rejections)/rep, 3)
#> [1] 0.009
Note that this is actually a very strong result. First, it means that we correctly control the Type I error rate, since the Type I error rate is not larger than the chosen significance level $\alpha$. Second, it means that the test is not conservative (i.e. it is efficient), since the Type I error rate is approximately equal to the chosen significance level $\alpha$. (In fact, if we increased the number of Monte Carlo repetitions, the empirical Type I error rate would converge to the selected significance level due to the law of large numbers.)
Next, we check how well the F test detects certain violations of the null hypothesis. We do this by using the same data generating process, but by testing the following incorrect null hypothesis:
$$H_0:\ \beta_2=4\ \text{and}\ \beta_3=4\qquad\qquad H_A:\ \beta_2\neq4\ \text{and/or}\ \beta_3\neq4$$
set.seed(321)
rep <- 5000 # MC replications
F_test_pvalues <- rep(NA, times=rep)
##
for(r in 1:rep){
  ## generate new MC_data conditionally on X_cond
  MC_data <- myDataGenerator(n    = n,
                             beta = beta_true,
                             X    = X_cond)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  ## save p-values of all rep-many tests
  F_test_pvalues[r] <- linearHypothesis(lm_obj,
                                        c("X_2=4","X_3=4"))$`Pr(>F)`[2]
}
##
signif_level <- 0.05
rejections <- F_test_pvalues[F_test_pvalues < signif_level]
length(rejections)/rep
#> [1] 0.3924
Indeed, we can now reject the (false) null hypothesis in approximately 39% of all resamplings from the true data generating process. Caution: This also means that we are not able to see the violation of the null hypothesis in 100% - 39% = 61% of cases. Therefore, we can never use an insignificant test result (p-value > $\alpha$) as a justification to accept the null hypothesis.
Moreover, note that the F-test is not informative about which part of the null hypothesis ($\beta_2=4$ and/or $\beta_3=4$) is violated; we only get the information that at least one of the multiple parameter hypotheses is violated:
car::linearHypothesis(lm_obj, c("X_2=4", "X_3=4"))
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 4
#> X_3 = 4
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#>   Res.Df     RSS Df Sum of Sq      F   Pr(>F)
#> 1      9 141.245
#> 2      7  32.572  2    108.67 11.677 0.005889 **
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4.7.3 Duality of Confidence Intervals and Hypothesis Tests

Confidence intervals can be computed using R as follows:
signif_level <- 0.05
## 95% CI for beta_2
confint(lm_obj, parm = "X_2", level = 1 - signif_level)
#>        2.5 %   97.5 %
#> X_2 1.370315 3.563536
## 95% CI for beta_3
confint(lm_obj, parm = "X_3", level = 1 - signif_level)
#>        2.5 %   97.5 %
#> X_3 3.195389 4.695134
We can use these two-sided confidence intervals to do hypothesis tests. For instance, when testing the null hypothesis
$$H_0:\ \beta_3=4\qquad\text{versus}\qquad H_A:\ \beta_3\neq4,$$
we can check whether the confidence interval $CI_{1-\alpha}$ contains the hypothetical value 4 or not. In case of $4\in CI_{1-\alpha}$, we cannot reject the null hypothesis. In case of $4\notin CI_{1-\alpha}$, we reject the null hypothesis.

If Assumptions 1-4' hold true, then $CI_{1-\alpha}$ is an exact confidence interval. That is, under the null hypothesis, it falsely rejects the null hypothesis in only $\alpha\cdot100\%$ of resamplings. Let's check this in the following R code:
## Lets generate 5000 CIs
set.seed(123)
signif_level <- 0.05
rep <- 5000 # MC replications
confint_m <- matrix(NA, nrow=2, ncol=rep)
##
for(r in 1:rep){
  ## generate new MC_data conditionally on X_cond
  MC_data <- myDataGenerator(n    = n,
                             beta = beta_true,
                             X    = X_cond)
  lm_obj <- lm(Y ~ X_2 + X_3, data = MC_data)
  ## save the confidence interval
  CI <- confint(lm_obj, parm="X_2", level=1-signif_level)
  confint_m[,r] <- CI
}
##
inside_CI <- confint_m[1,] <= beta_true_2 & beta_true_2 <= confint_m[2,]
## CI-lower, CI-upper, beta_true_2 inside?
head(cbind(t(confint_m), inside_CI))
#>                          inside_CI
#> [1,] 0.8555396 3.639738          1
#> [2,] 0.9143542 3.270731          1
#> [3,] 1.9336526 4.984167          1
#> [4,] 1.9985874 3.812695          1
#> [5,] 3.0108642 5.621791          0
#> [6,] 2.0967675 4.716398          1
round(length(inside_CI[inside_CI == FALSE])/rep, 2)
#> [1] 0.05
nCIs <- 100
plot(x=0, y=0, type="n", xlim=c(0,nCIs), ylim=range(confint_m[,1:nCIs]),
     ylab="", xlab="Resamplings", main="Confidence Intervals")
for(r in 1:nCIs){
  if(inside_CI[r]==TRUE){
    lines(x=c(r,r), y=c(confint_m[1,r], confint_m[2,r]),
          lwd=2, col=gray(.5,.5))
  }else{
    lines(x=c(r,r), y=c(confint_m[1,r], confint_m[2,r]),
          lwd=2, col="darkred")
  }
}
axis(4, at=beta_true_2, labels = expression(beta[2]))
abline(h=beta_true_2)

[Figure: "Confidence Intervals" for the first 100 resamplings; intervals that miss the true value beta_2 are highlighted in dark red.]

Chapter 5

Large Sample Inference

5.1 Tools for Asymptotic Statistics

5.1.1 Modes of Convergence

In the following we will discuss the four most important convergence concepts for sequences of random variables $(z_1,z_2,\dots,z_n)$, shortly denoted by $\{z_n\}$. Non-random scalars (or vectors or matrices) will be denoted by Greek letters such as $\alpha$.

Four Important Modes of Convergence

1. Convergence in Probability: A sequence of random scalars $\{z_n\}$ converges in probability to a constant (non-random) $\alpha$ if, for any (arbitrarily small) $\epsilon>0$,
$$\lim_{n\to\infty}P(|z_n-\alpha|>\epsilon)=0.$$
We write: $\operatorname{plim}_{n\to\infty}z_n=\alpha$, or $z_n\overset{p}{\to}\alpha$. Convergence in probability of a sequence of random vectors (or matrices) $\{z_n\}$ to a constant vector (or matrix) requires element-wise convergence in probability.

2. Almost Sure Convergence: A sequence of random scalars $\{z_n\}$ converges almost surely to a constant (non-random) $\alpha$ if
$$P\Big(\lim_{n\to\infty}z_n=\alpha\Big)=1.$$
We write: $z_n\overset{a.s.}{\to}\alpha$. Almost sure convergence of a sequence of random vectors (or matrices) $\{z_n\}$ to a constant vector (or matrix) requires element-wise almost sure convergence.

Note. Almost sure convergence is (usually) rather hard to derive, since the probability is about an event concerning an infinite sequence. Fortunately, however, there are established strong laws of large numbers that we can use for showing almost sure convergence.
3. Convergence in Mean Square: A sequence of random scalars $\{z_n\}$ converges in mean square (or in quadratic mean) to a constant (non-random) $\alpha$ if
$$\lim_{n\to\infty}E\big((z_n-\alpha)^2\big)=0.$$
We write: $z_n\overset{m.s.}{\to}\alpha$. If $z_n$ is an estimator (e.g., $z_n=\hat\beta_{k,n}$), the expression $E((z_n-\alpha)^2)$ is termed the mean squared error: $\operatorname{MSE}(z_n)=E((z_n-\alpha)^2)$. Mean square convergence of a sequence of random vectors (or matrices) $\{z_n\}$ to a deterministic vector (or matrix) requires element-wise mean square convergence.

4. Convergence in Distribution: Let $F_n$ be the cumulative distribution function (cdf) of $z_n$ and $F$ the cdf of $z$. A sequence of random scalars $\{z_n\}$ converges in distribution to a random scalar $z$ if, for all $t$ such that $F(t)$ is continuous at $t$,
$$\lim_{n\to\infty}F_n(t)=F(t).$$
We write: $z_n\overset{d}{\to}z$ and call $F$ the asymptotic (or limit) distribution of $z_n$. Sometimes you will see statements like $z_n\overset{d}{\to}N(0,1)$ or $z_n\overset{a}{\sim}N(0,1)$, which should be read as $z_n\overset{d}{\to}z$, where $z\sim N(0,1)$.
Note. A stochastic sequence $\{z_n\}$ can also converge in distribution to a deterministic scalar $\alpha$. In this case $\alpha$ is treated as a degenerate random variable with cdf
$$F(t)=\begin{cases}1 & \text{if } t\geq\alpha,\\ 0 & \text{if } t<\alpha.\end{cases}$$

Multivariate Convergence in Distribution: Let $z_n,z\in\mathbb{R}^K$ be $K$-dimensional random variables; then
$$z_n\overset{d}{\to}z\quad\text{if and only if}\quad\lambda^\top z_n\overset{d}{\to}\lambda^\top z\ \text{ for any }\lambda\in\mathbb{R}^K.$$
This statement is known as the Cramer-Wold Device. It is needed since element-wise convergence in distribution does generally not imply convergence of the joint distribution of $z_n$ to the joint distribution of $z$, except if all elements are independent from each other.

Relations among Modes of Convergence

Lemma 5.1.1. Relationship among the four modes of convergence:
(i) $z_n\overset{m.s.}{\to}\alpha\ \Rightarrow\ z_n\overset{p}{\to}\alpha$.
(ii) $z_n\overset{a.s.}{\to}\alpha\ \Rightarrow\ z_n\overset{p}{\to}\alpha$.
(iii) $z_n\overset{d}{\to}\alpha\ \Leftrightarrow\ z_n\overset{p}{\to}\alpha$. I.e., if the limiting random variable is a constant (i.e., a degenerate random variable), then convergence in distribution is equivalent to convergence in probability.
Proofs can be found, e.g., here: https://www.statlect.com/asymptotic-theory/relations-among-modes-of-convergence

5.1.2 Continuous Mapping Theorem (CMT)

Lemma 5.1.2. Preservation of convergence for continuous transformations (or "continuous mapping theorem (CMT)"): Suppose $\{z_n\}$ is a stochastic sequence of random scalars, vectors, or matrices and that $a(\cdot)$ is a continuous function that does not depend on $n$. Then
(i) $z_n\overset{p}{\to}\alpha\ \Rightarrow\ a(z_n)\overset{p}{\to}a(\alpha)$
(ii) $z_n\overset{a.s.}{\to}\alpha\ \Rightarrow\ a(z_n)\overset{a.s.}{\to}a(\alpha)$
(iii) $z_n\overset{d}{\to}z\ \Rightarrow\ a(z_n)\overset{d}{\to}a(z)$
The proof can be found, e.g., in Asymptotic Statistics, van der Vaart (1998), Theorem 2.3. Or here: https://www.statlect.com/asymptotic-theory/continuous-mapping-theorem

Note. The CMT does not hold for m.s.-convergence, except for the case where $a(\cdot)$ is linear.

Examples. As a consequence of the CMT (Lemma 5.1.2) we have that the usual arithmetic operations preserve convergence in probability (equivalently for almost sure convergence and convergence in distribution):
If $x_n\overset{p}{\to}\alpha$ and $y_n\overset{p}{\to}\gamma$, then $x_n+y_n\overset{p}{\to}\alpha+\gamma$.
If $x_n\overset{p}{\to}\alpha$ and $y_n\overset{p}{\to}\gamma$, then $x_ny_n\overset{p}{\to}\alpha\gamma$.
If $x_n\overset{p}{\to}\alpha$ and $y_n\overset{p}{\to}\gamma$ with $\gamma\neq0$, then $x_n/y_n\overset{p}{\to}\alpha/\gamma$.
If $X_n^\top X_n\overset{p}{\to}X^\top X$, then $(X_n^\top X_n)^{-1}\overset{p}{\to}(X^\top X)^{-1}$, provided $X^\top X$ is a non-singular matrix.

5.1.3 Slutsky Theorem

The following results are concerned with combinations of convergence in probability and convergence in distribution. These are particularly important for the derivation of the asymptotic distribution of estimators.

Lemma 5.1.3 (Slutsky Theorem). Let $x_n$ and $y_n$ denote sequences of random scalars or vectors and let $A_n$ denote a sequence of random matrices. Moreover, $\gamma$ and $A$ are deterministic limits of appropriate dimensions and $x$ is a random limit of appropriate dimension.
(i) If $x_n\overset{d}{\to}x$ and $y_n\overset{p}{\to}\gamma$, then $x_n+y_n\overset{d}{\to}x+\gamma$.
(ii) If $x_n\overset{d}{\to}x$ and $y_n\overset{p}{\to}0$, then $x_n^\top y_n\overset{p}{\to}0$.
(iii) If $x_n\overset{d}{\to}x$ and $A_n\overset{p}{\to}A$, then $A_nx_n\overset{d}{\to}Ax$, where it is assumed that $A_n$ and $x_n$ are conformable (i.e., the matrix and vector dimensions fit to each other).
Important special case: If $x_n\overset{d}{\to}N(0,\Sigma)$ and $A_n\overset{p}{\to}A$, then $A_nx_n\overset{d}{\to}N(0,A\Sigma A^\top)$.
Proofs can be found, e.g., in Asymptotic Statistics, van der Vaart (1998), Theorem 2.8. Or here: https://www.statlect.com/asymptotic-theory/Slutsky-theorem

Note. Sometimes only parts (i) and (ii) of Lemma 5.1.3 are called Slutsky's theorem.

5.1.4 Law of Large Numbers (LLN) and Central Limit Theorem (CLT)

So far, we discussed the definitions of the four most important convergence modes, their relations among each other, and basic theorems (CMT and Slutsky) about functionals of stochastic sequences. Though, we still lack tools that allow us to actually show that a stochastic sequence converges (in some of the discussed modes) to some limit.

In the following we consider the stochastic sequences $\{\bar z_n\}$ of sample means $\bar z_n=n^{-1}\sum_{i=1}^nz_i$, where $z_i$, $i=1,\dots,n$, are (scalar, vector, or matrix-valued) random variables.
Remember: the sample mean $\bar z_n$ is an estimator of the deterministic population mean $\mu$. Weak LLNs, strong LLNs, and CLTs tell us conditions under which arithmetic means $\bar z_n=n^{-1}\sum_{i=1}^nz_i$ converge:
$$\bar z_n\overset{p}{\to}\mu\ \text{ (weak LLN)}\qquad
\bar z_n\overset{a.s.}{\to}\mu\ \text{ (strong LLN)}\qquad
\sqrt{n}\,(\bar z_n-\mu)\overset{d}{\to}N(0,\sigma^2)\ \text{ (CLT)}$$
In the following we introduce the most well-known versions of the weak LLN, the strong LLN, and the CLT.

Theorem 5.1.4 (Weak LLN (Chebychev)). If $\lim_{n\to\infty}E(\bar z_n)=\mu$ and $\lim_{n\to\infty}\operatorname{Var}(\bar z_n)=0$, then $\bar z_n\overset{p}{\to}\mu$.
A proof can be found, for instance, here: https://www.statlect.com/asymptotic-theory/law-of-large-numbers

Theorem 5.1.5 (Strong LLN (Kolmogorov)). If $\{z_i\}$ is an i.i.d. sequence and $E(z_i)=\mu$, then $\bar z_n\overset{a.s.}{\to}\mu$.
A proof can be found, e.g., in Linear Statistical Inference and Its Applications, Rao (1973), pp. 112-114.

Note. The weak and the strong LLN for random vectors follow from requiring element-by-element convergence.

Theorem 5.1.6 (CLT (Lindeberg-Levy)). If $\{z_i\}$ is an i.i.d. sequence with $E(z_i)=\mu$ and $\operatorname{Var}(z_i)=\sigma^2$, then
$$\sqrt{n}\,(\bar z_n-\mu)\overset{d}{\to}N(0,\sigma^2)\quad\text{as }n\to\infty.$$
A proof can be found, e.g., in Asymptotic Statistics, van der Vaart (1998), Theorem 2.17.

The Lindeberg-Levy CLT for $K$-dimensional random vectors follows from our above discussion on multivariate convergence in distribution. From this we know that if $z_n\in\mathbb{R}^K$ and $\mu\in\mathbb{R}^K$, then
$$\sqrt{n}\,(\bar z_n-\mu)\overset{d}{\to}N(0,\Sigma)\quad\Longleftrightarrow\quad\sqrt{n}\,(\lambda^\top\bar z_n-\lambda^\top\mu)\overset{d}{\to}N(0,\lambda^\top\Sigma\lambda)\ \text{ for any }\lambda\in\mathbb{R}^K.$$
That is, to apply the Lindeberg-Levy CLT (Theorem 5.1.6) to multivariate (e.g., $K$-dimensional) stochastic sequences, we need to check whether the univariate stochastic sequence $\{\lambda^\top z_i\}$ is i.i.d. with $E(\lambda^\top z_i)=\lambda^\top\mu$ and $\operatorname{Var}(\lambda^\top z_i)=\lambda^\top\Sigma\lambda$ for any $\lambda\in\mathbb{R}^K$. This is the case if the multivariate ($K$-dimensional) stochastic sequence $\{z_i\}$ is an i.i.d. sequence with $E(z_i)=\mu$ and $\operatorname{Var}(z_i)=\Sigma$.

Note. The LLNs and the CLT are stated with respect to sequences of sample means $\{\bar z_n\}$, i.e., the simplest estimators you can probably think of. We will see, however, that this is all we need in order to analyze also more complicated estimators such as the OLS estimator.

5.1.5 Estimators as Sequences of Random Variables

Our concepts above readily apply to general scalar-valued (univariate) or vector-valued ($K$-dimensional) estimators, say $\hat\theta_n\in\mathbb{R}^K$, that are computed from i.i.d. random samples.

(Weak) Consistency: We say that an estimator $\hat\theta_n$ is (weakly) consistent for $\theta$ if
$$\hat\theta_n\overset{p}{\to}\theta\quad\text{as }n\to\infty.$$
Asymptotic Bias: The asymptotic bias of an estimator $\hat\theta_n$ of some true parameter $\theta$ is defined as
$$\operatorname{ABias}(\hat\theta_n)=\lim_{n\to\infty}E(\hat\theta_n)-\theta.$$
If $\operatorname{ABias}(\hat\theta_n)=0$, then $\hat\theta_n$ is called asymptotically unbiased.
Asymptotic Normality: A consistent estimator $\hat\theta_n$ is asymptotically normally distributed if
$$\sqrt{n}\,(\hat\theta_n-\theta)\overset{d}{\to}N(0,\Sigma)\quad\text{as }n\to\infty,$$
where $\lim_{n\to\infty}\operatorname{Var}\big(\sqrt{n}(\hat\theta_n-\theta)\big)=\lim_{n\to\infty}\operatorname{Var}(\sqrt{n}\,\hat\theta_n)=\Sigma$ is called the asymptotic variance of $\sqrt{n}(\hat\theta_n-\theta)$.
$\sqrt{n}$-consistent: Consistent estimators $\hat\theta_n\overset{p}{\to}\theta$ are called $\sqrt{n}$-consistent if
$$\sqrt{n}\,(\hat\theta_n-\theta)\overset{d}{\to}z\quad\text{as }n\to\infty.$$
If additionally the random vector $z$ is normally distributed, then $\hat\theta_n$ is often called consistent and asymptotically normal.

5.2 Asymptotics under the Classic Regression Model

Given the above introduced machinery on asymptotic concepts and results, we can now prove that the OLS estimator $\hat\beta$ and $s^2_{UB}$, applied to the classic regression model (defined by Assumptions 1-4), are consistent and asymptotically normally distributed estimators as $n\to\infty$. That is, we can drop the unrealistic normality and spherical errors assumption (Assumption 4'), but still use the usual test statistics (t-test, F-test), as long as the sample size $n$ is large.

However, before we can formally state the asymptotic properties, we first need to adjust our data generating process assumption (Assumption 1) such that we can apply Kolmogorov's strong LLN and Lindeberg-Levy's CLT.
5.1.5 Estimators as Sequences of Random Variables

Our concepts above readily apply to general scalar-valued (univariate) or vector-valued (K-dimensional) estimators, say θ̂_n ∈ R^K, that are computed from i.i.d. random samples.

(Weak) Consistency: We say that an estimator θ̂_n is (weakly) consistent for θ if
θ̂_n →p θ   as n → ∞.

Asymptotic Bias: The asymptotic bias of an estimator θ̂_n of some true parameter θ is defined as
ABias(θ̂_n) = lim_{n→∞} E(θ̂_n) − θ.
If ABias(θ̂_n) = 0, then θ̂_n is called asymptotically unbiased.

Asymptotic Normality: A consistent estimator θ̂_n is asymptotically normally distributed if
√n (θ̂_n − θ) →d N(0, Σ)   as n → ∞,
where lim_{n→∞} Var(√n (θ̂_n − θ)) = Σ is called the asymptotic variance of √n (θ̂_n − θ).

√n-consistent: Consistent estimators θ̂_n →p θ are called √n-consistent if
√n (θ̂_n − θ) →d z   as n → ∞.
If additionally the random vector z is normally distributed, then θ̂_n is often called consistent and asymptotically normal.

5.2 Asymptotics under the Classic Regression Model

Given the above introduced machinery on asymptotic concepts and results, we can now prove that the OLS estimators β̂ and s²_UB applied to the classic regression model (defined by Assumptions 1-4) are consistent and asymptotically normally distributed estimators as n → ∞. That is, we can drop the unrealistic normality and spherical errors assumption (Assumption 4*), but still use the usual test statistics (t-test, F-test), as long as the sample size n is large.

However, before we can formally state the asymptotic properties, we first need to adjust our data generating process assumption (Assumption 1) such that we can apply Kolmogorov's strong LLN and Lindeberg-Levy's CLT. Second, we need to adjust the rank assumption (Assumption 3), such that the full column rank of X is guaranteed for the limiting case as n → ∞, too. Assumptions 2 and 4 from Chapter 3 are assumed to hold.

Assumption 1*: Data Generating Process for Asymptotics. Assumption 1 applies, but additionally we assume that (X_i, ε_i) (or equivalently (Y_i, X_i)) is i.i.d. for all i = 1,…,n, with existing and finite second moments for X_i and fourth moments for ε_i. (Note: The fourth moment of ε_i is actually only needed for Theorem 5.2.4; for the rest, two moments are sufficient.)

Note. The above adjustment of Assumption 1 is far less restrictive than assuming that the error terms are normally distributed (as is necessary for small samples).

Assumption 3*: Rank Condition for Asymptotics. The (K × K) limiting matrix Σ_X'X = E(X_i X_i') has full rank K; i.e., Σ_X'X is nonsingular and therefore invertible. (Note, this assumption does not assume that S_X'X = n^{-1} X'X →p Σ_X'X holds; it assumes that the limiting matrix is of full rank K.)

Note. The crucial part of Assumption 3* is that the limit matrix Σ_X'X has full rank K. The convergence in probability statement (S_X'X →p Σ_X'X) follows already from Assumption 1*.

Note. Assumption 3* implies the existence and finiteness of the first two moments of X_i (even without Assumption 1*).

Theorem 5.2.1 (Consistency of S_X'X^{-1}). Under Assumptions 1* and 3* we have that
S_X'X^{-1} = (n^{-1} X'X)^{-1} →p Σ_X'X^{-1}   as n → ∞.
Proof is done in the lecture.

Theorem 5.2.2 (Consistency of β̂). Under Assumptions 1*, 2 and 3* we have that
β̂_n →p β   as n → ∞.
Proof is done in the lecture.

Furthermore, we can show that the appropriately scaled (by √n) sampling error β̂ − β of the OLS estimator is asymptotically normally distributed.

Theorem 5.2.3 (Sampling error limiting normality (the classic case)). For simplicity, we consider here the special case of spherical errors (Var(ε|X) = σ² I_n). Under Assumptions 1*, 2 and 3* we then have that
√n (β̂_n − β) →d N(0, σ² Σ_X'X^{-1})   as n → ∞.
Proof is done in the lecture.

In principle, we can derive the usual test statistics from the latter result. Though, as long as we do not know σ² and Σ_X'X (we usually don't), we need to plug in the (consistent!) estimators S_X'X^{-1} and s²_UB, where the consistency of the former estimator is provided by Theorem 5.2.1 and the consistency of s²_UB is provided by the following result.

Theorem 5.2.4 (Consistency of s²_UB).
s²_UB →p σ²   as n → ∞.
Proof is skipped, but a detailed proof can be found here: https://www.statlect.com/fundamentals-of-statistics/OLS-estimator-properties

5.2.1 The Case of Heteroscedasticity

Theorem 5.2.3 can also be stated and proved for conditionally heteroscedastic error terms. In this case one gets
√n (β̂_n − β) →d N(0, Σ_X'X^{-1} E(ε_i² X_i X_i') Σ_X'X^{-1})   as n → ∞,   (5.1)
where Σ_X'X^{-1} E(ε_i² X_i X_i') Σ_X'X^{-1} (i.e., the asymptotic variance of √n (β̂_n − β)) is usually unknown and needs to be estimated from the data by
S_X'X^{-1} Ê(ε_i² X_i X_i') S_X'X^{-1} →p Σ_X'X^{-1} E(ε_i² X_i X_i') Σ_X'X^{-1}   as n → ∞,
where Ê(ε_i² X_i X_i') denotes some consistent estimator of E(ε_i² X_i X_i'), such as one of the following choices:
HC0: Ê(ε_i² X_i X_i') = (1/n) Σ_{i=1}^n ε̂_i² X_i X_i'
HC1: Ê(ε_i² X_i X_i') = (1/n) · n/(n−K) · Σ_{i=1}^n ε̂_i² X_i X_i'
HC2: Ê(ε_i² X_i X_i') = (1/n) Σ_{i=1}^n [ε̂_i² / (1 − h_i)] X_i X_i'
HC3: Ê(ε_i² X_i X_i') = (1/n) Σ_{i=1}^n [ε̂_i² / (1 − h_i)²] X_i X_i'   (most often used)
HC4: Ê(ε_i² X_i X_i') = (1/n) Σ_{i=1}^n [ε̂_i² / (1 − h_i)^{δ_i}] X_i X_i'
(These are the heteroscedasticity-consistent robust estimators from Chapter 3.6.)

Note. In order to show that any of the above versions (HC0-HC4) of Ê(ε_i² X_i X_i') is a consistent estimator of E(ε_i² X_i X_i'), we actually need to assume that the explanatory variables in X have finite fourth moments (see Hayashi, 2000, Chapter 2.5). So, for this, we would need to make our Assumption 1* more restrictive (there, only two moments are assumed for X).
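To make the HC formulas concrete, here is a minimal sketch of mine (the toy model, variable names and sample size are made up) that computes the HC0 and HC3 versions of Ê(ε̂_i² X_i X_i') by hand, turns them into the robust covariance matrix S_X'X^{-1} Ê(·) S_X'X^{-1} / n, and compares the result with sandwich::vcovHC().

## Hand-coded HC0/HC3 robust covariance for a toy regression (illustration only).
library(sandwich)
set.seed(1)
n   <- 200
x   <- runif(n, -2, 2)
y   <- 1 + 2 * x + rnorm(n, sd = abs(x))        # heteroscedastic errors
fit <- lm(y ~ x)
X   <- model.matrix(fit)
e   <- resid(fit)
h   <- hatvalues(fit)                           # leverages h_i
S_inv <- solve(crossprod(X) / n)                # S_X'X^{-1}
E_hc0 <- crossprod(X * e) / n                   # (1/n) sum e_i^2 x_i x_i'
E_hc3 <- crossprod(X * (e / (1 - h))) / n       # (1/n) sum (e_i/(1-h_i))^2 x_i x_i'
V_hc0 <- S_inv %*% E_hc0 %*% S_inv / n
V_hc3 <- S_inv %*% E_hc3 %*% S_inv / n
## should agree with the sandwich package (up to numerical precision):
all.equal(V_hc0, vcovHC(fit, type = "HC0"), check.attributes = FALSE)
all.equal(V_hc3, vcovHC(fit, type = "HC3"), check.attributes = FALSE)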
5.2.2 Hypothesis Testing and Confidence Intervals

From our asymptotic results under the classic regression model (Assumptions 1*, 2, 3*, and 4) we get the following results, important for testing statistical hypotheses.

5.2.2.1 Robust Hypothesis Testing: Multiple Parameters

Let us reconsider the following system of q-many null hypotheses:
H0: R β − r = 0,
where R is a (q × K) matrix and r = (r_1,…,r_q)' is a q-vector, both chosen by the statistician to specify her/his null hypothesis about the unknown true parameter vector β. To make sure that there are no redundant equations, it is required that rank(R) = q.

By contrast to the multiple parameter tests for small samples (see Chapter 4.1), we can work here with a heteroscedasticity robust test statistic which is applicable for heteroscedastic error terms:
W = n (Rβ̂ − r)' [R S_X'X^{-1} Ê(ε̂_i² X_i X_i') S_X'X^{-1} R']^{-1} (Rβ̂ − r) →d χ²(q) under H0   (5.2)
as n → ∞. The price to pay is that the distribution of the test statistic under the null hypothesis is only valid asymptotically for large n. That is, the critical values taken from the asymptotic distribution will be useful only for large sample sizes. In case of homoscedastic error terms, one can substitute S_X'X^{-1} Ê(ε̂_i² X_i X_i') S_X'X^{-1} by s²_UB S_X'X^{-1}.

Finite-sample correction. In order to improve the finite-sample performance of this test, one usually uses the F_{q,n−K} distribution with q and n − K degrees of freedom instead of the χ²(q) distribution. Asymptotically (n → ∞) F_{q,n−K} is equivalent to χ²(q). However, for finite sample sizes n (i.e., the practically relevant case) F_{q,n−K} leads to larger critical values, which helps to account for the estimation errors in S_X'X^{-1} Ê(ε̂_i² X_i X_i') S_X'X^{-1} (or in s²_UB S_X'X^{-1}), which are otherwise neglected by the pure asymptotic perspective.

5.2.2.2 Robust Hypothesis Testing: Single Parameters

Let us reconsider the case of hypotheses about only one parameter β_k, with k = 1,…,K:
H0: β_k = r    versus    HA: β_k ≠ r.
Selecting the kth diagonal element of the test statistic in (5.2) and taking the square root yields
t = √n (β̂_k − r) / √( [S_X'X^{-1} Ê(ε̂_i² X_i X_i') S_X'X^{-1}]_kk ) →d N(0, 1) under H0.
This t test statistic allows for heteroscedastic error terms. In case of homoscedastic error terms, one can substitute [S_X'X^{-1} Ê(ε̂_i² X_i X_i') S_X'X^{-1}]_kk by s²_UB [S_X'X^{-1}]_kk.

Finite-sample correction. In order to improve the finite-sample performance of this t test, one usually uses the t_(n−K) distribution with n − K degrees of freedom instead of the N(0,1) distribution. Asymptotically (n → ∞) t_(n−K) is equivalent to N(0,1). However, for finite sample sizes n (i.e., the practically relevant case) t_(n−K) leads to larger critical values, which helps to account for the estimation errors in [S_X'X^{-1} Ê(ε̂_i² X_i X_i') S_X'X^{-1}]_kk (or in s²_UB [S_X'X^{-1}]_kk), which are otherwise neglected by the pure asymptotic perspective.

5.3 Robust Confidence Intervals

Following the derivations in Chapter 4.6, but using the expression for the robust standard errors, we get the following heteroscedasticity robust (random)
(1 − α) · 100% confidence interval:
CI_{1−α} = [ β̂_k ± t_{1−α/2, n−K} · √( n^{-1} [S_X'X^{-1} Ê(ε̂_i² X_i X_i') S_X'X^{-1}]_kk ) ].
Here, the coverage probability is an asymptotic coverage probability with
P(β_k ∈ CI_{1−α}) → 1 − α   as n → ∞.

5.4 Practice: Large Sample Inference

Let's apply the above asymptotic inference methods using R. As in Chapter 4.7 we, first, program a function myDataGenerator() which allows us to generate data from the following model, i.e., from the following fully specified data generating process:
Y_i = β_1 + β_2 X_i2 + β_3 X_i3 + ε_i,
β = (β_1, β_2, β_3)' = (2, 3, 4)'
X_i2 ~ U[−4, 4]
X_i3 ~ U[−5, 5]
ε_i ~ U[−0.5 |X_i2|, 0.5 |X_i2|],   i = 1,…,n,
where (Y_i, X_i) is assumed i.i.d. across i = 1,…,n. Note that, by contrast to Chapter 4.7, the error terms are not Gaussian and are conditionally heteroscedastic, since Var(ε_i | X_i) = (1/12) X_i2² (the unconditional variance is given by Var(ε_i) = 4/9).

Moreover, by contrast to Chapter 4.7, we here do not need to sample new realizations of Y_1,…,Y_n conditionally on a given data matrix X. (The finite sample inference results in Chapter 4.7 are conditional on X; however, the large sample results are unconditional results.) Therefore, this option is now removed from the myDataGenerator() R-function.

## Function to generate artificial data
myDataGenerator <- function(n, beta){
  X    <- cbind(rep(1, n), runif(n, -4, 4),
                           runif(n, -5, 5))
  eps  <- runif(n, -.5 * abs(X[,2]), +.5 * abs(X[,2]))
  Y    <- X %*% beta + eps
  data <- data.frame("Y"   = Y,
                     "X_1" = X[,1],
                     "X_2" = X[,2],
                     "X_3" = X[,3])
  return(data)
}

5.4.1 Normally Distributed β̂_k for n → ∞

The above data generating process fulfills our regularity assumptions (Assumptions 1*, 2, 3*, and 4). So, by theory, the estimators β̂_k should be normally distributed for large sample sizes n, unconditionally on X and even for heteroscedastic error terms:
√n (β̂_k − β_k) →d N(0, [Σ_X'X^{-1} E(ε_i² X_i X_i') Σ_X'X^{-1}]_kk),
or, equivalently,
β̂_k is approximately N(β_k, n^{-1} [Σ_X'X^{-1} E(ε_i² X_i X_i') Σ_X'X^{-1}]_kk) distributed.

For our above specified data generating process, we have from the assumed distributions of X_i2 and X_i3 (we used that (i) E(X²) = Var(X) if X has mean zero and (ii) that the variance of uniformly distributed random variables is (1/12)(b − a)²):
Σ_X'X = E(X_i X_i') = diag(1, E(X_i2²), E(X_i3²)) = diag(1, 16/3, 25/3).
Moreover, E(ε_i² X_i X_i') = E(X_i X_i' E(ε_i² | X_i)) = E(X_i X_i' (1/12) X_i2²), such that
E(ε_i² X_i X_i') = diag(E(ε_i²), E(ε_i² X_i2²), E(ε_i² X_i3²)) = diag(4/9, 64/15, 100/27).

Let's check the distribution of β̂_2 by means of a Monte Carlo simulation for the case of a largish sample size of n = 100.

set.seed(123)
n         <- 100      # a largish sample size
beta_true <- c(2,3,4) # true parameter vector
## Mean and variance of the true asymptotic
## normal distribution of beta_hat_2:
# true mean
beta_true_2 <- beta_true[2]
# true variance
var_true_beta_2 <- (solve(diag(c(1, 16/3, 25/3))) %*%
                      diag(c(4/9, 64/15, 100/27)) %*%
                    solve(diag(c(1, 16/3, 25/3))))[2,2]/n
## Let's generate 5000 realizations from beta_hat_2
## and check whether their distribution is
## close to the true normal distribution
rep        <- 5000 # MC replications
beta_hat_2 <- rep(NA, times=rep)
##
for(r in 1:rep){
  MC_data       <- myDataGenerator(n    = n,
                                   beta = beta_true)
  lm_obj        <- lm(Y ~ X_2 + X_3, data = MC_data)
  beta_hat_2[r] <- coef(lm_obj)[2]
}
## Compare:
## True beta_2 versus average of beta_hat_2 estimates
beta_true_2
#> [1] 3
round(mean(beta_hat_2), 3)
#> [1] 3
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates
round(var_true_beta_2, 5)
#> [1] 0.0015
round(var(beta_hat_2), 5)
#> [1] 0.00147
## True normal distribution of beta_hat_2 versus
## empirical density of beta_hat_2 estimates
library(scales)
curve(expr = dnorm(x, mean = beta_true_2,
                   sd = sqrt(var_true_beta_2)),
      xlab = "", ylab = "", col = gray(.2), lwd = 3, lty = 1,
      xlim = range(beta_hat_2), ylim = c(0, 14.1),
      main = paste0("n=", n))
lines(density(beta_hat_2, bw = bw.SJ(beta_hat_2)),
      col = alpha("blue", .5), lwd = 3)
legend("topleft", lty = c(1,1), lwd = c(3,3),
       col = c(gray(.2), alpha("blue", .5)), bty = "n", legend = c(expression(
  "Theoretical (Asymptotic) Gaussian Density of"~hat(beta)[2]), expression(
  "Empirical Density Estimation based on MC realizations from"~hat(beta)[2])))
[Figure (n = 100): theoretical (asymptotic) Gaussian density of β̂_2 together with the empirical density estimate based on the 5000 Monte Carlo realizations of β̂_2.]
Great! The nonparametric density estimate (computed via density()) from the simulated realizations of β̂_2 indicates that β̂_2 is really normally distributed, as described by our theoretical result in Theorem 5.2.3 (homoscedastic case) and in Equation (5.1) (heteroscedastic case).
However, is the asymptotic distribution of β̂_2 also usable for (very) small samples like n = 5? Let's check that:
set.seed(123)
n         <- 5        # a small sample size
beta_true <- c(2,3,4) # true parameter vector
## Mean and variance of the true asymptotic
## normal distribution of beta_hat_2:
# true mean
beta_true_2 <- beta_true[2]
# true variance
var_true_beta_2 <- (solve(diag(c(1, 16/3, 25/3))) %*%
                      diag(c(4/9, 64/15, 100/27)) %*%
                    solve(diag(c(1, 16/3, 25/3))))[2,2]/n
## Let's generate 5000 realizations from beta_hat_2
## and check whether their distribution is
## close to the true normal distribution
rep        <- 5000 # MC replications
beta_hat_2 <- rep(NA, times=rep)
##
for(r in 1:rep){
  MC_data       <- myDataGenerator(n    = n,
                                   beta = beta_true)
  lm_obj        <- lm(Y ~ X_2 + X_3, data = MC_data)
  beta_hat_2[r] <- coef(lm_obj)[2]
}
## Compare:
## True beta_2 versus average of beta_hat_2 estimates
beta_true_2
#> [1] 3
round(mean(beta_hat_2), 3)
#> [1] 2.996
## True variance of beta_hat_2 versus
## empirical variance of beta_hat_2 estimates
round(var_true_beta_2, 5)
#> [1] 0.03
round(var(beta_hat_2), 5)
#> [1] 0.05621
## True normal distribution of beta_hat_2 versus
## empirical density of beta_hat_2 estimates
library(scales)
curve(expr = dnorm(x, mean = beta_true_2,
                   sd = sqrt(var_true_beta_2)),
      xlab = "", ylab = "", col = gray(.2), lwd = 3, lty = 1,
      xlim = c(2, 4), ylim = c(0, 3),
      main = paste0("n=", n))
lines(density(beta_hat_2, bw = bw.SJ(beta_hat_2)),
      col = alpha("blue", .5), lwd = 3)
legend("topleft", lty = c(1,1), lwd = c(3,3),
       col = c(gray(.2), alpha("blue", .5)), bty = "n", legend = c(expression(
  "Theoretical (Asymptotic) Gaussian Density of"~hat(beta)[2]), expression(
  "Empirical Density Estimation based on MC realizations from"~hat(beta)[2])))
[Figure (n = 5): theoretical (asymptotic) Gaussian density of β̂_2 together with the empirical density estimate based on the 5000 Monte Carlo realizations of β̂_2.]
Not good. The actual distribution has substantially fatter tails. That is, if we used the quantiles of the asymptotic distribution, we would falsely reject the null hypothesis too often (the probability of type I errors would be larger than the significance level). But asymptotics kick in pretty fast here: things become much more reliable already for n = 10.
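The claim about n = 10 is easy to check by re-running the same Monte Carlo exercise with n = 10. The following condensed sketch is my own addition; it reuses myDataGenerator() from above and compares the empirical variance of β̂_2 with its asymptotic variance.

## Quick check for n = 10: same Monte Carlo design as above.
set.seed(123)
n         <- 10
beta_true <- c(2, 3, 4)
var_true_beta_2 <- (solve(diag(c(1, 16/3, 25/3))) %*%
                      diag(c(4/9, 64/15, 100/27)) %*%
                    solve(diag(c(1, 16/3, 25/3))))[2, 2] / n
beta_hat_2 <- replicate(5000, {
  MC_data <- myDataGenerator(n = n, beta = beta_true)
  coef(lm(Y ~ X_2 + X_3, data = MC_data))[2]
})
c(mean    = mean(beta_hat_2),      # should be close to the true value 3
  emp_var = var(beta_hat_2),       # compare with ...
  asy_var = var_true_beta_2)       # ... the asymptotic variance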
5.4.2 Testing Multiple and Single Parameters
In the following, we do inference about multiple parameters. We test
H0: β_2 = 3 and β_3 = 5
versus
HA: β_2 ≠ 3 and/or β_3 ≠ 5.

Or equivalently
H0: Rβ − r = 0    versus    HA: Rβ − r ≠ 0,
where
R = [ 0 1 0 ; 0 0 1 ]   and   r = (3, 5)'.
The following R code can be used to test this hypothesis:
suppressMessages(library(car)) # for linearHypothesis()
# ?linearHypothesis
library(sandwich)
## Generate data
MC_data <- myDataGenerator(n    = 100,
                           beta = beta_true)
## Estimate the linear regression model parameters
lm_obj      <- lm(Y ~ X_2 + X_3, data = MC_data)
vcovHC3_mat <- sandwich::vcovHC(lm_obj, type = "HC3")
## Option 1:
car::linearHypothesis(model             = lm_obj,
                      hypothesis.matrix = c("X_2=3", "X_3=5"),
                      vcov              = vcovHC3_mat)
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
#> X_3 = 5
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#>1 99
#> 2     97  2 1150.4 < 2.2e-16 ***
#>
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
## Option 2:
R <- rbind(c(0,1,0),
           c(0,0,1))
car::linearHypothesis(model             = lm_obj,
                      hypothesis.matrix = R,
                      rhs               = c(3,5),
                      vcov              = vcovHC3_mat)
#> Linear hypothesis test
#>
#> Hypothesis:
#> X_2 = 3
#> X_3 = 5
#>
#> Model 1: restricted model
#> Model 2: Y ~ X_2 + X_3
#>
#> Note: Coefficient covariance matrix supplied.
#>
#> Res.Df Df F Pr(>F)
#>1 99
#> 2     97  2 1150.4 < 2.2e-16 ***
#>
#> Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1
The p-value is very small and allows us to reject the (false) null-hypothesis at any of the usual significance levels.
Next, we do inference about a single parameter. We test
H0: β_3 = 5
versus
HA: β_3 ≠ 5.
# Load libraries
library(lmtest)   # for coeftest()
library(sandwich) # for vcovHC()
## Generate data
n       <- 100
MC_data <- myDataGenerator(n    = n,
                           beta = beta_true)
## Estimate the linear regression model parameters
lm_obj  <- lm(Y ~ X_2 + X_3, data = MC_data)
## Robust t test
## Robust standard error for hat{beta}_3:
SE_rob  <- sqrt(vcovHC(lm_obj, type = "HC3")[3,3])
## hypothetical (H0) value of beta_3:
beta_3_H0  <- 5
## estimate for beta_3:
beta_3_hat <- coef(lm_obj)[3]
## robust t-test statistic
t_test_stat <- (beta_3_hat - beta_3_H0)/SE_rob
## p-value
K <- length(coef(lm_obj))
##
p_value <- 2 * min(    pt(q = t_test_stat, df = n - K),
                   1 - pt(q = t_test_stat, df = n - K))
p_value
#> [1] 1.548442e-65
Again, the p-value is very small and allows us to reject the (false) null hypothesis at any of the usual significance levels.
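Chapter 5.3 gives the corresponding heteroscedasticity robust confidence interval. The following short sketch is my own addition; it simply reuses beta_3_hat, SE_rob, n and K from the code chunk above to compute the 95% interval for β_3.

## 95% heteroscedasticity robust confidence interval for beta_3,
## reusing beta_3_hat, SE_rob, n and K from the robust t-test above.
alpha <- 0.05
CI_beta_3 <- c(lower = unname(beta_3_hat) - qt(1 - alpha/2, df = n - K) * SE_rob,
               upper = unname(beta_3_hat) + qt(1 - alpha/2, df = n - K) * SE_rob)
CI_beta_3   # should cover the true value 4 of the simulated DGP and exclude 5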
Chapter 6
Maximum Likelihood
6.1 Likelihood Principle
The basic idea behind maximum likelihood estimation is very simple: find the distribution parameters for which it is most likely that the distribution has generated the data we actually observed. We must therefore be very specific about the process that generated the data. This is a trade-off: by imposing a fair amount of structure on the data, we get in return a very desirable estimator. The question always remains, however, whether we have made the right decision about the specific distribution/density function.
6.2 Properties of Maximum Likelihood Estimators
Why do we like maximum likelihood as an estimation method? The answer is that a maximum likelihood estimator θ̂_n of some parameter θ ∈ R is
Consistent: plim_{n→∞}(θ̂_n) = θ
Asymptotically Normally Distributed: √n (θ̂_n − θ) →a N(0, σ²)
Asymptotically Efficient: For any other consistent estimator θ̃_n with asymptotic variance σ̃², we have σ² ≤ σ̃².
Thus, if we have the right assumptions, maximum likelihood estimators are very appealing.
Simple Example: Coin Flipping (Bernoulli Trial) In the following
we will consider the simple example of coin flips. Call the probability that
we get a head θ = P(Coin = HEAD), which implies that the probability that we get a tail is 1 − θ = P(Coin = TAIL). We don't know the probability θ, and our goal is to estimate it using an i.i.d. sample of size n, i.e. (X_1,…,X_n) i.i.d. with X_i ~ B(θ) for all i = 1,…,n. A given realization of this random sample is then (x_1,…,x_i,…,x_n) and consists of x_i = 1 values for head and x_i = 0 for tail, i = 1,…,n. In total we have 0 ≤ h ≤ n many heads and 0 ≤ n − h ≤ n many tails.
6.3 The (Log-)Likelihood Function
How do we combine information from n observations to estimate θ?
If we assume that all of the observations are drawn from the same distribution and are independent, then the joint probability of observing h heads and n − h tails in the n coin flips that we actually observed, given θ, is:
L(θ) = θ^h (1 − θ)^{n−h} = ∏_{i=1}^n θ^{x_i} (1 − θ)^{1−x_i},
where x_i = 1 for HEAD in the ith coin flip and x_i = 0 for TAIL in the ith coin flip. The function L is called the likelihood function.
In general, when the observations are identically and independently distributed (i.i.d.):
L(θ) = ∏_{i=1}^n f(x_i | θ),
where f(x_i | θ) is the density function of the random variable X_i evaluated at the realization X_i = x_i; the parameter θ denotes the density function parameter(s).
Our goal is to choose a value for θ such that the value of the likelihood function is at a maximum, i.e. we choose the value of the parameter(s) that maximizes the probability, or better the likelihood, of observing the data that we actually observed. That is,
θ̂ = arg max_θ L(θ)
defines the maximum likelihood (ML) parameter estimator θ̂.
Usually it is easier to work with sums rather than products, so we apply a monotonic transformation (taking ln) to the likelihood, which yields the log-likelihood function:
ℓ(θ) = ln L(θ) = Σ_{i=1}^n ln f(x_i | θ).
In our coin flipping example:
ℓ(θ) = Σ_{i=1}^n ( x_i ln(θ) + (1 − x_i) ln(1 − θ) ).
In this case, we can analytically solve for the value of θ that maximizes the log-likelihood (and hence also the likelihood):
dℓ(θ)/dθ = Σ_{i=1}^n ( x_i (1/θ) − (1 − x_i) 1/(1 − θ) ) = h/θ − (n − h)/(1 − θ).

Setting the above expression to zero and solving gives us our ML estimator (MLE):
θ̂_ML = h/n.
6.4 Optimization: Non-Analytical Solutions
Usually we are not so fortunate as to have a closed-form analytical solution for the MLE and must rely on the computer to find the maximizing arguments of the (log-)likelihood function. Various methods exist for finding the maximum (or minimum) of a function. General idea: (i) start at some value for parameters in the parameter space (i.e., in the space of all possible parameter- values), (ii) search across that space until values of parameters are found that yield a derivative of the log likelihood that is zero (or arbitrarily close to zero).
6.4.1 Newton-Raphson Optimization
One of the most-used methods for optimization is the Newton-Raphson method (or a variant of it). The Newton-Raphson method relies on Taylor- series approximations of the likelihood.
First-order and second-order Taylor-series approximations of a function value f(θ + h) use the function value f(θ) and derivatives of f evaluated at θ:
First-order:  f(θ + h) ≈ f(θ) + f'(θ) h
Second-order: f(θ + h) ≈ f(θ) + f'(θ) h + (1/2) f''(θ) h²
for small values of h.
Aim: Suppose we want to find the value of θ that maximizes a twice-differentiable function f(θ).
Implementation idea: The Taylor-series approximation then gives
f(θ + h) ≈ f(θ) + f'(θ) h + (1/2) f''(θ) h²,
which implies
∂f(θ + h)/∂h ≈ f'(θ) + f''(θ) h.
Therefore, the first-order condition for the value of h that maximizes the Taylor-series expansion f(θ) + f'(θ)h + (1/2)f''(θ)h² is
0 = f'(θ) + f''(θ) h,
giving
h = − f'(θ) / f''(θ).
That is, in order to increase the value of f(θ), one shall substitute θ by
θ + h = θ − f'(θ) / f''(θ).

The Newton-Raphson optimization algorithm uses this insight as follows. We first must provide a starting value s for θ_0 = s and, second, decide on some (small) convergence criterion t, e.g. t = 10^(-10), for the first derivative. Then the Newton-Raphson optimization algorithm is given by:

let θ_0 = s
let i = 0
while |f'(θ_i)| > t do
    let i = i + 1
    let θ_i = θ_{i−1} − f'(θ_{i−1}) / f''(θ_{i−1})
let θ̂ = θ_i
return(θ̂)
Repetition i    θ_i     ℓ'(θ_i)         ℓ'(θ_i)/ℓ''(θ_i)
0               0.40    −4.16           2.40·10^(-1)
1               0.16    1.48            −3.32·10^(-2)
2               0.19    2.15·10^(-1)    −6.55·10^(-3)
3               0.19    5.43·10^(-3)    −1.73·10^(-4)
4               0.19    3.53·10^(-6)    −1.13·10^(-7)
5               0.20    1.50·10^(-12)   −4.81·10^(-14)

Table 6.1: Result of applying the Newton-Raphson optimization algorithm to our coin flipping example for given data h = 1 with sample size n = 5.
Note. For problems that are globally concave, the starting value s doesn't matter. For more complex problems, however, the Newton-Raphson algorithm can get stuck in a local maximum. In such cases, it is usually a good idea to try multiple starting values.
Newton-Raphson Algorithm: Example. Let's return to our earlier coin-flipping example, with only one head h = 1 for a sample size of n = 5. We already know that θ̂_ML = h/n = 1/5 = 0.2, but let's apply the Newton-Raphson algorithm. Recall that
dℓ(θ)/dθ   = h/θ − (n − h)/(1 − θ)
d²ℓ(θ)/dθ² = −h/θ² − (n − h)/(1 − θ)².
We have h = 1 and n = 5. Choosing t = 10^(-10) as our convergence criterion and θ_0 = 0.4 as the starting value allows us to run the algorithm, which gives us the results shown in Table 6.1.
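The iteration in Table 6.1 can be reproduced with a few lines of R. The following sketch is my own implementation of the algorithm above (the function name newton_coin() is made up); it uses the analytic first and second derivatives of the coin-flipping log-likelihood.

## Newton-Raphson for the coin-flipping log-likelihood, h = 1, n = 5.
newton_coin <- function(h, n, theta0 = 0.4, tol = 1e-10, max_iter = 100) {
  dl  <- function(theta)  h / theta   - (n - h) / (1 - theta)     # l'(theta)
  ddl <- function(theta) -h / theta^2 - (n - h) / (1 - theta)^2   # l''(theta)
  theta <- theta0
  for (i in seq_len(max_iter)) {
    if (abs(dl(theta)) <= tol) break
    theta <- theta - dl(theta) / ddl(theta)                       # Newton step
  }
  theta
}
newton_coin(h = 1, n = 5)   # converges to the analytic MLE h/n = 0.2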
6.5 OLS-Estimation as ML-Estimation
Now let's return to our linear model
Y = Xβ + ε.
To apply ML-estimation, we must make a distributional assumption about ε such as, for instance:
ε ~ N(0, σ² I_n).
That's Assumption 4* from Chapter 4; we could also have chosen another distributional assumption for ε here, but we would need to specify it correctly. That is, we are imposing here much more structure on the data than needed with the OLS estimator (in the context of large sample inference).
The multivariate density for ε = (ε_1,…,ε_n)' is then
f(ε) = (2π σ²)^(-n/2) exp( −(1/(2σ²)) ε'ε ).
Noting that ε = Y − Xβ, we get the log-likelihood
ℓ(β, σ²) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) (Y − Xβ)'(Y − Xβ)
with K unknown parameters β = (β_1,…,β_K)' and σ² (scalar). Taking derivatives gives
∂ℓ/∂β  = −(1/σ²) (−X'Y + X'Xβ)
∂ℓ/∂σ² = −n/(2σ²) + (1/(2(σ²)²)) (Y − Xβ)'(Y − Xβ).
So, we have K + 1 equations and K + 1 unknowns. Setting them equal to zero and solving gives
β̂_ML = (X'X)^(-1) X'Y
s²_ML = (1/n) (Y − Xβ̂_ML)'(Y − Xβ̂_ML) = (1/n) Σ_{i=1}^n ε̂_i².
Thus, the MLE of the linear model, β̂_ML, is the same as the OLS estimator, β̂. Moreover, since the ML estimator β̂_ML is here equivalent to the OLS estimator (same formula, same mean, same variance), we can use the whole inference machinery (t-test, F-test, confidence intervals) from the last chapters.
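The equivalence can also be checked numerically: maximizing the Gaussian log-likelihood with a general-purpose optimizer reproduces the OLS coefficients. The following sketch is my own addition (the toy data and object names are made up), not part of the original script.

## Numerical ML for the normal linear model versus lm().
set.seed(1)
n <- 200
X <- cbind(1, runif(n, -2, 2))
y <- drop(X %*% c(1, 2)) + rnorm(n)
negloglik <- function(par) {                  # par = (beta_1, beta_2, log(sigma))
  beta  <- par[1:2]
  sigma <- exp(par[3])                        # ensures sigma > 0
  -sum(dnorm(y, mean = drop(X %*% beta), sd = sigma, log = TRUE))
}
ml_fit  <- optim(c(0, 0, 0), negloglik, method = "BFGS")
ols_fit <- lm(y ~ X - 1)
rbind(ML = ml_fit$par[1:2], OLS = coef(ols_fit))   # coincide up to numerical error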
As it is needed for the next chapter, we also give here the second derivatives of the log-likelihood function ℓ, as well as the expressions of minus one times the mean of the second derivatives of ℓ:

∂²ℓ/∂β∂β' = −(1/σ²) (X'X)
(−1) E[ ∂²ℓ/∂β∂β' ] = (1/σ²) E(X'X)

∂²ℓ/∂(σ²)² = n/(2(σ²)²) − (1/(σ²)³) (Y − Xβ)'(Y − Xβ) = n/(2σ⁴) − (Σ_{i=1}^n ε_i²)/σ⁶
(−1) E[ ∂²ℓ/∂(σ²)² ] = −n/(2σ⁴) + E[Σ_{i=1}^n ε_i²]/σ⁶ = −n/(2σ⁴) + nσ²/σ⁶ = n/(2σ⁴)

∂²ℓ/∂σ²∂β = −(1/σ⁴) X'(Y − Xβ) = −(1/σ⁴) X'ε
(−1) E[ ∂²ℓ/∂σ²∂β ] = (1/σ⁴) E(X'ε) = (1/σ⁴) E[E(X'ε|X)] = (1/σ⁴) E[X'E(ε|X)] = 0
6.6 Variance of the ML-Estimators β̂_ML and s²_ML
The variance of an MLE is given by the inverse of the Fisher information matrix. The Fisher information matrix is defined as minus one times the expected value matrix of the second derivatives of the log-likelihood function. For the OLS case, the Fisher information matrix is
I(β, σ²) = [ (1/σ²) E(X'X)      0
             0                  n/(2σ⁴) ]
         = [ (n/σ²) Σ_X'X       0
             0                  n/(2σ⁴) ],
where we used that E(X'X) = E(Σ_{i=1}^n X_i X_i') = n E(X_i X_i') = n Σ_X'X. The upper left element of the Fisher information matrix is easily shown, but the derivation of the lower right element is rather tedious.¹ So, taking the inverse of the Fisher information matrix gives the variance-covariance matrix of the estimators β̂_ML and s²_ML:
Var( β̂_ML, s²_ML ) = [ (σ²/n) Σ_X'X^(-1)    0
                        0                    2σ⁴/n ].
Given this result, it is easy to see that Var(β̂_ML) → 0 and Var(s²_ML) → 0 as n → ∞.

¹ See https://www.statlect.com/fundamentals-of-statistics/linear-regression-maximum-likelihood for more details.
6.7 Consistency of β̂_ML and s²_ML
If E[ε|X] = 0 (strict exogeneity, which follows from our Assumptions 1 and 2), then the bias of β̂_ML is zero, since E[β̂_ML] = β:
E[β̂_ML] = E[(X'X)^(-1) X'(Xβ + ε)]
        = E[ E[(X'X)^(-1) X'(Xβ + ε) | X] ]
        = E[ E[(X'X)^(-1) X'Xβ | X] ] + E[ E[(X'X)^(-1) X'ε | X] ]
        = E[ E[β | X] ] + E[ (X'X)^(-1) X' E[ε | X] ]
        = β + E[ (X'X)^(-1) X' · 0 ]
        = β,
so that
E[β̂_ML] − β = Bias(β̂_ML) = 0.
Of course, from this it also follows that the squared bias is equal to zero, Bias²(β̂_ML) = 0. This implies that the mean square error (MSE) of the ML estimator β̂_ML equals the variance of the ML estimator β̂_ML:
MSE(β̂_ML) = E[(β̂_ML − β)²] = Var(β̂_ML) → 0 as n → ∞,
where MSE(β̂_ML) = Var(β̂_ML) since β̂_ML is unbiased.
Since convergence in mean square implies convergence in probability, we have established that the ML-estimator β̂_ML is a (weakly) consistent estimator of β:
β̂_ML →p β   as n → ∞.
Moreover, one can also show that s²_ML is a biased but asymptotically unbiased estimator, that is, Bias²(s²_ML) → 0 as n → ∞. Together with the result that Var(s²_ML) → 0 as n → ∞, we have that
MSE(s²_ML) = E[(s²_ML − σ²)²] = Bias²(s²_ML) + Var(s²_ML) → 0   as n → ∞.
Again, since convergence in mean square implies convergence in probability, we have established that the ML-estimator s²_ML is a (weakly) consistent estimator of σ²:
s²_ML →p σ²   as n → ∞.
In practice, however, one usually works with the unbiased (and consistent)
alternative s²_UB = (1/(n−K)) Σ_{i=1}^n ε̂_i², even though one can show that MSE(s²_ML) < MSE(s²_UB) for sufficiently large n.

6.8 Asymptotic Theory of Maximum-Likelihood Estimators

So far, we only considered consistency of the ML-estimators. In the following, we consider the asymptotic distribution of ML-estimators. We only consider the simplest situation: Assume an i.i.d. sample X_1,…,X_n, and suppose that the distribution of X_i possesses a density f(x|θ). The true parameter θ ∈ R is unknown. (Example: density of an exponential distribution, f(x|θ) = θ exp(−θx).)

Likelihood function:
L_n(θ) = ∏_{i=1}^n f(X_i|θ)
Log-likelihood function:
ℓ_n(θ) = ln L_n(θ) = Σ_{i=1}^n ln f(X_i|θ)

The maximum-likelihood estimator θ̂_n maximizes ℓ_n(θ). It can generally be shown that θ̂_n is a consistent estimator of θ. Derivation of the asymptotic distribution relies on a Taylor expansion (around θ) of the derivative ℓ'_n(θ̂_n) of the log-likelihood function. By the so-called Mean Value Theorem, we then know that for some θ̃_n between θ̂_n and θ we have
ℓ'_n(θ̂_n) = ℓ'_n(θ) + ℓ''_n(θ̃_n)(θ̂_n − θ)   (Mean Value Theorem).
Since θ̂_n maximizes the log-likelihood function, it follows that ℓ'_n(θ̂_n) = 0. This implies (since 0 = ℓ'_n(θ̂_n) = ℓ'_n(θ) + ℓ''_n(θ̃_n)(θ̂_n − θ)) that
ℓ'_n(θ) = −ℓ''_n(θ̃_n)(θ̂_n − θ).   (6.1)

Note that necessarily ∫ f(x|θ) dx = 1 for all possible values of the true parameter θ. Therefore, ∫ (∂/∂θ) f(x|θ) dx = 0 and ∫ (∂²/∂θ²) f(x|θ) dx = 0.

Using this, we now show that the average
(1/n) ℓ'_n(θ) = (1/n) Σ_{i=1}^n (∂/∂θ) ln f(X_i|θ)
is asymptotically normal. For the mean of (1/n) Σ_{i=1}^n (∂/∂θ) ln f(X_i|θ) one gets:
E( (1/n) ℓ'_n(θ) ) = E( (∂/∂θ) ln f(X_i|θ) ) = ∫ [ (∂/∂θ) f(x|θ) / f(x|θ) ] f(x|θ) dx = 0.
For the variance of (1/n) Σ_{i=1}^n (∂/∂θ) ln f(X_i|θ) one gets:
Var( (1/n) ℓ'_n(θ) ) = (1/n²) Σ_{i=1}^n Var( (∂/∂θ) ln f(X_i|θ) ) = (1/n) E[ ( (∂/∂θ) f(X_i|θ) / f(X_i|θ) )² ] = (1/n) J(θ),
where J(θ) := E[ ( (∂/∂θ) f(X_i|θ) / f(X_i|θ) )² ].
Moreover, the average (1/n) Σ_{i=1}^n (∂/∂θ) ln f(X_i|θ) is taken over i.i.d. random variables ln f(X_i|θ). So, we can apply the Lindeberg-Levy central limit theorem, from which it follows that
ℓ'_n(θ) / √(n J(θ)) →d N(0, 1).
Thus (using (6.1)) we also have
−ℓ''_n(θ̃_n)(θ̂_n − θ) / √(n J(θ)) →d N(0, 1).   (6.2)

Further analysis requires us to study the term ℓ''_n(θ̃_n). We begin this with studying the mean of
(1/n) ℓ''_n(θ) = (1/n) Σ_{i=1}^n (∂²/∂θ²) ln f(X_i|θ)
              = (1/n) Σ_{i=1}^n (∂/∂θ) [ (∂/∂θ) f(X_i|θ) / f(X_i|θ) ]
              = (1/n) Σ_{i=1}^n [ (∂²/∂θ²) f(X_i|θ) · f(X_i|θ) − ( (∂/∂θ) f(X_i|θ) )² ] / ( f(X_i|θ) )².
Taking the mean of (1/n) ℓ''_n(θ) yields:
E( (1/n) ℓ''_n(θ) ) = E[ (∂²/∂θ²) f(X_i|θ) / f(X_i|θ) − ( (∂/∂θ) f(X_i|θ) / f(X_i|θ) )² ] = 0 − E[ ( (∂/∂θ) f(X_i|θ) / f(X_i|θ) )² ] = −J(θ).
Taking the variance of (1/n) ℓ''_n(θ) yields:
Var( (1/n) ℓ''_n(θ) ) = (1/n²) · n · Var( (∂²/∂θ²) ln f(X_i|θ) ) = (1/n) · (some fixed, deterministic number) → 0 as n → ∞.
So, (1/n) ℓ''_n(θ) is an unbiased estimator for −J(θ), and thus
E[ ( (1/n) ℓ''_n(θ) − (−J(θ)) )² ] = Var( (1/n) ℓ''_n(θ) ) → 0 as n → ∞.
That is, (1/n) ℓ''_n(θ) is mean square consistent,
(1/n) ℓ''_n(θ) →m.s. −J(θ)   as n → ∞,
which implies that (1/n) ℓ''_n(θ) is also (weakly) consistent,
(1/n) ℓ''_n(θ) →p −J(θ)   as n → ∞,
since mean square convergence implies convergence in probability.

Remember: We wanted to study ℓ''_n(θ̃_n) in (6.2), not (1/n) ℓ''_n(θ), but we are actually close now. We know that the ML estimator θ̂_n is (weakly) consistent, i.e., θ̂_n →p θ. From this it follows that also θ̃_n →p θ, since θ̃_n is a value between θ̂_n and θ (Mean Value Theorem). Therefore, we have that also
(1/n) ℓ''_n(θ̃_n) →p −J(θ)   as n → ∞.
Multiplying by −1/√J(θ) yields
−(1/n) ℓ''_n(θ̃_n) / √J(θ) →p J(θ)/√J(θ) = √J(θ).
Rewriting the quotient in (6.2) a little bit yields
−ℓ''_n(θ̃_n)(θ̂_n − θ) / √(n J(θ)) = [ −(1/n) ℓ''_n(θ̃_n) / √J(θ) ] · √n (θ̂_n − θ).
Thus we can conclude (using (6.2) together with the consistency result just derived) that
√(n J(θ)) (θ̂_n − θ) →d N(0, 1),
or equivalently that (θ̂_n − θ) is approximately N(0, 1/(n J(θ))) distributed, with
n J(θ) = −E( ℓ''_n(θ) ) = I(θ),
which is the asymptotic normality result we aimed at, where I(θ) is called the Fisher information.

The above arguments can easily be generalized to multidimensional parameter vectors θ ∈ R^K. In this case, J(θ) becomes a K × K matrix, and (θ̂_n − θ) is approximately N_K(0, (1/n) J(θ)^(-1)) distributed, where n J(θ) = −E( ℓ''_n(θ) ) = I(θ) is called the Fisher information matrix.

Chapter 7

Instrumental Variables Regression

Regression models may suffer from problems like omitted variables, measurement errors and simultaneous causality. If so, the error term ε_i is correlated with the regressor, X_ik say, and the corresponding coefficient of interest, β_k, is estimated inconsistently. If one is lucky, one can add, for instance, the omitted variables to the regression to mitigate the risk of biased estimations (omitted variables bias). However, if omitted variables cannot be measured or are not available for other reasons, multiple regression cannot solve the problem. The same issue arises if there is simultaneous causality. When causality runs from X to Y and vice versa (e.g. if Y = demanded quantity of a good and X = price of this good), there will be an estimation bias that cannot be corrected for by multiple regression.

A general technique for obtaining a consistent estimator of the coefficient of interest is instrumental variables (IV) regression. In this chapter we focus on the IV regression tool called two-stage least squares (TSLS). The first sections briefly recap the general mechanics and assumptions of IV regression and show how to perform TSLS estimation using R. Next, IV regression is used for estimating the elasticity of the demand for cigarettes, a classical example where multiple regression fails to do the job because of simultaneous causality.

7.1 The IV Estimator with a Single Regressor and a Single Instrument

Consider the simple regression model
Y_i = β_1 + β_2 X_i + ε_i,   i = 1,…,n,   (7.1)
where the error term ε_i is correlated with the regressor X_i (X is then called endogenous). In this case Assumption 2 is violated, that is, strict exogeneity and orthogonality between X_i and ε_i do not hold. Therefore, OLS estimation (also maximum likelihood and method of moments estimation) is inconsistent for the true β_2. In the most simple case, IV regression uses a single instrumental variable Z_i to obtain a consistent estimator for β_2.

Conditions for valid instruments: Z_i must satisfy two conditions to be a valid instrument:
1. Instrument relevance condition: X_i and its instrument Z_i must be correlated: ρ_{Z,X} ≠ 0.
2. Instrument exogeneity condition: E(ε_i | Z_i) = 0. As a consequence, the instrument Z_i must not be correlated with the error term ε_i: ρ_{Z,ε} = 0.

7.1.1 The Two-Stage Least Squares Estimator

As can be guessed from its name, TSLS proceeds in two stages. In the first stage, the variation in the endogenous regressor X_i is decomposed into a problem-free component that is explained by the (exogenous) instrument Z_i and a problematic component that is correlated with the error ε_i. The second stage uses the problem-free component of the variation in X_i to estimate β_2.
The first stage regression model is
X_i = π_0 + π_1 Z_i + ν_i,
where π_0 + π_1 Z_i is the component of X_i that is explained by Z_i, while ν_i is the component (an error term) that cannot be explained by Z_i and exhibits correlation with ε_i.

Using the OLS estimates π̂_0 and π̂_1, we obtain predicted values X̂_i = π̂_0 + π̂_1 Z_i, i = 1,…,n. If Z_i is a valid instrument, the X̂_i are problem-free in the sense that X̂_i is exogenous in a regression of Y_i on X̂_i, which is done in the second stage regression. The second stage produces β̂_1^TSLS and β̂_2^TSLS, the TSLS estimates of β_1 and β_2.

For the case of a single instrument, one can show that the TSLS estimator of β_2 is
β̂_2^TSLS = s_ZY / s_ZX = [ (1/(n−1)) Σ_{i=1}^n (Y_i − Ȳ)(Z_i − Z̄) ] / [ (1/(n−1)) Σ_{i=1}^n (X_i − X̄)(Z_i − Z̄) ],   (7.2)
which is nothing but the ratio of the sample covariance between Z_i and Y_i to the sample covariance between Z_i and X_i.

The estimator in (7.2) is a consistent estimator for β_2 in (7.1) under the assumption that Z_i is a valid instrument. The CLT implies that the distribution of β̂_2^TSLS can be approximated by a normal distribution if the sample size n is large. This allows us to use t-statistics, F-statistics, confidence intervals, etc.

7.1.2 Application: Demand For Cigarettes (1/2)

The relation between the demand for and the price of commodities is a simple yet widespread problem in economics. Health economics is concerned with the study of how health-affecting behavior of individuals is influenced by the health-care system and regulation policy. Probably the most prominent example in public policy debates is smoking, as it is related to many illnesses and negative externalities.

It is plausible that cigarette consumption can be reduced by taxing cigarettes more heavily. The question is by how much taxes must be increased to reach a certain reduction in cigarette consumption. Economists use elasticities to answer this kind of question. Since the price elasticity for the demand of cigarettes is unknown, it must be estimated. A simple OLS regression of log quantity on log price cannot be used to estimate the effect of interest since there is simultaneous causality between demand and supply. Instead, IV regression can be used.

We use the data set CigarettesSW which comes with the package AER. It is a panel data set that contains observations on cigarette consumption and several economic indicators for all 48 continental federal states of the U.S. from 1985 to 1995. In the following, however, we consider data for the cross section of states in 1995 only; that is, we transform the panel data to a cross-sectional data set. We start by loading the package and attaching the data set. An overview about summary statistics for each of the variables is returned by summary(CigarettesSW). Use ?CigarettesSW for a detailed description of the variables.
# load the data set and get an overview
library(AER)
data(CigarettesSW)
summary(CigarettesSW)
#>     state      year         cpi          population      
#>  AL     : 2   1985:48   Min.   :1.076   Min.   :  478447  
#>  AR     : 2   1995:48   1st Qu.:1.076   1st Qu.: 1622606  
#>  AZ     : 2             Median :1.300   Median : 3697472  
#>  CA     : 2             Mean   :1.300   Mean   : 5168866  
#>  CO     : 2             3rd Qu.:1.524   3rd Qu.: 5901500  
#>  CT     : 2             Max.   :1.524   Max.   :31493524  
#>  (Other):84                                               
#>      packs            income               tax       
#>  Min.   : 49.27   Min.   :  6887097   Min.   :18.00  
#>  1st Qu.: 92.45   1st Qu.: 25520384   1st Qu.:31.00  
#>  Median :110.16   Median : 61661644   Median :37.00  
#>  Mean   :109.18   Mean   : 99878736   Mean   :42.68  
#>  3rd Qu.:123.52   3rd Qu.:127313964   3rd Qu.:50.88  
#>  Max.   :197.99   Max.   :771470144   Max.   :99.00  
#>      price             taxs       
#>  Min.   : 84.97   Min.   : 21.27  
#>  1st Qu.:102.71   1st Qu.: 34.77  
#>  Median :137.72   Median : 41.05  
#>  Mean   :143.45   Mean   : 48.33  
#>  3rd Qu.:176.15   3rd Qu.: 59.48  
#>  Max.   :240.85   Max.   :112.63  
We are interested in estimating β_2 in
log(Q_i^cigarettes) = β_1 + β_2 log(P_i^cigarettes) + ε_i,   (7.3)
where Q_i^cigarettes is the number of cigarette packs per capita sold and P_i^cigarettes is the after-tax average real price per pack of cigarettes in state i = 1,…,n = 48. The instrumental variable we are going to use for instrumenting the endogenous regressor log(P_i^cigarettes) is SalesTax, the portion of taxes on cigarettes arising from the general sales tax. SalesTax is measured in dollars per pack. The idea is that:
1. SalesTax is a relevant instrument as it is included in the after-tax average price per pack.
2. Also, it is plausible that SalesTax is exogenous since the sales tax does not influence quantity sold directly but indirectly through the price.
In the following, we perform some transformations in order to obtain
deflated cross section data for the year 1995. We also compute the sample
correlation between the sales tax and price per pack. The sample correlation
is a consistent estimator of the population correlation. The estimate of
approximately 0.614 indicates that SalesTax and P_i^cigarettes exhibit positive correlation, which meets our expectations: higher sales taxes lead to higher prices. However, a correlation analysis like this is not sufficient for checking whether the instrument is relevant. We will later come back to the issue of checking whether an instrument is relevant and exogenous; see Chapter 7.3.
# compute real per capita prices
CigarettesSW$rprice   <- with(CigarettesSW, price / cpi)
# compute the sales tax
CigarettesSW$salestax <- with(CigarettesSW, (taxs - tax) / cpi)
# check the correlation between sales tax and price
cor(CigarettesSW$salestax, CigarettesSW$price)
#> [1] 0.6141228
# generate a subset for the year 1995
c1995 <- subset(CigarettesSW, year == "1995")

The first stage regression is
log(P_i^cigarettes) = π_0 + π_1 SalesTax_i + ν_i.
We estimate this model in R using lm(). In the second stage we run a regression of log(Q_i^cigarettes) on the fitted values of log(P_i^cigarettes) to obtain β̂_1^TSLS and β̂_2^TSLS.

# perform the first stage regression
cig_s1 <- lm(log(rprice) ~ salestax, data = c1995)
coeftest(cig_s1, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>              Estimate Std. Error  t value  Pr(>|t|)    
#> (Intercept) 4.6165463  0.0289177 159.6444 < 2.2e-16 ***
#> salestax    0.0307289  0.0048354   6.3549 8.489e-08 ***
#>
#> Signif. codes:0 *** 0.001 ** 0.01 * 0.05 . 0.1
The estimated first stage regression is
log(P_i^cigarettes)^ = 4.62 + 0.031 SalesTax_i,
                      (0.03)  (0.005)
which predicts the relation between the sales tax and the price per pack of cigarettes to be positive. How much of the observed variation in log(P_i^cigarettes) is explained by the instrument SalesTax? This can be answered by looking at the regression's R², which states that about 47% of the variation in after-tax prices is explained by the variation of the sales tax across states.

# inspect the R^2 of the first stage regression
summary(cig_s1)$r.squared
#> [1] 0.4709961

We next store the fitted values of log(P_i^cigarettes) obtained by the first stage regression cig_s1 in the variable lcigp_pred.

# store the predicted values
lcigp_pred <- cig_s1$fitted.values

Next, we run the second stage regression which gives us the TSLS estimates we seek.

# run the stage 2 regression
cig_s2 <- lm(log(c1995$packs) ~ lcigp_pred)
coeftest(cig_s2, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)   9.7199     1.5971  6.0859 2.153e-07 ***
#> lcigp_pred   -1.0836     0.3337 -3.2472  0.002178 ** 
#>
#> Signif. codes:0 *** 0.001 ** 0.01 * 0.05 . 0.1
Thus estimating the model (7.3) using TSLS yields
log(Q_i^cigarettes)^ = 9.72 − 1.08 log(P_i^cigarettes),   (7.4)
                      (1.60)  (0.33)
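Since there is a single regressor and a single instrument here, the slope estimate in (7.4) can also be obtained directly from the covariance ratio in (7.2). The following one-line check is my own addition; it reuses c1995 from above.

## beta_2^TSLS as a ratio of sample covariances, cf. equation (7.2)
cov(log(c1995$packs),  c1995$salestax) /
  cov(log(c1995$rprice), c1995$salestax)   # should match the slope of about -1.08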
The function ivreg() from the package AER carries out the TSLS procedure automatically. It is used similarly to lm(). Instruments can be added to the usual specification of the regression formula using a vertical bar separating the model equation from the instruments. Thus, for the regression at hand the correct formula is log(packs) ~ log(rprice) | salestax.
# perform TSLS using ivreg()
cig_ivreg <- ivreg(log(packs) ~ log(rprice) | salestax,
                   data = c1995)
coeftest(cig_ivreg, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>             Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  9.71988    1.52832  6.3598 8.346e-08 ***
#> log(rprice) -1.08359    0.31892 -3.3977  0.001411 ** 
#>
#> Signif. codes:0 *** 0.001 ** 0.01 * 0.05 . 0.1
We find that the coefficient estimates coincide for both approaches.
Two Notes on the Computation of TSLS Standard Errors:
1. We have demonstrated that running the individual regressions for each stage of TSLS using lm() leads to the same coecient estimates as when using ivreg(). However, the standard errors reported for the second-stage regression, e.g., by coeftest() or summary(), are invalid: neither adjusts for using predictions from the first-stage regression as regressors in the second-stage regression. Fortunately, ivreg() performs the necessary adjustment automatically. This is another advantage over manual step-by-step estimation which we have done above for demonstrating the mechanics of the procedure.
2. Just like in multiple regression it is important to compute heteroskedasticity-robust standard errors as we have done above using vcovHC().
The TSLS estimate for β_2 in (7.4) suggests that an increase in cigarette prices by one percent reduces cigarette consumption by roughly 1.08 percentage points, which is fairly elastic. However, we should keep in mind that this estimate might not be trustworthy even though we used IV estimation: there still might be a bias due to omitted variables. Thus a multiple IV regression approach is needed to reduce the risk of omitted variable biases.
7.2 The General IV Regression Model
The simple IV regression model is easily extended to a multiple regression model which we refer to as the general IV regression model. In this model we distinguish between four types of variables: the dependent variable, included exogenous variables, included endogenous variables and instrumental variables:
Y_i = β_1 + β_2 X_2i + … + β_K X_Ki + β_{K+1} W_1i + … + β_{K+r} W_ri + ε_i,   (7.5)
with i = 1,…,n, is the general instrumental variables regression model, where
Yi is the dependent variable
β_1,…,β_{K+r} are K + r unknown regression coefficients
X_2i,…,X_Ki are K − 1 endogenous regressors
W_1i,…,W_ri are r exogenous regressors which are uncorrelated with ε_i
ε_i is the error term
Z1i, . . . , Zmi are m instrumental variables
The coefficients are overidentified if m > K − 1. If m < K − 1, the coefficients are underidentified, and when m = K − 1 they are exactly identified. For estimation of the IV regression model we require exact identification or overidentification.

Estimating regression models with TSLS using multiple instruments by means of ivreg() is straightforward. There are, however, some subtleties in correctly specifying the regression formula. Assume that you want to estimate the model
Y_i = β_1 + β_2 X_2i + β_3 X_3i + W_1i + ε_i,
where X_2i and X_3i are endogenous regressors that shall be instrumented by Z_1i, Z_2i and Z_3i, and where W_1i is an exogenous regressor. Say the corresponding data is available in a data.frame with column names y, x2, x3, w1, z1, z2 and z3. It might be tempting to specify the argument formula in your call of ivreg() as y ~ x2 + x3 + w1 | z1 + z2 + z3, which is, however, wrong. As explained in the documentation of ivreg() (see ?ivreg), it is necessary to list all exogenous variables as instruments too, that is, joining them by +'s on the right of the vertical bar: y ~ x2 + x3 + w1 | w1 + z1 + z2 + z3, where w1 is instrumenting itself.

Similarly to the simple IV regression model, the general IV model (7.5) can be estimated using the two-stage least squares estimator:

First-stage regression(s): Run an OLS regression for each of the endogenous variables (X_2i,…,X_Ki) on all instrumental variables (Z_1i,…,Z_mi), all exogenous variables (W_1i,…,W_ri) and an intercept. Compute the fitted values (X̂_2i,…,X̂_Ki).

Second-stage regression: Regress the dependent variable on the predicted values of all endogenous regressors, all exogenous variables and an intercept using OLS. This gives β̂_1^TSLS,…,β̂_{K+r}^TSLS, the TSLS estimates of the model coefficients.
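As a minimal sketch of the formula interface described above (my own addition; the data frame dat is a made-up simulated example using the hypothetical column names y, x2, x3, w1, z1, z2, z3 from the text):

## Sketch: correct ivreg() specification with two endogenous regressors (x2, x3),
## one exogenous regressor (w1) and three instruments (z1, z2, z3).
library(AER)
set.seed(1)
n  <- 500
z1 <- rnorm(n); z2 <- rnorm(n); z3 <- rnorm(n); w1 <- rnorm(n)
u  <- rnorm(n)                        # common shock creating endogeneity
x2 <- z1 + z2 + u + rnorm(n)
x3 <- z3 + u + rnorm(n)
y  <- 1 + 0.5 * x2 - 0.5 * x3 + w1 + u + rnorm(n)
dat <- data.frame(y, x2, x3, w1, z1, z2, z3)
## the exogenous regressor w1 must also be listed as an instrument (for itself):
iv_fit <- ivreg(y ~ x2 + x3 + w1 | w1 + z1 + z2 + z3, data = dat)
coef(iv_fit)                          # approximately recovers (1, 0.5, -0.5, 1)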
In the general IV regression model, the instrument relevance and instrument exogeneity assumptions are equivalent to the case of the simple regression model with a single endogenous regressor and only one instrument. That is, for Z_1i,…,Z_mi to be a set of valid instruments, the following two conditions must be fulfilled:

1. Instrument Relevance: If there are K − 1 endogenous variables, r exogenous variables and m ≥ K − 1 instruments, and X̂*_2i,…,X̂*_Ki are the predicted values from the K − 1 population first stage regressions, it must hold that
(X̂*_2i,…,X̂*_Ki, W_1i,…,W_ri, 1)
are not perfectly multicollinear, where 1 denotes the constant regressor (intercept) which equals 1 for all observations. Explanation: let's say there is only one endogenous regressor X_i. If all the instruments Z_1i,…,Z_mi are irrelevant, all the X̂*_i are just the mean of X, such that there is perfect multicollinearity with the constant intercept 1.

2. Instrument Exogeneity: E(ε_i | Z_1i,…,Z_mi) = 0. Consequently, all m instruments must be uncorrelated with the error term:
ρ_{Z_1,ε} = 0, …, ρ_{Z_m,ε} = 0.

One can show that if the IV regression assumptions hold, the TSLS estimator in (7.5) is consistent and normally distributed when the sample size n is large. That is, if we have valid instruments, we obtain valid statistical inference using t-tests, F-tests and confidence intervals for the model coefficients.

7.2.1 Application: Demand for Cigarettes (2/2)

The estimated elasticity of the demand for cigarettes in (7.4) is −1.08. Although (7.4) was estimated using IV regression, it is plausible that this IV estimate is biased. The TSLS estimator is inconsistent for the true β_2 if the instrument (here: the real sales tax per pack) is invalid, i.e., if the instrument correlates with the error term. This is likely to be the case here since there are economic factors, like state income, which impact the demand for cigarettes and correlate with the sales tax. States with high personal income tend to generate tax revenues by income taxes and less by sales taxes. Consequently, state income should be included in the regression model:
log(Q_i^cigarettes) = β_1 + β_2 log(P_i^cigarettes) + β_3 log(income_i) + ε_i.   (7.6)

Before estimating (7.6) using ivreg(), we define income as real per capita income rincome and append it to the data set CigarettesSW.

# add real income to the dataset (cpi: consumer price index)
CigarettesSW$rincome <- with(CigarettesSW,
                             income / population / cpi)
c1995 <- subset(CigarettesSW, year == "1995")
# estimate the model
cig_ivreg2 <- ivreg(log(packs) ~ log(rprice) + log(rincome) |
                    log(rincome) + salestax, data = c1995)
coeftest(cig_ivreg2, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)   9.43066    1.25939  7.4883 1.935e-09 ***
#> log(rprice)  -1.14338    0.37230 -3.0711  0.003611 ** 
#> log(rincome)  0.21452    0.31175  0.6881  0.494917    
#>
We obtain
cigarettes cigarettes
log(Qi ) = 9.42 = 1.14 log(Pi ) + 0.21 log(incomei). (7.7)
(1.26) (0.37) (0.31)
In the following we add the cigarette-specific taxes (cigtaxi) as a further
instrumental variable and estimate again using TSLS.
1
# add cigtax to the data set
CigarettesSW$cigtax <- with(CigarettesSW, tax/cpi) c1995 <- subset(CigarettesSW, year == “1995”) # estimate the modelcig_ivreg3 <- ivreg(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax + cigtax,data = c1995)coeftest(cig_ivreg3, vcov = vcovHC, type = “HC1”) #>
#> t test of coefficients:
#>
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)   9.89496    0.95922 10.3157 1.947e-13 ***
#> log(rprice)  -1.27742    0.24961 -5.1177 6.211e-06 ***
#> log(rincome)  0.28040    0.25389  1.1044    0.2753    
#>
#> Signif. codes:0 *** 0.001 ** 0.01 * 0.05 . 0.1
Using the two instruments salestax_i and cigtax_i, we have m = 2 for one endogenous regressor, so the coefficient on the endogenous regressor log(P_i^cigarettes) is overidentified. The TSLS estimate of (7.6) is
log(Q_i^cigarettes)^ = 9.89 − 1.28 log(P_i^cigarettes) + 0.28 log(income_i).   (7.8)
                      (0.96)  (0.25)                     (0.25)
Should we trust the estimates presented in (7.7) or rather rely on (7.8)? The estimates obtained using both instruments are more precise since in (7.8) all standard errors reported are smaller than in (7.7). In fact, the standard error for the estimate of the demand elasticity is only two thirds of the standard error when the sales tax is the only instrument used. This is due to more information being used in estimation (7.8). If the instruments are valid, (7.8) can be considered more reliable.
However, without insights regarding the validity of the instruments it is not sensible to make such a statement. This stresses why checking instrument validity is essential. Chapter 7.3 briefly discusses guidelines in checking instrument validity and presents approaches that allow to test for instrument relevance and exogeneity under certain conditions. These are then used in an application to the demand for cigarettes in Chapter 7.4.
7.3 Checking Instrument Validity

7.3.1 Instrument Relevance
Instruments that explain little variation in the endogenous regressor X_i are called weak instruments. Weak instruments provide little information about the variation in X_i that is exploited by IV regression to estimate the effect of interest: the coefficient on the endogenous regressor is estimated inaccurately. Moreover, weak instruments cause the distribution of the estimator to deviate considerably from a normal distribution even in large samples, such that the usual methods for obtaining inference about the true coefficient on X_i may produce wrong results.
A Rule of Thumb for Checking for Weak Instruments: Consider the case of a single endogenous regressor X_i and m instruments Z_1i,…,Z_mi. If the coefficients on all instruments in the population first-stage regression of a TSLS estimation are zero, the instruments do not explain any of the variation in X_i, which clearly violates the assumption that instruments must be relevant. Although the latter case is unlikely to be encountered in practice, we should ask ourselves to what extent the assumption of instrument relevance should be fulfilled. While this is hard to answer for general IV regression, in the case of a single endogenous regressor X_i one may use the following rule of thumb: compute the F-statistic which corresponds to the hypothesis that the coefficients on Z_1i,…,Z_mi are all zero in the first-stage regression. If the F-statistic is less than 10, the instruments are weak, such that the TSLS estimate of the coefficient on X_i is probably biased and no valid statistical inference about its true value can be made.
This rule of thumb is easily implemented in R: run the first-stage regression using lm() and subsequently compute the heteroskedasticity-robust F-statistic by means of linearHypothesis(). This is part of the application to the demand for cigarettes discussed in Chapter 7.4.
If Instruments are Weak
There are several ways to proceed if instruments are weak:
1. Discard the weak instruments and/or find stronger instruments. While the former is only an option if the unknown coefficients remain identified when the weak instruments are discarded, the latter can be very difficult and may even require a redesign of the whole study.
2. Stick with the weak instruments but use methods that improve upon TSLS in this scenario, for example limited information maximum likelihood estimation. (Out of the scope of this course.)
3. Use tests that allow for inferences robust to weak instruments:
Anderson-Rubin test
7.3.2 Instrument Validity
If there is correlation between an instrument and the error term, IV regression is not consistent. The overidentifying restrictions test (also called the J-test) is an approach to test the hypothesis that additional instruments are exogenous. For the J-test to be applicable, there need to be more instruments than endogenous regressors.
The J-Statistic (or Sargan-Hansen test). Take ε̂_i^TSLS, i = 1,…,n, the residuals of the TSLS estimation of the general IV regression model (7.5). Run the OLS regression
ε̂_i^TSLS = δ_0 + δ_1 Z_1i + … + δ_m Z_mi + δ_{m+1} W_1i + … + δ_{m+r} W_ri + e_i   (7.9)
and test the joint hypothesis
H0: δ_1 = … = δ_m = 0,
which states that all instruments are exogenous. This can be done using the corresponding F-statistic by computing
J = mF.
This test is the overidentifying restrictions test and the statistic is called the J-statistic, with
J →d χ²(m − (K − 1)) under H0 as n → ∞
under the assumption of homoskedasticity. The degrees of freedom m − (K − 1) state the degree of overidentification, since this is the number of instruments m minus the number of endogenous regressors K − 1.

It is important to note that the J-statistic is only χ²(m − (K − 1)) distributed when the error term ε_i in the regression (7.9) is homoskedastic. A discussion of the heteroskedasticity-robust J-statistic is beyond the scope of this chapter. The application in the next section shows how to apply the J-test using linearHypothesis().
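As a sketch of the mechanics (my own toy example with simulated data and made-up object names; the next section applies the J-test to the cigarette data), one could proceed along the following lines for an overidentified TSLS fit with m = 2 instruments and one endogenous regressor.

## Sketch of the J-test mechanics on toy simulated data (illustration only).
library(AER)   # ivreg()
library(car)   # linearHypothesis()
set.seed(1)
n  <- 500
z1 <- rnorm(n); z2 <- rnorm(n); w1 <- rnorm(n)
u  <- rnorm(n)
x  <- z1 + z2 + u + rnorm(n)                 # endogenous regressor
y  <- 1 + 0.5 * x + w1 + u + rnorm(n)
iv_fit <- ivreg(y ~ x + w1 | w1 + z1 + z2)
## auxiliary regression (7.9): TSLS residuals on instruments and exogenous regressors
aux    <- lm(resid(iv_fit) ~ z1 + z2 + w1)
J_test <- linearHypothesis(aux, c("z1 = 0", "z2 = 0"), test = "Chisq")
J_stat <- J_test[2, "Chisq"]                 # chi-square form of the statistic, i.e. J = mF
## p-value with the correct df = m - (K - 1) = 2 - 1 = 1:
pchisq(J_stat, df = 1, lower.tail = FALSE)   # large here, since the toy instruments are valid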
7.4 Application to the Demand for Cigarettes
Are the general sales tax and the cigarette-specific tax valid instruments? If not, TSLS is not helpful to estimate the demand elasticity for cigarettes discussed in Chapter 7.2. As discussed in Chapter 7.1, both variables are likely to be relevant, but whether they are exogenous is a different question.
One can argue that cigarette-specific taxes could be endogenous because there might be state-specific historical factors, like the economic importance of the tobacco farming and cigarette production industry, that lobby for low cigarette-specific taxes. Since it is plausible that tobacco growing states have higher rates of smoking than others, this would lead to endogeneity of cigarette-specific taxes. If we had data on the size of the tobacco and cigarette industry, we could solve this potential issue by including the information in the regression. Unfortunately, this is not the case.
However, since the role of the tobacco and cigarette industry is a factor that can be assumed to differ across states but not over time, we may exploit the panel structure of CigarettesSW. Alternatively, a (non-panel) regression using data on changes between two time periods eliminates such state specific and time invariant effects. Next, we consider such changes in variables between 1985 and 1995. That is, we are interested in estimating the long-run
elasticity of the demand for cigarettes.
The model to be estimated by TSLS using the general sales tax and the cigarette-specific sales tax as instruments hence is
log(Q_{i,1995}^cigarettes) − log(Q_{i,1985}^cigarettes) = β_1 + β_2 [log(P_{i,1995}^cigarettes) − log(P_{i,1985}^cigarettes)]
                                                          + β_3 [log(income_{i,1995}) − log(income_{i,1985})] + ε_i.   (7.10)
We first create differences from 1985 to 1995 for the dependent variable, the regressors and both instruments.
# subset data for year 1985
c1985 <- subset(CigarettesSW, year == "1985")
# define differences in variables
packsdiff    <- log(c1995$packs) - log(c1985$packs)
pricediff    <- log(c1995$price/c1995$cpi) -
                log(c1985$price/c1985$cpi)
incomediff   <- log(c1995$income/c1995$population/c1995$cpi) -
                log(c1985$income/c1985$population/c1985$cpi)
salestaxdiff <- (c1995$taxs - c1995$tax)/c1995$cpi -
                (c1985$taxs - c1985$tax)/c1985$cpi
cigtaxdiff   <- c1995$tax/c1995$cpi - c1985$tax/c1985$cpi

We now perform three different IV estimations of (7.10) using ivreg():

1. TSLS using only the difference in the sales taxes between 1985 and 1995 as the instrument.
2. TSLS using only the difference in the cigarette-specific sales taxes between 1985 and 1995 as the instrument.
3. TSLS using both the difference in the sales taxes and the difference in the cigarette-specific sales taxes between 1985 and 1995 as instruments.

# estimate the three models
cig_ivreg_diff1 <- ivreg(packsdiff ~ pricediff + incomediff |
                         incomediff + salestaxdiff)
cig_ivreg_diff2 <- ivreg(packsdiff ~ pricediff + incomediff |
                         incomediff + cigtaxdiff)
cig_ivreg_diff3 <- ivreg(packsdiff ~ pricediff + incomediff |
                         incomediff + salestaxdiff + cigtaxdiff)

As usual, we use coeftest() in conjunction with vcovHC() to obtain robust coefficient summaries for all models.

# robust coefficient summary for 1.
coeftest(cig_ivreg_diff1, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -0.117962   0.068217 -1.7292   0.09062 .  
#> pricediff   -0.938014   0.207502 -4.5205 4.454e-05 ***
#> incomediff   0.525970   0.339494  1.5493   0.12832    
#>
#> Signif. codes:0 *** 0.001 ** 0.01 * 0.05 . 0.1
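To make explicit what ivreg() does internally for model (1), here is a minimal two-stage sketch (the object names stage1, pricediff_hat and stage2 are ad hoc). The point estimates coincide with the TSLS estimates above; the second-stage standard errors, however, are not valid, which is why we rely on ivreg() together with robust covariance estimators.

# first stage: regress the endogenous price difference on the included
# exogenous regressor and the instrument
stage1 <- lm(pricediff ~ incomediff + salestaxdiff)
pricediff_hat <- fitted(stage1)

# second stage: replace pricediff by its first-stage fitted values
stage2 <- lm(packsdiff ~ pricediff_hat + incomediff)
coef(stage2)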
# robust coefficient summary for 2.
coeftest(cig_ivreg_diff2, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>              Estimate Std. Error t value  Pr(>|t|)
#> (Intercept) -0.017049   0.067217 -0.2536    0.8009
#> pricediff   -1.342515   0.228661 -5.8712 4.848e-07 ***
#> incomediff   0.428146   0.298718  1.4333    0.1587
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# robust coefficient summary for 3.
coeftest(cig_ivreg_diff3, vcov = vcovHC, type = "HC1")
#>
#> t test of coefficients:
#>
#>              Estimate Std. Error t value  Pr(>|t|)
#> (Intercept) -0.052003   0.062488 -0.8322    0.4097
#> pricediff   -1.202403   0.196943 -6.1053 2.178e-07 ***
#> incomediff   0.462030   0.309341  1.4936    0.1423
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
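Before tabulating the results, the three estimated price elasticities can be lined up directly; this is a small convenience sketch that only uses the model objects created above and is not part of the original workflow.

# quick side-by-side comparison of the coefficient on pricediff
sapply(list(cig_ivreg_diff1, cig_ivreg_diff2, cig_ivreg_diff3),
       function(m) coef(m)["pricediff"])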
library(stargazer)
# gather robust standard errors in a list
rob_se <- list(sqrt(diag(vcovHC(cig_ivreg_diff1, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff2, type = "HC1"))),
               sqrt(diag(vcovHC(cig_ivreg_diff3, type = "HC1"))))

We proceed by generating a tabulated summary of the estimation results using stargazer().

# generate table
stargazer(cig_ivreg_diff1, cig_ivreg_diff2, cig_ivreg_diff3,
          header = FALSE,
          type = "latex",
          omit.table.layout = "n",
          digits = 3,
          column.labels = c("IV: salestax", "IV: cigtax", "IVs: salestax, cigtax"),
          dep.var.labels.include = FALSE,
          dep.var.caption = "Dependent Variable: 1985-1995 Difference in Log per Pack Price",
          se = rob_se)

Table 7.1: TSLS Estimates of the Long-Term Elasticity of the Demand for Cigarettes using Panel Data

                               IV: salestax (1)    IV: cigtax (2)      IVs: salestax, cigtax (3)
pricediff                      -0.938*** (0.208)   -1.343*** (0.229)   -1.202*** (0.197)
incomediff                      0.526 (0.339)       0.428 (0.299)       0.462 (0.309)
Constant                       -0.118* (0.068)     -0.017 (0.067)      -0.052 (0.062)
Observations                    48                  48                  48
R2                              0.550               0.520               0.547
Adjusted R2                     0.530               0.498               0.526
Residual Std. Error (df = 45)   0.091               0.094               0.091

Dependent variable: 1985-1995 difference in log per pack price.

Table 7.1 reports negative estimates of the coefficient on pricediff that are quite different in magnitude. Which one should we trust? This hinges on the validity of the instruments used. To assess this, we compute F-statistics for the first-stage regressions of all three models to check instrument relevance.

# first-stage regressions
mod_relevance1 <- lm(pricediff ~ salestaxdiff + incomediff)
mod_relevance2 <- lm(pricediff ~ cigtaxdiff + incomediff)
mod_relevance3 <- lm(pricediff ~ incomediff + salestaxdiff + cigtaxdiff)

# check instrument relevance for model (1)
linearHypothesis(mod_relevance1,
                 "salestaxdiff = 0",
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> salestaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: pricediff ~ salestaxdiff + incomediff
#>
#> Note: Coefficient covariance matrix supplied.
#>
#>   Res.Df Df      F    Pr(>F)
#> 1     46
#> 2     45  1 28.445 3.009e-06 ***
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check instrument relevance for model (2)
linearHypothesis(mod_relevance2,
                 "cigtaxdiff = 0",
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> cigtaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: pricediff ~ cigtaxdiff + incomediff
#>
#> Note: Coefficient covariance matrix supplied.
#>
#>   Res.Df Df      F   Pr(>F)
#> 1     46
#> 2     45  1 98.034 7.09e-13 ***
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check instrument relevance for model (3)
linearHypothesis(mod_relevance3,
                 c("salestaxdiff = 0", "cigtaxdiff = 0"),
                 vcov = vcovHC, type = "HC1")
#> Linear hypothesis test
#>
#> Hypothesis:
#> salestaxdiff = 0
#> cigtaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: pricediff ~ incomediff + salestaxdiff + cigtaxdiff
#>
#> Note: Coefficient covariance matrix supplied.
#>
#>   Res.Df Df      F    Pr(>F)
#> 1     46
#> 2     44  2 76.916 4.339e-15 ***
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
All F-statistics are larger than 10, so the rule of thumb for detecting weak instruments suggests that the instruments are not weak.
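As a cross-check, the summary() method for ivreg objects from the AER package (which we assume provides ivreg() here) can report built-in IV diagnostics, including first-stage weak-instruments F-tests, a Wu-Hausman test and, for the overidentified model, a Sargan statistic. Note that these default diagnostics are not heteroskedasticity-robust, so they complement rather than replace the robust tests above.

# built-in IV diagnostics for the overidentified model (3)
summary(cig_ivreg_diff3, diagnostics = TRUE)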
Next, we also conduct the test of overidentifying restrictions for model three, which is the only model where the coefficient on the difference in log prices is overidentified ($m = 2$, $K - 1 = 1$), so that the J-statistic can be computed. To do this, we take the residuals stored in cig_ivreg_diff3 and regress them on both instruments and the presumably exogenous regressor incomediff. We again use linearHypothesis() to test whether the coefficients on both instruments are zero, which is necessary for the exogeneity assumption to be fulfilled. Note that with test = "Chisq" we obtain a chi-squared distributed test statistic instead of an F-statistic.
# compute the J-statistic
cig_iv_OR <- lm(residuals(cig_ivreg_diff3) ~ incomediff +
                  salestaxdiff + cigtaxdiff)

cig_OR_test <- linearHypothesis(cig_iv_OR,
                                c("salestaxdiff = 0", "cigtaxdiff = 0"),
                                test = "Chisq")
cig_OR_test
#> Linear hypothesis test
#>
#> Hypothesis:
#> salestaxdiff = 0
#> cigtaxdiff = 0
#>
#> Model 1: restricted model
#> Model 2: residuals(cig_ivreg_diff3) ~ incomediff + salestaxdiff + cigtaxdiff
#>
#>   Res.Df     RSS Df Sum of Sq Chisq Pr(>Chisq)
#> 1     46 0.37472
#> 2     44 0.33695  2  0.037769 4.932    0.08492 .
#>
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
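Since test = "Chisq" was used, the Chisq value reported above (4.932) is the J-statistic itself. As a small convenience, it can also be extracted from the test object by column name rather than by position:

# extract the J-statistic by column name
cig_OR_test[2, "Chisq"]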
Caution: In this case the p-value reported by linearHypothesis() is wrong because the degrees of freedom are set to 2. This differs from the degree of overidentification ($m - (K-1) = 2 - (2-1) = 1$), so the J-statistic is $\chi^2_1$ distributed instead of following a $\chi^2_2$ distribution as assumed by default by linearHypothesis(). We may compute the correct p-value using pchisq().
# compute correct p-value for J-statistic
pchisq(cig_OR_test[2, 5], df = 1, lower.tail = FALSE)
#> [1] 0.02636406
Since this value is smaller than 0.05, we reject the hypothesis that both instruments are exogenous at the 5% level. This means one of the following:
1. The sales tax is an invalid instrument for the per-pack price.
2. The cigarette-specific sales tax is an invalid instrument for the per-pack price.
3. Both instruments are invalid.
One can argue that the assumption of instrument exogeneity is more likely to hold for the general sales tax, so the IV estimate of the long-run elasticity of demand for cigarettes that we consider most trustworthy is $-0.94$, the TSLS estimate obtained using the general sales tax as the only instrument.
The interpretation of this estimate is that, over a 10-year period, an increase in the average price per pack by one percent is expected to decrease consumption by about 0.94 percent. This suggests that, in the long run, price increases can reduce cigarette consumption considerably.
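As a back-of-the-envelope illustration (a sketch using the rounded point estimate of $-0.94$ from Table 7.1), the implied long-run change in consumption for a 10% price increase under the log-log specification is:

# implied long-run consumption change for a 10% price increase
# (evaluates to roughly -0.086, i.e. a drop of about 8.6% in packs per capita)
exp(-0.94 * log(1.10)) - 1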

Bibliography
Cribari-Neto, F. (2004). Asymptotic inference under heteroskedasticity of unknown form. Computational Statistics & Data Analysis, 45(2):215–233.
Hayashi, F. (2000). Econometrics. Princeton University Press.
Lindsay, B. G. and Basak, P. (2000). Moments determine the tail of a distribution (but not much else). The American Statistician, 54(4):248–251.
Long, J. S. and Ervin, L. H. (2000). Using heteroscedasticity consistent standard errors in the linear regression model. The American Statistician, 54(3):217–224.
MacKinnon, J. G. and White, H. (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics, 29(3):305–325.
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817–838.