Advanced stats – LAB 1

ML estimation with PDF

Theoretical analysis

Question 1: maximum likelihood estimator?

For $n$ iid observations $x_i$ of the height of the river, the likelihood can be written as follows:

$$L(a; x_1, \ldots, x_n) = \prod_{i=1}^{n} f_H(x_i) = \prod_{i=1}^{n} \frac{x_i}{a}\, e^{-\frac{x_i^2}{2a}}$$

The log-likelihood can be derived from the likelihood as follows:

$$l(a; x_1, \ldots, x_n) = \log(L(a; x_1, \ldots, x_n)) = \sum_{i=1}^{n}\ln(x_i) - n\ln(a) - \frac{1}{2a}\sum_{i=1}^{n} x_i^2$$

$$\frac{\partial l}{\partial a}(a; x_1, \ldots, x_n) = -\frac{n}{a} + \frac{1}{2a^2}\sum_{i=1}^{n} x_i^2$$

Then one can derive the maximum likelihood estimator by setting the partial derivative to 0:

$$\frac{\partial l}{\partial a}(\hat{a}_n; x_1, \ldots, x_n) = 0 \iff \hat{a}_n = \frac{1}{2n}\sum_{i=1}^{n} x_i^2$$
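As a quick numerical sanity check (a minimal sketch using the 8 river-height observations introduced in the application section below), one can verify that this closed form matches the maximizer of the log-likelihood found numerically with optimize:

x = c(2.5, 1.8, 2.9, 0.9, 2.1, 1.7, 2.2, 2.8)

n = length(x)

a_mle = sum(x^2)/(2*n) # closed-form MLE

loglik = function(a) sum(log(x)) - n*log(a) - sum(x^2)/(2*a)

a_num = optimize(loglik, interval = c(0.01, 100), maximum = TRUE)$maximum

c(a_mle, a_num) # both ~2.42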

Question 2: method of moments estimator?

$$E[X] = \int_0^{+\infty} x f_H(x)\,dx = \int_0^{+\infty} \frac{x^2}{a}\, e^{-\frac{x^2}{2a}}\,dx = \int_0^{+\infty} e^{-\frac{x^2}{2a}}\,dx = \frac{\sqrt{2\pi a}}{2}$$

One can estimate the expectation using the arithmetic mean. Hence, the method of moments estimator $\tilde{a}_n$ satisfies:

$$\frac{1}{n}\sum_{i=1}^{n} x_i = \frac{\sqrt{2\pi \tilde{a}_n}}{2} \iff \tilde{a}_n = \frac{2}{\pi}\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)^2$$
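As an illustrative sketch (using the same 8 observations as in the application section below; the numeric value is only indicative), the method of moments estimate can be computed directly from the sample mean:

x = c(2.5, 1.8, 2.9, 0.9, 2.1, 1.7, 2.2, 2.8)

a_mom = 2*mean(x)^2/pi # solves mean(x) = sqrt(2*pi*a)/2

a_mom # ~2.84, to be compared with the MLE ~2.42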

Question 3: properties of $\hat{a}_n$?

a) Unbiased?

$$E[\hat{a}_n] = \frac{1}{2n}\sum_{i=1}^{n} E[X_i^2] = \frac{1}{2}E[X^2] = \frac{1}{2}\int_0^{+\infty} \frac{x^3}{a}\,e^{-\frac{x^2}{2a}}\,dx = \int_0^{+\infty} x\, e^{-\frac{x^2}{2a}}\,dx = a$$

$\hat{a}_n$ is unbiased.

b) Optimal?

Let us derive the variance of the estimator $\hat{a}_n$.

$$\mathrm{Var}[\hat{a}_n] = \frac{1}{4n^2}\sum_{i=1}^{n}\mathrm{Var}[X_i^2] = \frac{1}{4n}\left(E[X^4] - E[X^2]^2\right)$$

where, integrating by parts,

$$E[X^4] = \int_0^{+\infty}\frac{x^5}{a}\,e^{-\frac{x^2}{2a}}\,dx = 4\int_0^{+\infty} x^3\, e^{-\frac{x^2}{2a}}\,dx = 4a\,E[X^2]$$

We know from question a) that $E[X^2] = 2E[\hat{a}_n] = 2a$. Thus,

$$\mathrm{Var}[\hat{a}_n] = \frac{4a\,E[X^2] - E[X^2]^2}{4n} = \frac{8a^2 - 4a^2}{4n} = \frac{a^2}{n}$$

Let us now compute the Fisher information $I_n(a)$:

$$\frac{\partial^2 l}{\partial a^2}(a; x_1, \ldots, x_n) = \frac{n}{a^2} - \frac{1}{a^3}\sum_{i=1}^{n} x_i^2$$

$$I_n(a) = -E\left[\frac{\partial^2 l}{\partial a^2}(a; X_1, \ldots, X_n)\right] = -\frac{n}{a^2} + \frac{n\,E[X^2]}{a^3} = \frac{n}{a^2}$$

Hence,

$$\mathrm{Var}[\hat{a}_n] = \frac{a^2}{n} = \frac{1}{I_n(a)}$$

Its variance equals the Cramér–Rao lower bound and it is unbiased. Hence, $\hat{a}_n$ minimizes the mean squared error among unbiased estimators, so $\hat{a}_n$ is optimal.

c) Efficient?

$\hat{a}_n$ is unbiased and its variance attains the Cramér–Rao lower bound. Therefore, $\hat{a}_n$ is efficient.

d) Asymptotically Gaussian?

The maximum likelihood estimator is asymptotically Gaussian, hence so is $\hat{a}_n$. With $I(a) = \frac{I_n(a)}{n} = \frac{1}{a^2}$,

$$\sqrt{n}\,(\hat{a}_n - a) \xrightarrow[n \to \infty]{d} \mathcal{N}(0, a^2)$$

Application on real data

Question 1: p as a function of a?

Let $p$ be the probability that a disaster happens during one year:

$$p = 1 - F_H(6) = \int_6^{+\infty} \frac{x}{a}\, e^{-\frac{x^2}{2a}}\,dx = \left[-e^{-\frac{x^2}{2a}}\right]_6^{+\infty} = e^{-\frac{18}{a}}$$

Question 2: probability of at most one disaster?

During one thousand years, if at most one disaster happened, it means either there was no disaster, or there was only one. Let us derive $p_1$, the probability that at most one disaster happens during one thousand years:

$$p_1 = (1-p)^{999} = \left(1 - e^{-\frac{18}{a}}\right)^{999}$$

Question 3: estimation of the probability?

X = c(2.5, 1.8, 2.9, 0.9, 2.1, 1.7, 2.2, 2.8)

n = length(X)

a = sum(X^2)/(2*n)

p1 = (1-exp(-18/a))^999

From the set of 8 observations, one can estimate $\hat{a} = 2.42$. The probability $p_1$ can then be estimated as $p_1 \approx 0.557$.

Exercise 1: Rayleigh distribution

(a)

The parameter $a$ of the Rayleigh distribution was estimated with the maximum likelihood estimator $\hat{a}_n$. It was found that $\hat{a} \approx 2.42$.

(b)

One can generate samples following a Rayleigh distribution by using the Rayleigh distribution function implemented in R. One has to be careful that the scale parameter used in R corresponds to $\sqrt{a}$.

library(VGAM) # provides rrayleigh(n, scale); assumption: this is the package used in the lab

n = 100000

X = rrayleigh(n, scale=sqrt(a))

hist(X, nclass=50, xlim = c(0,10), main = "Rayleigh implemented in R")

[Figure: histogram "Rayleigh implemented in R", X from 0 to 10]

If one only has access to a uniform distribution and would like to output a Rayleigh distribution, one can use the inverse distribution function:

$$F(x) = \int_0^x \frac{t}{a}\, e^{-\frac{t^2}{2a}}\,dt = 1 - e^{-\frac{x^2}{2a}}$$

$$F(x) = u \iff x = \sqrt{-2a\ln(1-u)}$$

If $U$ follows a uniform distribution on $[0, 1]$, one can generate samples following the Rayleigh distribution using the uniform distribution (note that $1-U$ and $U$ have the same law):

$$U \sim \mathcal{U}[0,1] \implies \sqrt{-2a\ln(U)} \sim \mathrm{Rayleigh}(a)$$

n = 100000

U = runif(n)

X = sqrt(-2*a*log(U))

hist(X, nclass=50, xlim = c(0,10), main = "Simulated Rayleigh")

[Figure: histogram "Simulated Rayleigh", X from 0 to 10]

(c)

Empirically, one can verify that the MLE is unbiased by averaging $N$ samples of the MLE $\hat{a}_{n,1}, \ldots, \hat{a}_{n,N}$, for any value of $n$. To limit the computational cost, let us take $n = 10$ and average over $N = 100000$ samples of $n$ observations.

$$E[\hat{a}_n - a] \approx \frac{1}{N}\sum_{k=1}^{N}\left(\hat{a}_{n,k} - a\right) = \frac{1}{N}\sum_{k=1}^{N}\left(\frac{1}{2n}\sum_{i=1}^{n} x_{i,k}^2 - a\right)$$

N = 100000

n = 10

E = 0

for (k in 1:N){

  X = rrayleigh(n, scale=sqrt(a))

  E = E + 1/(2*n)*sum(X^2) - a

}

E = E/N

E

## [1] -0.002336315

$\frac{E[\hat{a}_n - a]}{a} \approx -9.7 \times 10^{-4} \ll 1$. Hence, empirically, the estimator is unbiased.

Empirically, one can verify the efficiency of the MLE by computing its variance and comparing it to the inverse of the Fisher information. Since the true mean of $\hat{a}_n$ is known to be $a$, averaging $(\hat{a}_{n,k} - a)^2$ gives an unbiased estimator of the variance.

$$\mathrm{Var}[\hat{a}_n] \approx \frac{1}{N}\sum_{k=1}^{N}\left(\hat{a}_{n,k} - a\right)^2 = \frac{1}{N}\sum_{k=1}^{N}\left(\frac{1}{2n}\sum_{i=1}^{n} x_{i,k}^2 - a\right)^2$$

N = 100000

n = 10

I = n/a^2 # Fisher information

V = 0

for (k in 1:N){

  X = rrayleigh(n, scale=sqrt(a))

  dV = 1/(2*n)*sum(X^2) - a

  V = V + dV^2

}

V = V/N

V

## [1] 0.5887242

$$\frac{1}{I_n(a)} = \frac{a^2}{n} \approx 0.5847, \qquad \frac{\mathrm{Var}[\hat{a}_n] - \frac{1}{I_n(a)}}{\frac{1}{I_n(a)}} \approx 0.00683 \ll 1$$

Hence, one can say that $\mathrm{Var}[\hat{a}_n] \approx \frac{1}{I_n(a)}$, and the estimator is empirically efficient.

$$\sqrt{n}\,(\hat{a}_n - a) \xrightarrow[n \to \infty]{d} \mathcal{N}(0, a^2)$$

The asymptotic normality means that for $n$ large, $\sqrt{n}\,(\hat{a}_n - a)$ approximately follows $\mathcal{N}(0, a^2)$. Thus one can plot several samples of the random variable $Z_n = \sqrt{n}\,(\hat{a}_n - a)$ and check whether the distribution looks Gaussian.

n = 10000 # Size of the observations for each a_n

N = 10000 # Number of samples of Z_n

Z_n = rep(0,N)

for (k in 1:N){

X = rrayleigh(n, scale=sqrt(a))

a_n = 1/(2*n)*sum(X^2)

Z_n[k] = sqrt(n)*(a_n - a)

}

hist(Z_n, breaks=40, main = "Asymptotic normality")

[Figure: histogram "Asymptotic normality", Z_n from −5 to 10]

The MLE is asymptotically normal. One can also verify that the standard deviation of $Z_n$ is close to $a = 2.418$.
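For instance, reusing the Z_n vector from the simulation above (a one-line sketch):

sd(Z_n) # should be close to a = 2.418, the asymptotic standard deviation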

ML estimation with PMF

Statistical modelling and theoretical analysis

Question 1: Belonging to the exponential family

We study the random variable $X$ that follows a geometric distribution with parameter $q \in\, ]0, 1[$. We have:

$$\forall k \in \mathbb{N}^*, \quad P(X = k) = q(1-q)^{k-1}$$

The likelihood function can be written as:

$$L(x, q) = \mathbb{1}_{\mathbb{N}^*}(x) \times q(1-q)^{x-1} = \mathbb{1}_{\mathbb{N}^*}(x) \times q \times e^{(x-1)\ln(1-q)} = \mathbb{1}_{\mathbb{N}^*}(x) \times \frac{q}{1-q} \times e^{x\ln(1-q)}$$

We can notice that the model is dominated and the distribution domain where $L(x, q) > 0$ is $\mathbb{N}^*$, which does not depend on $q$. Thus the distribution domain is homogeneous. We then define:

$$h(x) = \mathbb{1}_{\mathbb{N}^*}(x), \quad \varphi(q) = \frac{q}{1-q}, \quad Q(q) = \ln(1-q), \quad S(x) = x$$

We can then write the likelihood as:

$$L(x, q) = h(x)\,\varphi(q)\exp\left(Q(q)S(x)\right)$$

We can conclude that the geometric distribution belongs to the exponential family and that $S$ is a sufficient statistic. Since $S$ is linearly independent with itself, we also deduce that the model is identifiable.
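As a minimal numerical sketch (q = 0.3 and x = 5 are arbitrary test values), one can check that this factorization reproduces the PMF:

q = 0.3

x = 5

pmf = q*(1-q)^(x-1)

fact = (q/(1-q))*exp(log(1-q)*x) # h(x) = 1 for x in N*

c(pmf, fact) # identical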

Question 2: Computation of the Fisher Information Matrix

We saw in question 1 that the model is dominated and the distribution domain is homogeneous. We can also easily show that $L(x, q)$ is twice differentiable with respect to $q$ and integrable. Thus the model is regular. We note $l(x, q)$ the log-likelihood of the model:

$$\forall x \in \mathbb{N}^*, \quad l(x, q) = \ln\left(q(1-q)^{x-1}\right) = \ln(q) + (x-1)\ln(1-q)$$

We can now deduce the score function:

$$\forall x \in \mathbb{N}^*, \quad s_q(x) = \frac{\partial l(x, q)}{\partial q} = \frac{1}{q} - \frac{x-1}{1-q}$$

We can note that the score function is an affine transform of $X$; thus it is square-integrable because $X$ is, and the Fisher Information Matrix is well-defined. We showed previously that the model is regular, thus we have:

$$I(q) = E_q\left(s_q(X)^2\right) = -E_q\left(\frac{\partial^2}{\partial q^2}\, l(X, q)\right) = E_q\left(\frac{1}{q^2} + \frac{X-1}{(1-q)^2}\right)$$

Since $E(X) = \frac{1}{q}$,

$$I(q) = \frac{1}{q^2} + \frac{\frac{1}{q}-1}{(1-q)^2} = \frac{(1-q)^2}{q^2(1-q)^2} + \frac{q-q^2}{q^2(1-q)^2} = \frac{1}{q^2(1-q)}$$

Question 3: Maximum likelihood estimator

Let $X_1, \ldots, X_n$ be an $n$-sample following the same distribution as $X$. The likelihood of the model is:

$$L(x_1, \ldots, x_n, q) = \prod_{i=1}^{n} q(1-q)^{x_i - 1} = \left(\frac{q}{1-q}\right)^n (1-q)^{\sum_{i=1}^{n} x_i}$$

and the log-likelihood:

$$l(x_1, \ldots, x_n, q) = n\ln\left(\frac{q}{1-q}\right) + \ln(1-q)\sum_{i=1}^{n} x_i$$

We look for $\hat{q}_n$ that maximizes the likelihood of the $n$-sample. Thus it satisfies two conditions:

$$\frac{\partial l}{\partial q}(x_1, \ldots, x_n, \hat{q}_n) = 0 \quad (1)$$

$$\frac{\partial^2 l}{\partial q^2}(x_1, \ldots, x_n, \hat{q}_n) < 0 \quad (2)$$

From equation (1) we deduce:

$$\frac{n}{\hat{q}_n} + \frac{n}{1-\hat{q}_n} - \frac{\sum_{i=1}^{n} x_i}{1-\hat{q}_n} = 0 \iff \frac{n}{\hat{q}_n(1-\hat{q}_n)} = \frac{\sum_{i=1}^{n} x_i}{1-\hat{q}_n} \iff n = \hat{q}_n \sum_{i=1}^{n} x_i$$

The maximum likelihood estimator $\hat{q}_n$ is therefore:

$$\hat{q}_n = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{X}_n}$$
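As a numerical sanity check (a sketch on an arbitrary simulated sample), one can verify that 1/mean(x) coincides with the numerical maximizer of the log-likelihood:

set.seed(1)

x = rgeom(100, 0.3) + 1 # simulated geometric sample, values in {1, 2, ...}

n = length(x)

q_mle = 1/mean(x)

loglik = function(q) n*log(q) + (sum(x)-n)*log(1-q)

q_num = optimize(loglik, interval = c(1e-6, 1-1e-6), maximum = TRUE)$maximum

c(q_mle, q_num) # should agree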

Question 4: Asymptotic behavior of the estimator

Now we study the asymptotic behavior of the estimator. According to the Central Limit Theorem, we have:

$$\sqrt{n}\left(\bar{X}_n - \frac{1}{q}\right) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0, V(X)\right), \quad \text{where } V(X) = \frac{1-q}{q^2}$$

Then, we use the delta method. We define $g : x \mapsto \frac{1}{x}$, which is differentiable at $\frac{1}{q}$, with $g'\left(\frac{1}{q}\right) = -q^2$. We have:

$$\sqrt{n}\left(g(\bar{X}_n) - g\left(\frac{1}{q}\right)\right) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0,\; g'\left(\frac{1}{q}\right)^2 \frac{1-q}{q^2}\right)$$

$$\iff \sqrt{n}\,(\hat{q}_n - q) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0, (1-q)q^2\right) \iff \sqrt{n}\,(\hat{q}_n - q) \xrightarrow{\mathcal{L}} \mathcal{N}\left(0, I^{-1}(q)\right)$$

This estimator is asymptotically normal and its asymptotic variance equals the Cramér–Rao bound; thus the estimator is asymptotically efficient. This was expected, because it is a maximum likelihood estimator.

Question 5: Asymptotic confidence interval

Finally, we build an asymptotic confidence interval for $q$. We set:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Then, we know that $\frac{\sqrt{n}\left(\bar{X}_n - \frac{1}{q}\right)}{S_n} \sim t(n-1)$ asymptotically, where $t$ is the Student distribution. Since this distribution is symmetric, we can write:

$$-t^{n-1}_{\alpha/2} \le \frac{\sqrt{n}\left(\bar{X}_n - \frac{1}{q}\right)}{S_n} \le t^{n-1}_{\alpha/2}$$

where $t^{k}_{\alpha/2}$ is the unique real number that verifies $P\left(t(k) \le t^{k}_{\alpha/2}\right) = 1 - \frac{\alpha}{2}$. Finally, we have:

$$\bar{X}_n - \frac{S_n}{\sqrt{n}}\,t^{n-1}_{\alpha/2} \le \frac{1}{q} \le \bar{X}_n + \frac{S_n}{\sqrt{n}}\,t^{n-1}_{\alpha/2}$$

$$\iff \frac{1}{\bar{X}_n + \frac{S_n}{\sqrt{n}}\,t^{n-1}_{\alpha/2}} \le q \le \frac{1}{\bar{X}_n - \frac{S_n}{\sqrt{n}}\,t^{n-1}_{\alpha/2}}$$
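This interval can be wrapped in a small helper function (a sketch; geom_ci is a hypothetical name, and the Student quantile is computed exactly with qt instead of the tabulated 2.021 used below):

geom_ci = function(x, alpha = 0.05){

  n = length(x)

  Xn = mean(x)

  Sn = sd(x) # sample standard deviation with the n-1 denominator, as in the code below

  t_q = qt(1 - alpha/2, df = n-1) # Student quantile

  c(lower = 1/(Xn + Sn*t_q/sqrt(n)), upper = 1/(Xn - Sn*t_q/sqrt(n)))

}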

Application on real data

Question 1: Estimation of the fraud probability

X = c(44, 9, 11, 59, 81, 19, 89, 10, 24, 07, 21, 90, 38, 01, 15, 22, 29, 19, 37, 26, 219, 2, 57, 11, 34

n = length(X)

p = n/sum(X)

The fraud probability $p_{fraud}$ can be estimated with the estimator studied above: $p_{fraud} \approx 0.026$. Taking $1 - \alpha = 0.95$, we can deduce a confidence interval:

t_alpha = 2.021 # Student quantile for n = 40, closest tabulated value available

Xn=mean(X)

Sn=sqrt(mean((X-Xn)**2)*n/(n-1))

a = 1/(Xn+Sn*t_alpha/sqrt(n))

b = 1/(Xn-Sn*t_alpha/sqrt(n))

With 95% confidence, we know that $p_{fraud}$ lies between 0.019 and 0.043. If we have $n_0 = 20000$ validated tickets, we can estimate the number $n_{fraud}$ of fraudsters: for every 1000 users, there are about 26 fraudsters and 974 honest users. Thus we have

$$n_{fraud} = n_0 \cdot \frac{p_{fraud}}{1 - p_{fraud}} \approx 534 \text{ fraudsters.}$$
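The arithmetic behind this count, as a one-line sketch in R (using the point estimate above):

p_fraud = 0.026

n_fraud = 20000*p_fraud/(1 - p_fraud)

n_fraud # ~534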

Exercise 2: Geometric distribution

The parameter $p_{fraud}$ of the geometric distribution was estimated with the maximum likelihood estimator $\hat{p}_n$. It was found that $p_{fraud} \approx 0.026$.

One only has access to a uniform distribution and would like to output a geometric distribution. We start by randomly drawing a number $q$ between 0 and 1, according to the uniform distribution. Then, we partition the interval $[0, 1]$ as follows:

$$[0, 1] = \bigcup_{k=1}^{+\infty} \left[\sum_{i=1}^{k-1} p_{fraud}(1-p_{fraud})^{i-1},\; \sum_{i=1}^{k} p_{fraud}(1-p_{fraud})^{i-1}\right] = \bigcup_{k=1}^{+\infty} \left[1-(1-p_{fraud})^{k-1},\; 1-(1-p_{fraud})^{k}\right]$$

Thus, if we draw $q$, we look for the $k$ that satisfies:

$$1-(1-p_{fraud})^{k-1} < q \le 1-(1-p_{fraud})^{k}$$

$$\iff (1-p_{fraud})^{k} \le 1-q < (1-p_{fraud})^{k-1}$$

$$\iff k\ln(1-p_{fraud}) \le \ln(1-q) < (k-1)\ln(1-p_{fraud})$$

$$\iff k-1 < \frac{\ln(1-q)}{\ln(1-p_{fraud})} \le k$$

Since the probability that this ratio is exactly an integer is 0, we can choose $k = \left\lceil \frac{\ln(1-q)}{\ln(1-p_{fraud})} \right\rceil$.

If $U$ follows a uniform distribution on $[0, 1]$, one can generate samples following the geometric distribution using the uniform distribution:

$$U \sim \mathcal{U}[0,1] \implies \left\lceil \frac{\ln(1-U)}{\ln(1-p_{fraud})} \right\rceil \sim \mathcal{G}(p_{fraud})$$

n = 100000

U = runif(n)

X = ceiling(log(1-U)/log(1-p))

hist(X, nclass=800, xlim = c(0,200), main = "Simulated Geometric")

[Figure: histogram "Simulated Geometric", X from 0 to 200]

(c)

We have shown that:

$$\sqrt{n}\,(\hat{q}_n - q) \xrightarrow{d} \mathcal{N}\left(0, q^2(1-q)\right)$$

The asymptotic normality means that for $n$ large, $\sqrt{n}\,(\hat{q}_n - q)$ approximately follows $\mathcal{N}(0, q^2(1-q))$. Thus one can plot several samples of the random variable $Z_n = \sqrt{n}\,(\hat{q}_n - q)$ and check whether the distribution looks Gaussian.

n = 10000 # Size of the observations for each q_n

N = 10000 # Number of samples of Z_n

Z_n = rep(0,N)

for (k in 1:N){

  U = runif(n)

  X = ceiling(log(1-U)/log(1-p))

  p_n = 1/mean(X)

  Z_n[k] = sqrt(n)*(p_n - p)

}

hist(Z_n, breaks=100, main = "Asymptotic normality")

[Figure: histogram "Asymptotic normality", Z_n from −0.10 to 0.10]

var_emp = mean(Z_n**2)

var_theo = p^2*(1-p)

The MLE is asymptotically normal. An estimation of the variance gives $\widehat{\mathrm{Var}}[Z_n] \approx 6.72 \times 10^{-4}$, which is close to the theoretical value $p_{fraud}^2(1-p_{fraud}) \approx 6.67 \times 10^{-4}$.

(d)

We have shown that a 95% confidence interval is:

$$\frac{1}{\bar{X}_n + \frac{S_n}{\sqrt{n}}\,t^{n-1}_{\alpha/2}} \le q \le \frac{1}{\bar{X}_n - \frac{S_n}{\sqrt{n}}\,t^{n-1}_{\alpha/2}}$$

For a 95% confidence level, with $t^{n-1}_{\alpha/2} = 2.021$, this gives:

$$\frac{1}{\bar{X}_n + 2.021\,\frac{S_n}{\sqrt{n}}} \le q \le \frac{1}{\bar{X}_n - 2.021\,\frac{S_n}{\sqrt{n}}}$$

We use the 39 given values to build the interval, then simulate N = 50000 datasets of size 39, estimate $q$ on each, and compute the fraction of estimates that fall inside the confidence interval.

n = 39 # Size of the observations for each q_n

N = 50000 # Number of samples of q_n

count = 0

q_n = rep(0,N)

for (k in 1:N){

  U = runif(n)

  X = ceiling(log(1-U)/log(1-p))

  p_n = 1/mean(X)

  q_n[k] = p_n

  if (a<p_n & p_n<b){count = count + 1}

}

hist(q_n, breaks=100, main = "Confidence interval")

[Figure: histogram "Confidence interval", q_n from 0.02 to 0.05]

We obtain that 98.6% of the simulations give an estimate of $q$ inside the confidence interval.
