[SOLVED] Ams 274 homework 1 to 5 solutions

1. The list below comprises a number of distributions, including, in each case, the support,
parameter space, and density or probability mass function. Determine whether each of
the distributions belongs to the exponential dispersion family, and likewise whether it belongs to the two-parameter exponential family of distributions. In both cases, justify your answers.
(a) Double exponential (or Laplace) distribution.
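$$f(y \mid \theta, \sigma) = \frac{1}{2\sigma} \exp\left(-\frac{|y - \theta|}{\sigma}\right), \quad y \in \mathbb{R},\ \theta \in \mathbb{R},\ \sigma > 0.$$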
(b) Uniform distribution.
$$f(y \mid \theta, \sigma) = \frac{1}{2\sigma}, \quad \theta - \sigma < y < \theta + \sigma,\ \theta \in \mathbb{R},\ \sigma > 0.$$
(c) Logistic distribution.
$$f(y \mid \theta, \sigma) = \frac{\exp((y - \theta)/\sigma)}{\sigma \{1 + \exp((y - \theta)/\sigma)\}^{2}}, \quad y \in \mathbb{R},\ \theta \in \mathbb{R},\ \sigma > 0.$$
(d) Cauchy distribution.
$$f(y \mid \theta, \sigma) = \frac{1}{\pi\sigma \{1 + ((y - \theta)/\sigma)^{2}\}}, \quad y \in \mathbb{R},\ \theta \in \mathbb{R},\ \sigma > 0.$$
(e) Pareto distribution.
$$f(y \mid \alpha, \beta) = \frac{\beta \alpha^{\beta}}{y^{\beta+1}}, \quad y \geq \alpha,\ \alpha > 0,\ \beta > 0.$$
(f) Beta distribution.
$$f(y \mid \alpha, \beta) = \frac{y^{\alpha-1} (1 - y)^{\beta-1}}{B(\alpha, \beta)}, \quad 0 \leq y \leq 1,\ \alpha > 0,\ \beta > 0,$$
where $B(\alpha, \beta) = \int_0^1 u^{\alpha-1} (1 - u)^{\beta-1}\, du$ is the beta function.
(g) Negative binomial distribution.
$$f(y \mid \alpha, p) = \frac{\Gamma(y + \alpha)}{\Gamma(\alpha)\, y!}\, p^{\alpha} (1 - p)^{y}, \quad y \in \{0, 1, 2, \dots\},\ \alpha > 0,\ 0 < p < 1,$$
where $\Gamma(c) = \int_0^\infty u^{c-1} \exp(-u)\, du$ is the gamma function.
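For reference when working through parts (a) to (g), the two families are usually written as below. This is the common textbook notation and is an assumption here: the symbols $a(\phi)$, $b(\theta)$, $c(y, \phi)$, $h$, $\eta_j$, $T_j$ and $A$ do not appear in the problem statement, and the course notes may define the families with different symbols.

```latex
% Exponential dispersion family (natural parameter \theta, dispersion \phi)
f(y \mid \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\}

% Two-parameter exponential family (parameter vector \psi = (\psi_1, \psi_2))
f(y \mid \psi) = h(y) \exp\left\{ \eta_1(\psi) T_1(y) + \eta_2(\psi) T_2(y) - A(\psi) \right\}
```

Under either form, the set of y values with positive density cannot depend on the parameters, which is a quick first check before attempting to match terms.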
2. Consider the linear regression setting where the responses $Y_i$, $i = 1, \dots, n$, are assumed
independent with means $\mu_i = E(Y_i) = x_i^T \beta = \sum_{j=1}^{p} x_{ij} \beta_j$ for (known) covariates $x_{ij}$ and
(unknown) regression coefficients $\beta = (\beta_1, \dots, \beta_p)^T$.
(i) Show that if the response distribution is normal,
$$Y_i \overset{ind.}{\sim} f(y_i \mid \mu_i, \sigma) = (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{(y_i - \mu_i)^2}{2\sigma^2}\right\}, \quad i = 1, \dots, n,$$
then the maximum likelihood estimate (MLE) of $\beta$ is obtained by minimizing the L2-norm,
$$S_2(\beta) = \sum_{i=1}^{n} (y_i - x_i^T \beta)^2.$$
(ii) Show that if the response distribution is double exponential,
$$Y_i \overset{ind.}{\sim} f(y_i \mid \mu_i, \sigma) = (2\sigma)^{-1} \exp\left\{-\frac{|y_i - \mu_i|}{\sigma}\right\}, \quad i = 1, \dots, n,$$
then the MLE of $\beta$ is obtained by minimizing the L1-norm,
$$S_1(\beta) = \sum_{i=1}^{n} |y_i - x_i^T \beta|.$$
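(iii) Show that if the response distribution is uniform over the range $[\mu_i - \sigma, \mu_i + \sigma]$,
$$Y_i \overset{ind.}{\sim} f(y_i \mid \mu_i, \sigma) = (2\sigma)^{-1}, \quad \text{for } \mu_i - \sigma \leq y_i \leq \mu_i + \sigma, \quad i = 1, \dots, n,$$
then the MLE of $\beta$ is obtained by minimizing the L∞-norm,
$$S_\infty(\beta) = \max_i |y_i - x_i^T \beta|.$$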
(iv) Obtain the MLE of σ under each one of the response distributions in (i) – (iii) and
show that, in all cases, it is a function of the minimized norm.
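The three criteria are easiest to compare in the simplest case, a model with a single intercept (p = 1 and every covariate equal to 1), where the minimizers are the sample mean, the sample median and the midrange. The sketch below is only an illustration under that assumption: the data values and all function names are hypothetical, and the scale estimates follow from maximizing each log-likelihood in σ with β held at its minimizer.

```python
import statistics

def s2(y, b):
    """L2 criterion: sum of squared residuals around the constant b."""
    return sum((yi - b) ** 2 for yi in y)

def s1(y, b):
    """L1 criterion: sum of absolute residuals around the constant b."""
    return sum(abs(yi - b) for yi in y)

def s_inf(y, b):
    """L-infinity criterion: largest absolute residual around the constant b."""
    return max(abs(yi - b) for yi in y)

y = [1.0, 2.0, 4.0, 9.0, 14.0]   # hypothetical responses, not from the assignment
n = len(y)

b_l2 = statistics.mean(y)          # minimizes S2 (normal errors)
b_l1 = statistics.median(y)        # minimizes S1 (double exponential errors)
b_linf = (min(y) + max(y)) / 2.0   # minimizes S_inf (uniform errors): the midrange

# Scale MLEs written as functions of the minimized norms.
sigma2_normal = s2(y, b_l2) / n    # MLE of sigma^2 under the normal model
sigma_laplace = s1(y, b_l1) / n    # MLE of sigma under the double exponential model
sigma_uniform = s_inf(y, b_linf)   # MLE of sigma under the uniform model
```

With covariates, the L1 and L∞ problems no longer have closed forms and are usually solved by linear programming, which is outside the scope of this sketch.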
3. Consider the special case of the Cauchy distribution, C(θ, 1), with scale parameter σ = 1,
and density function
$$f(y \mid \theta) = \frac{1}{\pi \{1 + (y - \theta)^2\}}, \quad y \in \mathbb{R},\ \theta \in \mathbb{R},$$
where $\theta$ is the median of the distribution.
(a) Let $y = (y_1, \dots, y_n)$ be a random sample from the C(θ, 1) distribution. Develop the
Newton-Raphson method and the method of scoring to approximate the maximum likelihood estimate of θ based on the sample y. (For the method of scoring, you can use the
result $\int_0^\infty (1 - x^2) / (1 + x^2)^3 \, dx = \pi/8$.)
(b) Consider a sample, assumed to arise from the C(θ, 1) distribution, with n = 9 and
y = (−0.774, 0.597, 7.575, 0.397, −0.865, −0.318, −0.125, 0.961, 1.039). Apply both
methods from (a) to estimate θ. To check your results, try a few different starting values
and also plot the likelihood function for θ.
(c) Now consider a sample (again, assumed to arise from the C(θ, 1) distribution) with
n = 3 and y = (0, 5, 9). Apply again the methods from (a) to estimate θ, using three
different starting values, $\theta^0 = -1$, $\theta^0 = 4.67$, $\theta^0 = 10$. Comment on the results.
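The two iterations in part (a) can be sketched as follows. This is a minimal illustration, not a reference solution: the function names, tolerances and the choice of the sample median as starting value are assumptions. The score and second derivative come from differentiating the C(θ, 1) log-likelihood, and the scoring update uses an expected information of n/2, which is what the integral quoted in part (a) yields.

```python
import statistics

def score(theta, y):
    """First derivative of the Cauchy(theta, 1) log-likelihood."""
    return sum(2.0 * (yi - theta) / (1.0 + (yi - theta) ** 2) for yi in y)

def second_derivative(theta, y):
    """Second derivative of the Cauchy(theta, 1) log-likelihood."""
    return sum(2.0 * ((yi - theta) ** 2 - 1.0) / (1.0 + (yi - theta) ** 2) ** 2
               for yi in y)

def newton_raphson(y, theta0, tol=1e-10, max_iter=100):
    """theta <- theta - l'(theta) / l''(theta), using the observed curvature."""
    theta = theta0
    for _ in range(max_iter):
        step = score(theta, y) / second_derivative(theta, y)
        theta -= step
        if abs(step) < tol:
            break
    return theta

def fisher_scoring(y, theta0, tol=1e-10, max_iter=500):
    """theta <- theta + l'(theta) / I(theta), with expected information I = n / 2."""
    n = len(y)
    theta = theta0
    for _ in range(max_iter):
        step = 2.0 * score(theta, y) / n
        theta += step
        if abs(step) < tol:
            break
    return theta

# Sample from part (b); the median is used as a starting value (an assumption).
y_b = [-0.774, 0.597, 7.575, 0.397, -0.865, -0.318, -0.125, 0.961, 1.039]
start = statistics.median(y_b)
theta_nr = newton_raphson(y_b, start)
theta_fs = fisher_scoring(y_b, start)
```

Newton-Raphson divides by the observed second derivative, which can be positive or close to zero far from the mode, so its behaviour depends strongly on the starting value; the scoring step always moves in the direction of the score.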
4. The data in the table below show the number of cases of AIDS in Australia by date of
diagnosis for successive 3-month periods from 1984 to 1988.

        Quarter
Year      1     2     3     4
1984      1     6    16    23
1985     27    39    31    30
1986     43    51    63    70
1987     88    97    91   104
1988    110   113   149   159

Let $x_i = \log(i)$, where i denotes the time period for i = 1, …, 20. Consider a GLM for this
data set based on a Poisson response distribution with mean $\mu_i$, systematic component $\beta_1 + \beta_2 x_i$, and logarithmic link function $g(\mu_i) = \log(\mu_i)$.
(a) Fit this GLM to the data working from first principles, that is, derive the expressions that are needed for the scoring method, and implement the algorithm to obtain the
maximum likelihood estimates for β1 and β2.
(b) Use function “glm” in R to verify your results from part (a).
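A compact sketch of the scoring iteration for part (a) is given below, written in Python rather than R so that every step is explicit. It assumes the periods are numbered row by row (1984 quarter 1 is i = 1, 1988 quarter 4 is i = 20) and starts from an intercept-only fit; the function names and the starting value are illustrative choices. Because the log link is canonical for the Poisson, the expected and observed information coincide, so scoring and Newton-Raphson give the same update.

```python
import math

# AIDS counts for periods i = 1, ..., 20 (read row by row from the table).
y = [1, 6, 16, 23, 27, 39, 31, 30, 43, 51,
     63, 70, 88, 97, 91, 104, 110, 113, 149, 159]
x = [math.log(i) for i in range(1, 21)]

def score_and_info(b1, b2):
    """Score vector and Fisher information for log(mu_i) = b1 + b2 * x_i."""
    u1 = u2 = i11 = i12 = i22 = 0.0
    for xi, yi in zip(x, y):
        mu = math.exp(b1 + b2 * xi)
        u1 += yi - mu               # derivative with respect to b1
        u2 += xi * (yi - mu)        # derivative with respect to b2
        i11 += mu                   # information entries: sums of x x^T mu
        i12 += xi * mu
        i22 += xi * xi * mu
    return (u1, u2), (i11, i12, i22)

def fit(max_iter=50, tol=1e-10):
    """Scoring: beta <- beta + I(beta)^{-1} U(beta), with a 2x2 inverse by hand."""
    b1, b2 = math.log(sum(y) / len(y)), 0.0   # intercept-only starting value
    for _ in range(max_iter):
        (u1, u2), (i11, i12, i22) = score_and_info(b1, b2)
        det = i11 * i22 - i12 * i12
        d1 = (i22 * u1 - i12 * u2) / det
        d2 = (i11 * u2 - i12 * u1) / det
        b1, b2 = b1 + d1, b2 + d2
        if max(abs(d1), abs(d2)) < tol:
            break
    return b1, b2

b1_hat, b2_hat = fit()
```

The inverse of the information matrix at convergence gives the asymptotic covariance of the estimates, which is what the standard errors reported by `glm` in part (b) are based on.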

1. Let $y_i$ be realizations of independent random variables $Y_i$ with Poisson($\mu_i$) distributions, where
$E(Y_i) = \mu_i$, for $i = 1, \dots, n$.
(a) Obtain the expression for the deviance for comparison of the full model, which assumes a
different $\mu_i$ for each $y_i$, with a reduced model defined by a Poisson GLM with link function $g(\cdot)$.
That is, under the reduced model, $g(\mu_i) = \eta_i = x_i^T \beta$, where $\beta = (\beta_1, \dots, \beta_p)^T$ (with p < n) is
the vector of regression coefficients corresponding to covariates $x_i = (x_{i1}, \dots, x_{ip})^T$.
(b) Show that the expression for the deviance simplifies to $2 \sum_{i=1}^{n} y_i \log(y_i / \hat{\mu}_i)$, for the special
case of the reduced model in part (a) with $g(\mu_i) = \log(\mu_i)$, and linear predictor that includes
an intercept, that is, $\eta_i = \beta_1 + \sum_{j=2}^{p} x_{ij} \beta_j$, for $i = 1, \dots, n$.
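The simplification in part (b) can be checked numerically on a toy case. The sketch below uses the usual Poisson deviance, 2 Σ {y log(y/µ̂) − (y − µ̂)}, with the convention that y log(y) is 0 at y = 0, and compares it with the shortened form for an intercept-only log-link fit, where every fitted mean equals the sample mean. The data and function names are hypothetical.

```python
import math

def poisson_deviance(y, mu):
    """General Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)), with 0*log(0) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total

def simplified_deviance(y, mu):
    """Shortened form, valid when the residuals y - mu sum to zero."""
    return 2.0 * sum(yi * math.log(yi / mi) for yi, mi in zip(y, mu) if yi > 0)

y = [2, 0, 5, 3]                          # hypothetical counts
mu_hat = [sum(y) / len(y)] * len(y)       # intercept-only fit: common mean = sample mean
d_full = poisson_deviance(y, mu_hat)
d_simple = simplified_deviance(y, mu_hat)
```

The two values agree here because the intercept's score equation forces the fitted means to sum to the observed total; with a non-canonical link that constraint is not guaranteed and the extra term has to be kept.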
2. Let $y_i$, $i = 1, \dots, n$, be realizations of independent random variables $Y_i$ following gamma($\mu_i$, $\nu$)
distributions, with densities given by
$$f(y_i \mid \mu_i, \nu) = \frac{(\nu/\mu_i)^{\nu}\, y_i^{\nu-1} \exp(-\nu y_i/\mu_i)}{\Gamma(\nu)}, \quad y_i > 0;\ \nu > 0,\ \mu_i > 0,$$
where $\Gamma(\nu) = \int_0^\infty t^{\nu-1} \exp(-t)\, dt$ is the Gamma function.
(a) Express the gamma distribution as a member of the exponential dispersion family.
(b) Obtain the scaled deviance and the deviance for the comparison of the full model, which
includes a different $\mu_i$ for each $y_i$, with a gamma GLM based on link function $g(\mu_i) = x_i^T \beta$,
where $\beta = (\beta_1, \dots, \beta_p)$ (p < n) is the vector of regression coefficients corresponding to a set of p
covariates.
3. Consider the data set from:
https://www.stat.columbia.edu/~gelman/book/data/fabric.asc
on the incidence of faults in the manufacturing of rolls of fabric. The first column contains the
length of each roll (the covariate with values xi), and the second contains the number of faults
(the response with means µi).
(a) Use R to fit a Poisson GLM, with logarithmic link,
log(µi) = β1 + β2xi (1)
to explain the number of faults in terms of length of roll.
(b) Fit the regression model for the response means in (1) using the quasi-likelihood estimation
method, which allows for a dispersion parameter in the response variance function. (Use the
quasipoisson “family” in R.) Discuss the results.
(c) Derive point estimates and asymptotic interval estimates for the linear predictor, $\eta_0 = \beta_1 + \beta_2 x_0$, at a new value $x_0$ for length of roll, under the standard (likelihood) estimation method
from part (a), and the quasi-likelihood estimation method from part (b). Evaluate the point and
interval estimates at $x_0 = 500$ and $x_0 = 995$. (Under both cases, use the asymptotic bivariate
normality of $(\hat{\beta}_1, \hat{\beta}_2)$ to obtain the asymptotic distribution of $\hat{\eta}_0 = \hat{\beta}_1 + \hat{\beta}_2 x_0$.)
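For part (c), the interval follows from the fact that a linear combination of asymptotically normal estimates is itself asymptotically normal, with variance v11 + 2 x0 v12 + x0² v22. The sketch below shows that calculation in Python; the coefficient values and covariance matrix are hypothetical placeholders, not fitted values, and the `dispersion` argument assumes the covariance passed in is the unscaled Poisson one (software output for a quasi-likelihood fit may already include the scaling).

```python
import math
from statistics import NormalDist

def eta_interval(beta, cov, x0, dispersion=1.0, level=0.95):
    """Point estimate, standard error and Wald interval for eta0 = b1 + b2 * x0."""
    b1, b2 = beta
    (v11, v12), (_, v22) = cov
    eta = b1 + b2 * x0
    # Var(b1 + x0*b2) = v11 + 2*x0*v12 + x0^2*v22, inflated by the dispersion.
    var = dispersion * (v11 + 2.0 * x0 * v12 + x0 ** 2 * v22)
    se = math.sqrt(var)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return eta, se, (eta - z * se, eta + z * se)

# Hypothetical estimates and covariance matrix, for illustration only.
beta_hat = (1.0, 0.002)
cov_hat = ((0.04, -0.0001), (-0.0001, 0.000001))

eta0, se0, ci0 = eta_interval(beta_hat, cov_hat, x0=500)
eta0_q, se0_q, ci0_q = eta_interval(beta_hat, cov_hat, x0=500, dispersion=2.0)
```

The point estimate is identical under both estimation methods; only the interval width changes, by a factor equal to the square root of the estimated dispersion.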
4. This problem deals with data collected as the number of Ceriodaphnia organisms counted in a
controlled environment in which reproduction is occurring among the organisms. The experimenter places into the containers a varying concentration of a particular component of jet fuel
that impairs reproduction. It is anticipated that as the concentration of jet fuel grows, the number of organisms should decrease. The problem also includes a categorical covariate introduced
through use of two different strains of the organism.
The data set is available from the course website
https://ams274-fall16-01.courses.soe.ucsc.edu/node/4
where the first column includes the number of organisms, the second the concentration of jet
fuel (in grams per liter), and the third the strain of the organism (with covariate values 0 and 1).
Build a Poisson GLM to study the effect of the covariates (jet fuel concentration and organism
strain) on the number of Ceriodaphnia organisms. Use graphical exploratory data analysis to
motivate possible choices for the link function and the linear predictor. Use classical measures of
goodness-of-fit and model comparison (deviance, AIC and BIC), as well as Pearson and deviance
residuals, to assess model fit and to compare different model formulations. Provide a plot of the
estimated regression functions under your proposed model.

1. The table below reports results from a toxicological experiment, including the number of beetles killed (yi) after 5 hours exposure to gaseous carbon disulphide at various concentrations.
Concentration (log dose, xi) is given on the log10 scale.
Log dose, xi    Number of beetles, mi    Number killed, yi
1.6907          59                       6
1.7242          60                       13
1.7552          62                       18
1.7842          56                       28
1.8113          63                       52
1.8369          59                       53
1.8610          62                       61
1.8839          60                       60

Consider a binomial response distribution, and assume that the $y_i$ are independent realizations
from Bin($m_i$, $\pi_i$), $i = 1, \dots, n$. The objective is to study the effect of the choice of link function
$g(\cdot)$, where $\pi_i = g^{-1}(\eta_i) = g^{-1}(\beta_1 + \beta_2 x_i)$.
(a) Using R, fit 3 binomial GLMs for these data corresponding to 3 link functions: logit, probit
and complementary log-log. Perform residual analysis for each model, using the deviance residuals. Obtain fitted values, $\hat{\pi}_i$, under each model and compare with observed proportions, $y_i/m_i$.
Obtain the estimated dose-response curve under each model by evaluating $\hat{\pi}(x) = g^{-1}(\hat{\beta}_1 + \hat{\beta}_2 x)$
over a grid of values x for log dose in the interval (1.65, 1.9). Plot these curves and compare
with the scatter plot of the observed xi plotted against the observed proportions. Based on all
the results above, discuss the fit of the different models.
(b) One of the more general (parametric) link functions for binomial GLMs that has been
suggested in the literature is defined through
$$g_\alpha^{-1}(\eta_i) = \frac{\exp(\alpha \eta_i)}{\{1 + \exp(\eta_i)\}^{\alpha}}, \quad \text{for } \alpha > 0. \tag{1.1}$$
Note that the logit link arises as a special case of (1.1), when α = 1. Discuss the effect of the
additional model parameter α, in particular, for values 0 < α < 1 and α > 1. Provide the
expression for the log-likelihood for β1, β2 and α under the link in (1.1), and discuss the complications that arise for maximum likelihood estimation under this more general model compared
with the logit GLM. (You do not need to fit the model, estimates are given below.)
(c) The MLEs under the model with link given in (1.1) are $\hat{\beta}_1 = -113.625$, $\hat{\beta}_2 = 62.5$ and
$\hat{\alpha} = 0.279$. (The MLEs can be obtained using the Newton-Raphson method.) Using these estimates, obtain the fitted values $\hat{\pi}_i$ and the estimated dose-response curve under the link (1.1).
Compare with the corresponding results under the 3 models in (a). Obtain the deviance residuals from the model with link (1.1) and analyze them graphically.
(d) Compute the AIC and BIC for the 4 models considered above to compare them.
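The inverse links compared in this problem are short enough to write out directly. The sketch below, in Python for illustration, defines the three standard inverse links and the parametric link (1.1), using the identity exp(αη)/{1 + exp(η)}^α = {exp(η)/(1 + exp(η))}^α, and evaluates the fitted probabilities under (1.1) at the estimates quoted in part (c). Function and variable names are assumptions of this sketch.

```python
import math
from statistics import NormalDist

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv_probit(eta):
    return NormalDist().cdf(eta)

def inv_cloglog(eta):
    return 1.0 - math.exp(-math.exp(eta))

def inv_general(eta, alpha):
    """Inverse of link (1.1): the logistic response raised to the power alpha."""
    return inv_logit(eta) ** alpha

# Beetle mortality data from the table above.
x = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
m = [59, 60, 62, 56, 63, 59, 62, 60]
y = [6, 13, 18, 28, 52, 53, 61, 60]

# Estimates quoted in part (c) for the model with link (1.1).
b1, b2, alpha = -113.625, 62.5, 0.279
fitted = [inv_general(b1 + b2 * xi, alpha) for xi in x]
observed = [yi / mi for yi, mi in zip(y, m)]

def binomial_loglik(p):
    """Binomial log-likelihood at fitted probabilities p (terms with zero count skipped)."""
    total = 0.0
    for pi, mi, yi in zip(p, m, y):
        total += math.log(math.comb(mi, yi))
        if yi > 0:
            total += yi * math.log(pi)
        if mi - yi > 0:
            total += (mi - yi) * math.log(1.0 - pi)
    return total

loglik_general = binomial_loglik(fitted)
```

The log-likelihood value can be combined with the number of parameters (3 for link (1.1), 2 for the models in part (a)) to form the criteria in part (d); which sample size to use in the BIC penalty is a modelling choice the sketch leaves open.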
2. This problem involves Bayesian analysis of the beetle mortality data from the previous problem.
(a) Consider a Bayesian binomial GLM with a complementary log-log link, i.e., assume that,
given β1 and β2, the yi are independent from Bin(mi, π(xi)), i = 1,…,8, where
π(x) ≡ π(x; β1, β2) = 1 − exp{− exp(β1 + β2x)}.
Design and implement an MCMC method to sample from the posterior distribution of (β1, β2).
Study the effect of the prior for (β1, β2), for example, consider a flat prior as well as (independent) normal priors. Under the flat prior, obtain the posterior distribution for the median
lethal dose, LD50, that is, the dose level at which the probability of response is 0.5. Finally, plot
point and interval estimates for the dose-response curve π(x) (over a grid of values x for log dose).
(b) Next, consider a binomial GLM with a logit link, i.e., now the yi are assumed independent, given β1 and β2, from Bin(mi, π(xi)), i = 1,…,8, where
π(x) ≡ π(x; β1, β2) = exp(β1 + β2x)/{1 + exp(β1 + β2x)}.
Working with a flat prior for (β1, β2), obtain MCMC samples from the posterior distributions
for β1, β2, and for LD50, along with point and interval estimates for the dose-response curve π(x).
(c) As a third model, consider the binomial GLM with the parametric link given in (1.1).
Develop an MCMC method to sample from the posterior distribution of (β1, β2, α), and obtain
the posterior distribution for LD50, and point and interval estimates for π(x).
(d) Use the results from parts (a), (b) and (c) for an empirical comparison of the three Bayesian
binomial GLMs for the beetle mortality data. Moreover, perform a residual analysis for each
model using the Bayesian residuals: (yi/mi) − π(xi; β1, β2) for the first two models, and (yi/mi) −
π(xi; β1, β2, α) for the third. Finally, use the quadratic loss L measure for formal comparison of
the three models.
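One simple option for part (a) is a random-walk Metropolis sampler, sketched below in Python. It is only one of several valid designs and is not presented as the intended solution: the starting values, proposal standard deviations, seed and function names are all assumptions, and because β1 and β2 are strongly correlated in the posterior, independent proposals like these mix slowly (centering x or using a correlated proposal would help in practice).

```python
import math
import random

X = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
M = [59, 60, 62, 56, 63, 59, 62, 60]
Y = [6, 13, 18, 28, 52, 53, 61, 60]

def log_post(b1, b2):
    """Log-posterior under a flat prior: the cloglog binomial log-likelihood."""
    total = 0.0
    for x, m, y in zip(X, M, Y):
        eta = b1 + b2 * x
        if abs(eta) > 700.0:            # guard against overflow in exp
            return -math.inf
        h = math.exp(eta)               # h = -log(1 - pi)
        p = -math.expm1(-h)             # pi = 1 - exp(-exp(eta))
        if y > 0:
            if p <= 0.0:
                return -math.inf
            total += y * math.log(p)
        total -= (m - y) * h            # (m - y) * log(1 - pi)
    return total

def metropolis(n_iter=5000, start=(-39.5, 22.0), step=(0.1, 0.05), seed=1):
    """Random-walk Metropolis with independent normal proposals (assumed tuning)."""
    rng = random.Random(seed)
    b1, b2 = start
    cur = log_post(b1, b2)
    draws, accepted = [], 0
    for _ in range(n_iter):
        p1 = b1 + rng.gauss(0.0, step[0])
        p2 = b2 + rng.gauss(0.0, step[1])
        prop = log_post(p1, p2)
        # Accept with probability min(1, exp(prop - cur)).
        if math.log(1.0 - rng.random()) < prop - cur:
            b1, b2, cur = p1, p2, prop
            accepted += 1
        draws.append((b1, b2))
    return draws, accepted / n_iter

draws, acc_rate = metropolis()
# Under the cloglog link, pi(x) = 0.5 when exp(b1 + b2*x) = log(2).
ld50 = [(math.log(math.log(2.0)) - b1) / b2 for b1, b2 in draws]
```

A real analysis would discard a burn-in period, run longer chains and check convergence before summarizing `draws` and `ld50`; normal priors can be added by including their log-densities in `log_post`.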

1. The table below reports results from a developmental toxicity study involving ordinal
categorical outcomes. This study administered diethylene glycol dimethyl ether (an industrial solvent used in the manufacture of protective coatings) to pregnant mice. Each
mouse was exposed to one of five concentration levels for ten days early in the pregnancy
(with concentration 0 corresponding to controls). Two days later, the uterine contents of
the pregnant mice were examined for defects. One of three (ordered) outcomes (“Dead”,
“Malformation”, “Normal”) was recorded for each fetus.
Concentration       Response                            Total number
(mg/kg per day)     Dead     Malformation    Normal     of subjects
(xi)                (yi1)    (yi2)           (yi3)      (mi)
0                   15       1               281        297
62.5                17       0               225        242
125                 22       7               283        312
250                 38       59              202        299
500                 144      132             9          285
Build a multinomial regression model for these data using continuation-ratio logits for the
response probabilities πj (x), j = 1, 2, 3, as a function of concentration level, x. Specifically,
consider the following model
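$$L_1^{(cr)} = \log\left(\frac{\pi_1}{\pi_2 + \pi_3}\right) = \alpha_1 + \beta_1 x; \qquad L_2^{(cr)} = \log\left(\frac{\pi_2}{\pi_3}\right) = \alpha_2 + \beta_2 x$$
for the multinomial response probabilities $\pi_j \equiv \pi_j(x)$, j = 1, 2, 3.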
(a) Show that the model, involving the multinomial likelihood for the data = {(yi1, yi2, yi3, xi) :
i = 1, …, 5}, can be fitted by fitting separately two Binomial GLMs. Provide details for
your argument, including the specific form of the Binomial GLMs.
(b) Use the result from part (a) to obtain the MLEs and corresponding standard errors for parameters (α1, α2, β1, β2). Plot the estimated response curves $\hat{\pi}_j(x)$, for
j = 1, 2, 3, and discuss the results.
(c) Develop and implement a Bayesian version of the model above. Discuss your prior
choice, and provide details for the posterior simulation method. Provide point and interval
estimates for the response curves πj (x), for j = 1, 2, 3.
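The data layout behind the two separate binomial fits, and the map from the continuation-ratio parameters back to the three response probabilities, can be sketched as below. The code only rearranges the table and applies the definitions of the two logits; the coefficient values passed to `response_probs` in the last line are hypothetical, and the names are assumptions of this sketch.

```python
import math

conc = [0.0, 62.5, 125.0, 250.0, 500.0]
dead = [15, 17, 22, 38, 144]
malformed = [1, 0, 7, 59, 132]
normal = [281, 225, 283, 202, 9]

# First logit: "dead" versus the rest, out of all m_i fetuses.
binom1 = [(d, d + f + s) for d, f, s in zip(dead, malformed, normal)]
# Second logit: "malformation" versus "normal", out of the m_i - y_i1 live fetuses.
binom2 = [(f, f + s) for f, s in zip(malformed, normal)]

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

def response_probs(x, a1, b1, a2, b2):
    """Recover (pi1, pi2, pi3) from the two continuation-ratio logits at dose x."""
    p_dead = expit(a1 + b1 * x)                  # pi1 / (pi1 + pi2 + pi3)
    p_malf_given_alive = expit(a2 + b2 * x)      # pi2 / (pi2 + pi3)
    pi1 = p_dead
    pi2 = (1.0 - p_dead) * p_malf_given_alive
    pi3 = (1.0 - p_dead) * (1.0 - p_malf_given_alive)
    return pi1, pi2, pi3

# Hypothetical coefficients, only to show the function in use.
example = response_probs(100.0, a1=-3.0, b1=0.006, a2=-5.0, b2=0.01)
```

Each list of (successes, trials) pairs is what a standard binomial GLM routine with a logit link expects, paired with `conc` as the covariate.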
2. Consider the “alligator food choice” data example, the full version of which is discussed
in Section 7.1 of Agresti (2002), Categorical Data Analysis, Second Edition. Here, consider the subset of the data reported in Table 7.16 (page 304) of the above book. This
data set involves observations on the primary food choice for n = 63 alligators caught in
Lake George, Florida. The nominal response variable is the primary food type (in volume) found in each alligator’s stomach, with three categories: “fish”, “invertebrate”, and
“other”. The invertebrates were mainly apple snails, aquatic insects, and crayfish. The
“other” category included amphibian, mammal, bird, reptile, and plant material. Also
available for each alligator is covariate information on its length (in meters) and gender.
(a) Focus first on length as the single covariate to explain the response probabilities
for the “fish”, “invertebrate” and “other” food choice categories. Develop a Bayesian
multinomial regression model, using the baseline-category logits formulation with “fish”
as the baseline category, to estimate (with point and interval estimates) the response
probabilities as a function of length. (Note that in this data example, we have mi = 1,
for i = 1, …, n.) Discuss your prior choice and approach to MCMC posterior simulation.
(b) Extend the model from part (a) to describe the effects of both length and gender
on food choice. Based on your proposed model, provide point and interval estimates for
the length-dependent response probabilities for male and female alligators.
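Under the baseline-category formulation with "fish" as the baseline, the response probabilities at a given length follow from two linear predictors, one per non-baseline category. The sketch below shows that map for the single-covariate model in part (a); the coefficient values are hypothetical and the function name is an assumption. Inside an MCMC run, the same function would be applied to every posterior draw to build point and interval estimates over a grid of lengths.

```python
import math

def food_probs(length, inv_coef, other_coef):
    """Baseline-category logits with "fish" as baseline.

    inv_coef and other_coef are (intercept, slope) pairs for
    log(pi_invertebrate / pi_fish) and log(pi_other / pi_fish).
    """
    eta_inv = inv_coef[0] + inv_coef[1] * length
    eta_oth = other_coef[0] + other_coef[1] * length
    denom = 1.0 + math.exp(eta_inv) + math.exp(eta_oth)
    return {
        "fish": 1.0 / denom,
        "invertebrate": math.exp(eta_inv) / denom,
        "other": math.exp(eta_oth) / denom,
    }

# Hypothetical coefficients, evaluated for an alligator 2 meters long.
probs = food_probs(2.0, inv_coef=(4.0, -2.0), other_coef=(-1.0, 0.0))
```

Extending to part (b) only changes the linear predictors, for example by adding a gender term to each of them.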
3. Consider the inverse Gaussian distribution with density function
$$f(y \mid \mu, \phi) = (2\pi\phi y^3)^{-1/2} \exp\left\{-\frac{(y - \mu)^2}{2\phi\mu^2 y}\right\}, \quad y > 0;\ \mu > 0,\ \phi > 0.$$
Denote the inverse Gaussian distribution with parameters µ and φ by IG(µ, φ).
(a) Show that the inverse Gaussian distribution is a member of the exponential dispersion
family. Show that µ is the mean of the distribution and obtain the variance function.
(b) Consider a GLM with random component defined by the inverse Gaussian distribution. That is, assume that $y_i$ are realizations of independent random variables $Y_i$ with
IG($\mu_i$, $\phi$) distributions, for $i = 1, \dots, n$. Here, $g(\mu_i) = x_i^T \beta$, where $\beta = (\beta_1, \dots, \beta_p)$ (p < n)
is the vector of regression coefficients, and $x_i = (x_{i1}, \dots, x_{ip})^T$ is the covariate vector for
the ith response, $i = 1, \dots, n$. Define the full model so that the $y_i$ are realizations of independent IG($\mu_i$, $\phi$) distributed random variables $Y_i$, with a distinct $\mu_i$ for each $y_i$. Obtain
the scaled deviance for the comparison of the full model with the inverse Gaussian GLM.

Consider the data set from homework 2, problem 3 on the incidence of faults in the manufacturing
of rolls of fabric:
https://www.stat.columbia.edu/~gelman/book/data/fabric.asc
where the first column contains the length of each roll, which is the covariate with values xi,
and the second column contains the number of faults, which is the response with values yi and
means µi.
(a) Fit a Bayesian Poisson GLM to these data, using a logarithmic link, log(µi) = β1 + β2xi.
Obtain the posterior distributions for β1 and β2 (under a flat prior for (β1, β2)), as well as point
and interval estimates for the response mean as a function of the covariate (over a grid of covariate values). Obtain the distributions of the posterior predictive residuals, and use them for
model checking.
(b) Develop a hierarchical extension of the Poisson GLM from part (a), using a gamma distribution for the response means across roll lengths. Specifically, for the second stage of the
hierarchical model, assume that
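$$\mu_i \mid \gamma_i, \lambda \overset{ind.}{\sim} \frac{1}{\Gamma(\lambda)} \left(\frac{\lambda}{\gamma_i}\right)^{\lambda} \mu_i^{\lambda-1} \exp\left(-\frac{\lambda}{\gamma_i} \mu_i\right), \quad \mu_i > 0;\ \lambda > 0,\ \gamma_i > 0,$$
where $\log(\gamma_i) = \beta_1 + \beta_2 x_i$. (Here, $\Gamma(u) = \int_0^\infty t^{u-1} \exp(-t)\, dt$ is the Gamma function.)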
Derive the expressions for $E(Y_i \mid \beta_1, \beta_2, \lambda)$ and $\mathrm{Var}(Y_i \mid \beta_1, \beta_2, \lambda)$, and compare them with the
corresponding expressions under the non-hierarchical model from part (a). Develop an MCMC
method for posterior simulation providing details for all its steps. Derive the expression for the
posterior predictive distribution of a new (unobserved) response $y_0$ corresponding to a specified
covariate value $x_0$, which is not included in the observed $x_i$. Implement the MCMC algorithm
to obtain the posterior distributions for β1, β2 and λ, as well as point and interval estimates for
the response mean as a function of the covariate (over a grid of covariate values). Discuss model
checking results based on posterior predictive residuals.
Regarding the priors, you can use again the flat prior for (β1, β2), but perform prior sensitivity analysis for λ considering different proper priors, including $p(\lambda) = (\lambda + 1)^{-2}$.
(c) Based on your results from parts (a) and (b), provide discussion on empirical comparison between the two models. Moreover, use the quadratic loss L measure for formal comparison
of the two models, in particular, to check if the hierarchical Poisson GLM offers an improvement to the fit of the non-hierarchical GLM. (Provide details on the required expressions for
computing the value of the model comparison criterion.)
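A numerical check can be useful when deriving the moments requested in part (b). Integrating the gamma-distributed mean out of the Poisson gives a negative binomial marginal, and the sketch below sums that probability mass function directly to recover its mean and variance for one hypothetical pair (γ, λ). The function name, the chosen values and the truncation point of the sum are assumptions of this sketch.

```python
import math

def marginal_pmf(y, gamma, lam):
    """P(Y = y | gamma, lam) after integrating mu out: a negative binomial pmf."""
    log_p = (math.lgamma(y + lam) - math.lgamma(lam) - math.lgamma(y + 1)
             + lam * math.log(lam / (lam + gamma))
             + y * math.log(gamma / (lam + gamma)))
    return math.exp(log_p)

gamma_i, lam = 3.0, 2.0            # hypothetical values of gamma_i and lambda
support = range(0, 500)            # truncation point; the tail beyond it is negligible here
pmf = [marginal_pmf(k, gamma_i, lam) for k in support]

total = sum(pmf)
mean = sum(k * p for k, p in zip(support, pmf))
var = sum((k - mean) ** 2 * p for k, p in zip(support, pmf))

# Closed forms from the law of total expectation and total variance.
mean_closed = gamma_i
var_closed = gamma_i + gamma_i ** 2 / lam
```

The variance exceeds the mean by γ²/λ, so small values of λ correspond to strong overdispersion relative to the Poisson GLM in part (a), and the two models coincide in the limit as λ grows.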

 
