Make sure that you upload the PDF (or HTML) output after you have knitted the file. The file you upload to the Canvas page should be updated with the commands you write to answer each of the questions below. You can edit this file directly to produce your final solutions.
Goals
This lab has two goals. The first goal is to use the accept-reject algorithm to simulate from a mixture of two normals. The second goal is to utilize Bayesian methods and the famous Markov chain Monte Carlo algorithm to estimate the mixture parameter $\delta$.
Background: Mixture of Normals
A mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables (Wiki). In our case we consider a mixture of two normal distributions. Here we assume that our random variable is governed by the probability density f(x), defined by
$$
f(x) = f(x; \mu_1, \sigma_1, \mu_2, \sigma_2, \delta) = \delta f_1(x; \mu_1, \sigma_1) + (1 - \delta) f_2(x; \mu_2, \sigma_2),
$$

where $-\infty < x < \infty$ and the parameter space is defined by $-\infty < \mu_1, \mu_2 < \infty$, $\sigma_1, \sigma_2 > 0$, and $0 \le \delta \le 1$.
The mixture parameter $\delta$ governs how much mass gets placed on the first distribution $f_1(x; \mu_1, \sigma_1)$ and the complement $1 - \delta$ governs how much mass gets placed on the other distribution $f_2(x; \mu_2, \sigma_2)$.
To further motivate this setting, consider simulating n = 10,000 heights from the population of both males and females. Assume that males are distributed normal with mean $\mu_1 = 70$ [in] and standard deviation $\sigma_1 = 3$ [in] and females are distributed normal with mean $\mu_2 = 64$ [in] and standard deviation $\sigma_2 = 2.5$ [in]. Also assume that each distribution contributes equal mass, i.e., set the mixture parameter to $\delta = .5$. The distribution of males is governed by
$$f_1(x; \mu_1 = 70, \sigma_1 = 3),$$

and the distribution of females is governed by

$$f_2(x; \mu_2 = 64, \sigma_2 = 2.5).$$
Below shows the pdfs of $f_1(x; \mu_1, \sigma_1)$, $f_2(x; \mu_2, \sigma_2)$, and the mixture $f(x)$ all on the same plot.
```r
x <- seq(45, 90, by = .05)
n.x <- length(x)
f_1 <- dnorm(x, mean = 70, sd = 3)
f_2 <- dnorm(x, mean = 64, sd = 2.5)
f <- function(x) {
  return(.5 * dnorm(x, mean = 70, sd = 3) + .5 * dnorm(x, mean = 64, sd = 2.5))
}
plot_df <- data.frame(x = c(x, x, x),
                      Density = c(f_1, f_2, f(x)),
                      Distribution = c(rep("Male", n.x), rep("Female", n.x), rep("Mixture", n.x)))
library(ggplot2)
ggplot(data = plot_df) +
  geom_line(mapping = aes(x = x, y = Density, color = Distribution)) +
  labs(title = "Mixture of Normals")
```
Part I: Simulating a Mixture of Normals
The first goal is to simulate from the mixture distribution
$$f(x) = \delta f_1(x; \mu_1, \sigma_1) + (1 - \delta) f_2(x; \mu_2, \sigma_2),$$

where $\mu_1 = 70$, $\sigma_1 = 3$, $\mu_2 = 64$, $\sigma_2 = 2.5$, $\delta = .5$. We use the accept-reject algorithm to accomplish this task.
First we must choose an easy-to-simulate distribution $g(x)$. For this problem, choose $g(x)$ to be a Cauchy distribution centered at 66 with scale parameter 7.
```r
g <- function(x) {
  s <- 7
  l <- 66
  return(1 / (pi * s * (1 + ((x - l) / s)^2)))
}
```
Perform the following tasks:
1) Identify a suitable value of $\alpha$ such that your envelope function $e(x)$ satisfies

$$f(x) \le e(x) = g(x)/\alpha, \quad \text{where } 0 < \alpha < 1.$$

Note that you must choose $\alpha$ so that $e(x)$ is close to $f(x)$. There is not one unique solution to this problem. The plot below shows how $\alpha = .20$ creates an envelope function that is too large. Validate your choice of $\alpha$ with a graphic similar to the one below.
```r
# Choose alpha
alpha <- .20

# Define envelope e(x)
e <- function(x) {
  return(g(x) / alpha)
}

# Plot
x.vec <- seq(30, 100, by = .1)
ggplot() +
  geom_line(mapping = aes(x = x.vec, y = f(x.vec)), col = "purple") +
  geom_line(mapping = aes(x = x.vec, y = e(x.vec)), col = "green")
```
```r
# Is e(x) > f(x) everywhere?
all(e(x.vec) > f(x.vec))
```
## [1] TRUE
Solution
## Solution goes here –
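For reference, here is one possible way to pick a workable value; there is no unique answer. The largest valid $\alpha$ is the minimum of $g(x)/f(x)$, approximated below on a fine grid, and any choice slightly below that keeps $e(x)$ above $f(x)$ while staying close to it.

```r
# Sketch: approximate the largest valid alpha on a grid (a graphical check,
# not a proof over the whole real line)
x.grid <- seq(30, 100, by = .01)
alpha.max <- min(g(x.grid) / f(x.grid))
alpha.max

# Choose alpha a bit below the grid minimum ratio so e(x) stays above f(x)
alpha <- floor(alpha.max * 100) / 100
e <- function(x) { return(g(x) / alpha) }
all(e(x.grid) > f(x.grid))
```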
2) Write a function named r.norm.mix() that simulates n.samps from the normal mixture f(x). To accomplish this task you will wrap a function around the accept-reject algorithm from the lecture notes. Also include the acceptance rate, i.e., how many times the algorithm accepted a draw compared to the total number of trials performed. Your function should return a list of two elements: (i) the simulated vector mixture and (ii) the proportion of accepted cases. Run your function r.norm.mix() to simulate 10,000 cases and display the first 20 values. What's the proportion of accepted cases? Compare this number to your chosen $\alpha$ and comment on the result. The code below should help you get started.
Solution
```r
r.norm.mix <- function(n.samps) {
  # n <- 0                       # counter for number of samples accepted
  # m <- 0                       # counter for number of trials
  # samps <- numeric(n.samps)    # initialize the vector of output
  # while (n < n.samps) {
  #   # Fill in body --
  # }
  # return(list(x = samps, alpha.hat = n.samps / m))
}

# r.norm.mix(n.samps = 10000)$alpha.hat
# alpha
```
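A minimal sketch of one way to fill in the body, assuming the envelope $e(x) = g(x)/\alpha$ from exercise (1); proposals are drawn from the Cauchy(66, 7) distribution via rcauchy().

```r
r.norm.mix <- function(n.samps) {
  n <- 0                        # number of accepted samples
  m <- 0                        # total number of proposals
  samps <- numeric(n.samps)
  while (n < n.samps) {
    y <- rcauchy(1, location = 66, scale = 7)   # draw a proposal from g(x)
    m <- m + 1
    if (runif(1) <= f(y) / e(y)) {              # accept with prob f(y)/e(y)
      n <- n + 1
      samps[n] <- y
    }
  }
  return(list(x = samps, alpha.hat = n.samps / m))
}
```

By construction the expected acceptance probability equals $\alpha$, so alpha.hat should land near your chosen $\alpha$.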
3) Using ggplot or base R, construct a histogram of the simulated mixture distribution with the true mixture pdf f(x) overlaid on the plot.
Solution
## Solution goes here –
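A base-R sketch, assuming r.norm.mix() from exercise (2) is defined; the object name sims is arbitrary.

```r
# Simulate from the mixture and overlay the true pdf
sims <- r.norm.mix(n.samps = 10000)$x
hist(sims, breaks = 50, freq = FALSE,
     xlab = "x", main = "Simulated mixture with true pdf overlay")
curve(f(x), from = 45, to = 90, add = TRUE, col = "blue", lwd = 2)
```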
Part II: Bayesian Statistics and MCMC
Suppose that the experimenter collected 100 cases from the true mixture-normal distribution f(x). To solve problems (4) through (8) we analyze one realized sample from our function r.norm.mix(). In practice this dataset would be collected, not simulated. Uncomment the code below to simulate our dataset x. If you failed to solve Part I, then read in the csv file mixture_data.csv posted on Canvas.
Solution
```r
# Simulate data
# set.seed(1983)
# x <- r.norm.mix(n.samps = 100)$x
# head(x)
# hist(x, breaks = 20, xlab = "X", main = "")
```

```r
# Or read data
x <- read.csv("mixture_data.csv")$x
head(x)
```
## [1] 71.66666 63.91096 67.06554 65.49516 70.34363 65.69982
```r
hist(x, breaks = 20, xlab = "X", main = "")
```
Further, suppose that we know the true means and standard deviations of the two normal distributions but the mixture parameter $\delta$ is unknown. In this case, we know $\mu_1 = 70$, $\sigma_1 = 3$, $\mu_2 = 64$, $\sigma_2 = 2.5$. The goal of this exercise is to utilize maximum likelihood and MCMC Bayesian techniques to estimate the mixture parameter $\delta$.
Maximum Likelihood Estimator of the Mixture Parameter
4) Set up the likelihood function $L(\delta \mid x_1, \ldots, x_{100})$ and define it as mix.like(). The function should have two inputs: the parameter delta and the data vector x. Evaluate the likelihood at the parameter values delta=.2, delta=.4, and delta=.6. Note that all three evaluations will be very small numbers. Which delta ($\delta = .2, .4, .6$) is most likely to have generated the dataset x?
Solution
## Solution goes here –
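For independent draws the likelihood is the product of the mixture densities, $L(\delta \mid x_1, \ldots, x_{100}) = \prod_{i=1}^{100} \left[ \delta f_1(x_i; \mu_1, \sigma_1) + (1 - \delta) f_2(x_i; \mu_2, \sigma_2) \right]$. A minimal sketch:

```r
# Likelihood of delta given data x, with the known means and sds plugged in
mix.like <- function(delta, x) {
  prod(delta * dnorm(x, mean = 70, sd = 3) +
         (1 - delta) * dnorm(x, mean = 64, sd = 2.5))
}

mix.like(delta = .2, x = x)
mix.like(delta = .4, x = x)
mix.like(delta = .6, x = x)
```

The largest of the three evaluations identifies the delta most likely to have generated x.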
5) Compute the maximum likelihood estimator of the mixture parameter $\delta$. To accomplish this task, apply your likelihood function mix.like() across the vector seq(.1, .99, by = .001). The solution to this exercise is given below.
```r
# delta <- seq(.1, .99, by = .001)
# MLE.values <- sapply(delta, mix.like, x = x)
# delta.MLE <- delta[which.max(MLE.values)]
# plot(delta, MLE.values, ylab = "Likelihood", type = "l")
# abline(v = delta.MLE, col = "blue")
# text(x = delta.MLE + .08, y = mix.like(delta = .45, x = x),
#      paste(delta.MLE), col = "blue")
```
MCMC
6) Run the Metropolis-Hastings algorithm to estimate the mixture parameter $\delta$. In this exercise you will assume a Beta($\alpha = 10$, $\beta = 10$) prior distribution on the mixture parameter $\delta$. Some notes follow:
- Run 20000 iterations, i.e., simulate 20000 draws of $\delta^{(t)}$.
- Proposal distribution Beta($\alpha = 10$, $\beta = 10$), an independence chain with Metropolis-Hastings ratio:

$$R(\delta^{(t)}, \delta^*) = \frac{L(\delta^* \mid x_1, \ldots, x_{100})}{L(\delta^{(t)} \mid x_1, \ldots, x_{100})}$$

Display the first 20 simulated cases of $\delta^{(t)}$.
Solution
## Solution goes here –
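A minimal sketch of an independence-chain sampler, assuming mix.like() from exercise (4) and the data vector x. Because the Beta(10, 10) proposal equals the prior, the prior and proposal densities cancel and the MH ratio reduces to the likelihood ratio displayed above. The seed choice and object names are arbitrary.

```r
set.seed(1)
n.iter <- 20000
delta_t_vec <- numeric(n.iter)
delta_t_vec[1] <- rbeta(1, shape1 = 10, shape2 = 10)   # initialize from prior
for (t in 2:n.iter) {
  delta.star <- rbeta(1, shape1 = 10, shape2 = 10)     # propose from prior
  R <- mix.like(delta.star, x = x) / mix.like(delta_t_vec[t - 1], x = x)
  if (runif(1) < min(R, 1)) {
    delta_t_vec[t] <- delta.star                       # accept the proposal
  } else {
    delta_t_vec[t] <- delta_t_vec[t - 1]               # reject: keep state
  }
}
head(delta_t_vec, 20)
```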
7) Construct a lineplot of the simulated Markov chain from exercise (6). The vertical axis is the simulated chain $\delta^{(t)}$ and the horizontal axis is the number of iterations.
Solution
## Solution goes here –
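A base-R sketch, assuming the chain delta_t_vec from exercise (6):

```r
plot(delta_t_vec, type = "l", xlab = "Iteration",
     ylab = expression(delta^{(t)}), main = "Trace plot: Prior Beta(10,10)")
```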
8) Plot the empirical autocorrelation function of your simulated chain $\delta^{(t)}$, i.e., run the function acf(). A quick decay of the chain's autocorrelations indicates good mixing properties.
Solution
```r
# acf(delta_t_vec, main = "ACF: Prior Beta(10,10)")
```
9) Compute the empirical Bayes estimate $\hat{\delta}_B$ of the simulated posterior distribution $\pi(\delta \mid x_1, \ldots, x_n)$. To solve this problem, simply compute the sample mean of your simulated chain $\delta^{(t)}$ after discarding a 20% burn-in.
Solution
## Solution goes here –
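A minimal sketch, assuming delta_t_vec from exercise (6); the object names are arbitrary:

```r
burn.in <- floor(.2 * length(delta_t_vec))     # discard first 20% as burn-in
delta.bayes <- mean(delta_t_vec[-(1:burn.in)])
delta.bayes
```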
10) Construct a histogram of the simulated posterior $\pi(\delta \mid x_1, \ldots, x_n)$ after discarding a 20% burn-in.
Solution
## Solution goes here –
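A base-R sketch, assuming delta_t_vec and burn.in from exercise (9):

```r
hist(delta_t_vec[-(1:burn.in)], breaks = 30, xlab = expression(delta),
     main = "Posterior: Prior Beta(10,10)")
```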
11) Run the Metropolis-Hastings algorithm to estimate the mixture parameter $\delta$ using a Beta($\alpha = 15$, $\beta = 2$) prior distribution on the mixture parameter $\delta$. Repeat exercises 6 through 10 using the updated prior.
Solution
## Solution goes here –
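A minimal sketch, assuming mix.like() and x from above. With the Beta(15, 2) prior also serving as the independence proposal, the prior and proposal densities again cancel and the MH ratio reduces to the same likelihood ratio as in exercise (6).

```r
set.seed(1)
n.iter <- 20000
delta_t_vec2 <- numeric(n.iter)
delta_t_vec2[1] <- rbeta(1, shape1 = 15, shape2 = 2)   # initialize from prior
for (t in 2:n.iter) {
  delta.star <- rbeta(1, shape1 = 15, shape2 = 2)      # propose from prior
  R <- mix.like(delta.star, x = x) / mix.like(delta_t_vec2[t - 1], x = x)
  delta_t_vec2[t] <- if (runif(1) < min(R, 1)) delta.star else delta_t_vec2[t - 1]
}
head(delta_t_vec2, 20)
```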
Lineplot:
Construct a lineplot of the simulated Markov chain from exercise (11). The vertical axis is the simulated chain $\delta^{(t)}$ and the horizontal axis is the number of iterations.
Solution
## Solution goes here –
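A sketch, assuming the chain delta_t_vec2 from the Beta(15, 2) run above:

```r
plot(delta_t_vec2, type = "l", xlab = "Iteration",
     ylab = expression(delta^{(t)}), main = "Trace plot: Prior Beta(15,2)")
```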
ACF:
Plot the empirical autocorrelation function of your simulated chain $\delta^{(t)}$, i.e., run the function acf(). A slow decay of the chain's autocorrelations indicates poor mixing properties.
Solution
## Solution goes here –
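Assuming delta_t_vec2 from the Beta(15, 2) run above:

```r
acf(delta_t_vec2, main = "ACF: Prior Beta(15,2)")
```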
Bayes estimate:
Compute the empirical Bayes estimate $\hat{\delta}_B$ of the simulated posterior distribution $\pi(\delta \mid x_1, \ldots, x_n)$. To solve this problem, simply compute the sample mean of your simulated chain $\delta^{(t)}$ after discarding a 20% burn-in. Your answer should be close to the MLE.
Solution
## Solution goes here –
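A sketch, assuming delta_t_vec2 from above; the result can be compared against delta.MLE from exercise (5):

```r
burn.in2 <- floor(.2 * length(delta_t_vec2))
mean(delta_t_vec2[-(1:burn.in2)])   # compare to delta.MLE
```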
Posterior: Construct a histogram of the simulated posterior $\pi(\delta \mid x_1, \ldots, x_n)$ after discarding a 20% burn-in.
Solution
## Solution goes here –
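A sketch, assuming delta_t_vec2 and burn.in2 from above:

```r
hist(delta_t_vec2[-(1:burn.in2)], breaks = 30, xlab = expression(delta),
     main = "Posterior: Prior Beta(15,2)")
```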