
---
title: "STAT340 Lecture 05: Estimation"
author: and Wu
date: September 2021
output: html_document
---


These notes will discuss the problem of *estimation*. They are based in part on notes by [Karl Rohe](http://pages.stat.wisc.edu/~karlrohe/index.html).

Estimation refers to the task of giving a value or range of values that are a good guess about some quantity out there in the world.

Often this quantity is the parameter of a model, for example, the mean of a distribution.

__Example:__ Human heights

Let's think back to our human height example.
Recall that our goal was to determine the average human height, $\mu$.

We said that it was infeasible to measure the height of every human, but we could measure the heights $X_1,X_2,\dots,X_n$ of a few thousand humans and report the mean of that sample (the sample mean), $\hat{\mu} = n^{-1} \sum_{i=1}^n X_i$, where $n$ is the number of humans in our sample.
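For concreteness, here is a minimal sketch in R of computing a sample mean from simulated data; the sample size and the normal distribution with mean 172 cm and standard deviation 7 cm are illustrative assumptions, not real measurements.

```{r}
# Minimal sketch (illustrative numbers, not real data): simulate a sample of
# heights and compute the sample mean, our estimate of the population mean.
heights <- rnorm( 3000, mean=172, sd=7 ); # Hypothetical sample of 3000 measured heights (cm)
mu_hat <- mean( heights );                # The sample mean, "mu hat"
mu_hat
```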

Thus, we might report the value of $\hat{\mu}$ (say, 172.1 cm) and state that "we estimate the average human height to be $172.1$ cm."

Facts from probability theory (specifically, the law of large numbers, which we'll talk about soon) state that this sample mean $\hat{\mu}$ is close to the true population mean $\mu$.

But how close is close?

In addition to our estimate $\hat{\mu}$, we would like to have some kind of notion of how certain we are in our estimate.

In another direction, if we say that we estimate the average human height to be 172.1 cm, we might also be willing to say that $172.3$ cm or $171.8$ cm are also reasonable estimates.

If you have seen *confidence intervals* (CIs) before, both of these ideas should sound somewhat familiar.

## Learning objectives

After this lesson, you will be able to

* Explain the statistical task of estimation and give examples of real-world estimation problems.
* Define the concept of a *statistic* and explain how/why we view a statistic as a random quantity.
* Explain the difference between an estimate and an estimator.
* Explain the concept of confidence intervals and apply one or more methods to build a confidence interval for a given estimation problem.

## Statistical Estimation

__Example:__ The Universal Widgets of Madison (UW-Madison) company manufactures [widgets](https://en.wiktionary.org/wiki/widget).
Their widget machine produces widgets all day.

Unfortunately, making widgets is hard, and not all widgets produced by the machine are functional.

Due to randomness in the manufacturing process, a widget is functional with probability $p$, and dysfunctional with probability $1-p$.

UW ships widgets in batches, and they want to ensure that every batch ships with at least 5 functional widgets in it.

Thus, we have two (related) questions to answer:

1. What is a good estimate for $p$?
2. How many widgets should be in a batch to ensure that (with high probability) a batch ships with at least $5$ functional widgets in it?

We will focus on the first of these two questions, but in the course of these lectures you will see how to address the second question quite easily.
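As a preview of why the second question becomes easy, note that if we (hypothetically) knew $p$, the binomial CDF would answer it directly. Here is a minimal sketch; the value $p = 0.8$ and the 99% target are assumptions chosen only for illustration.

```{r}
# Sketch (assumed values, for illustration only): if p were known, find the
# smallest batch size such that P(at least 5 functional widgets) >= 0.99.
p <- 0.8; target <- 0.99;
batch_size <- 5;
# pbinom(4, ..., lower.tail=FALSE) computes P(Binomial(batch_size, p) >= 5).
while( pbinom(4, size=batch_size, prob=p, lower.tail=FALSE) < target ) {
  batch_size <- batch_size + 1;
}
batch_size # Smallest batch size meeting the target, for these assumed values.
```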

__Step 1: Specify a model__

All statistics starts with choosing a model for the world, so let's start there.

What would be a good model for this setting?

Since the outcome of interest here is binary (i.e., it is a yes/no or success/failure outcome), it is natural to model whether a widget is functional or dysfunctional as a Bernoulli random variable with success probability $p$.

That is, we model each widget as being functional with probability $p$ and dysfunctional with probability $1-p$.

Let us assume for now that widgets are __independent__, so that the fact that one widget is functional or not has no bearing on whether or not another widget is functional.

So, we will make the following assumption: *widgets are functional independently with probability $p$*.

Let's implement this model in R.

```{r}
n <- 200; # We will examine n=200 widgets
p <- 0.8; # Suppose that 80% of widgets are functional
functional_widgets <- rbinom(1, size=n, p); # Draw one sample of widgets.
cat(functional_widgets); # How many of the n widgets are functional?
```

__Question:__ why is the binomial distribution the right thing to use, here?

__Step 2: Estimating $p$__

Suppose that we can collect data by observing widgets $1,2,\dots,n$.

Let's denote our data by $X_1,X_2,\dots,X_n$, where $X_i=1$ if the $i$-th widget is functional and $X_i=0$ if it is dysfunctional.

If we examine enough widgets, we know that we can estimate $p$ very well (again, we'll see that more formally soon).

More specifically, the more widgets we examine, the more accurate our estimate will be.

Unfortunately, widgets aren't free. So, here are two questions:

1. If we are willing to tolerate an error of, say, 2%, how many widgets do we need to examine?
2. Suppose we examine 1000 widgets and observe that 882 of them are functional, so we estimate $p$ to be $882/1000 = 0.882$. How close is this to the true value of $p$?

Question 1 is a question about *experiment design*. Specifically, it is a question about *sample size*: how many observations (i.e., how much data) do we need to collect in order to get a certain level of estimation accuracy?

Question 2 is a question about the accuracy of a specific estimate, namely the sample mean.

We will see below that these two questions are, in a certain sense, two sides of the same coin.

So, to start, what do we mean when we say that our estimate will be close to $p$?

Let's see this in action with a simulation.

```{r}
n <- 200; p <- 0.8; # Still n=200 widgets, 80% of which are functional.
functional_widgets <- rbinom(1000, size=n, p); # Same experiment as above, but repeat it 1000 times.
hist( functional_widgets/n ); # Plot estimates of p, #functional/#observations
abline( v=p, col='red', lwd=4 ); # Vertical line at the true value of p
```

Let's pause and make sure we understand the experiment we just ran.

Each data point in the above plot corresponds to a single instance of our experiment, in which we generate $n=200$ widgets, each of which is functional with probability $p=0.8$ (indicated in red in the plot).

To estimate $p$, we count up what fraction of the $200$ widgets in our sample are functional.

Since the data are random, our estimate of $p$ is also random. The histogram above illustrates that randomness.

Sometimes our estimate is a bit higher than the true value of $p$, sometimes it is lower. But as we can see, most of the time our estimate is close to $p$.

## Aside: Estimators, Estimates and Statistics

Before continuing our investigation of widgets, let's take a moment to discuss things in more generality.

Suppose we have our data $X_1,X_2,\dots,X_n$.

If we performed another experiment, we would presumably see a different set of values for our data. That is reflected in the fact that we model $X_1,X_2,\dots,X_n$ as random variables.

So, in our example above, $X_i$ is a Bernoulli random variable representing whether or not widget $i$ is functional. We might observe that six out of ten widgets are functional, but that could be entirely due to chance: on another day we might observe that seven out of ten are functional, or four out of ten.

We typically summarize our data with a *statistic*, say $S(X_1,X_2,\dots,X_n)$.

In our example above, we summarized the data with the sample mean $S(X_1,X_2,\dots,X_n) = n^{-1} \sum_{i=1}^n X_i$, but this statistic $S$ can be any function of your data. You'll explore some in your homework.

We usually choose the function $S$ so that $S(X_1,X_2,\dots,X_n)$ will tend to be close to a quantity of interest. We call this function $S$ an *estimator* for that quantity of interest.

In our widgets example, we are interested in estimating the probability $p$, and we chose our statistic to be the sample mean of the data (i.e., the fraction of functional widgets). That is, we used the sample mean as our estimator for $p$.

We call a particular value of this estimator (i.e., $S$ applied to a particular choice of data) an *estimate*. So, if we observe 162 functional widgets in our sample of $n=200$ widgets, our estimate of $p$ is $162/200 = 0.81$.

Now, since the data $X_1,X_2,\dots,X_n$ are random, and $S = S(X_1,X_2,\dots,X_n)$ is a function of the data, our statistic $S$ is also random.

So, in just the same way that $X_i$ has a distribution (e.g., $X_i \sim \operatorname{Bernoulli}(p)$ above), $S$ also has a distribution. We usually call this distribution the *sampling distribution*, because it describes the behavior of our statistic, which is a function of the sample.

## More data, better accuracy

So, let's turn back to the first of our two questions: if we are willing to tolerate an error of, say, 2%, how many widgets do we need to examine?

Well, let's start by looking again at the histogram of estimates from 1000 different runs with $n=200$ and $p=0.8$.

```{r}
n <- 200; p <- 0.8; # Still n=200 widgets, 80% of which are functional.
functional_widgets <- rbinom(1000, size=n, p); # Same experiment as above, but repeat it 1000 times.
hist( functional_widgets/n ); # Plot the estimates of p, #functional/#observations
abline( v=p, col='red', lwd=4 ); # Vertical line at the true value of p
```

Most of the estimates are between $0.72$ and $0.88$.

Let's try increasing $n$ from $n=200$ to $n=400$. That is, let's try gathering more data.

```{r}
n <- 400; p <- 0.8; # n=400 widgets instead of 200, but still 80% are functional.
functional_widgets <- rbinom(1000, size=n, p); # Repeat the experiment 1000 times.
hist( functional_widgets/n ); # Plot the estimates of p, #functional/#observations
abline( v=p, col='red', lwd=4 ); # Vertical line at the true value of p
```

If you compare this plot to the one above, you'll see that the values are more tightly concentrated about $p=0.8$. That's because we have more data.

In fact, let's just display them both in one plot.

```{r}
p <- 0.8; # Still 80% of widgets are functional.
functional_widgets_200 <- rbinom(1000, size=200, p);
functional_widgets_400 <- rbinom(1000, size=400, p);

# Put the data into a data frame to pass to ggplot2.
phat <- c(functional_widgets_200/200, functional_widgets_400/400 ); # "p hat", i.e., estimate of p.
n <- c( rep(200, 1000), rep(400, 1000) );
df <- data.frame( 'n'=as.factor(n), 'phat'=phat);

library(ggplot2)
pp <- ggplot( df, aes(x=phat, color=n, fill=n));
pp <- pp + geom_histogram( aes(), position='identity', alpha=0.5, binwidth=0.01);
pp <- pp + geom_vline( xintercept=p, color='red');
pp
```

Looking at the plot, we see that the $n=400$ estimates tend to cluster more tightly around the true value of $p$ ($p=0.8$, indicated in red) compared with the $n=200$ estimates.

Gathering more data (i.e., observing more widgets) gives us a more accurate (on average!) estimate of $p$.

Just to drive this home, let's increase $n$ even more.

```{r}
p <- 0.8; # Still using 80% functional rate.
widgets_100 <- rbinom(1000, size=100, p); # Note: there are "cleaner" ways to build this data frame,
widgets_200 <- rbinom(1000, size=200, p); # but those ways are harder to understand at first glance.
widgets_400 <- rbinom(1000, size=400, p); # At this stage of your career, "clumsy but easy to read"
widgets_800 <- rbinom(1000, size=800, p); # is better than "short but cryptic".

# Put the data into a data frame to pass to ggplot2.
phat <- c(widgets_100/100, widgets_200/200, widgets_400/400, widgets_800/800 ); # "p hat", i.e., estimate of p.
n <- c( rep(100, 1000), rep(200, 1000), rep(400, 1000), rep(800, 1000) );
df <- data.frame( 'n'=as.factor(n), 'phat'=phat);

pp <- ggplot( df, aes(x=phat, color=n ));
pp <- pp + geom_density( size=2 ); # Using a smoothed density instead of a histogram for easy comparison
pp <- pp + geom_vline( xintercept=p, color='red', size=2);
pp
```

So more data (increasing $n$) gives us a more accurate estimate (i.e., makes our estimate concentrate closer to the true $p$).

But we started out asking about how to *guarantee* that our estimate is close to $p$. There is a problem with this, though: our data are random, and sometimes we get unlucky. So we can never guarantee that our estimate is close.

Let's take a short aside to make this more precise.

## Aside: probability of "bad" events

We saw in our simulation above that our estimate $\hat{p}$ of $p$ was usually close to $p$, and making $n$ bigger (i.e., collecting more data) meant that $\hat{p}$ was closer to $p$, on average.

Can we guarantee that, if $n$ is big enough, then $\hat{p}$ will be arbitrarily close to $p$? Unfortunately, the answer is no.

To see why this is the case, let's consider a very specific event: the event that all $n$ of our widgets are functional. When this happens, our estimate of $p$ is
$$
\hat{p} = n^{-1} \sum_{i=1}^n X_i = n^{-1} n = 1.
$$

This event has probability
$$
\Pr[ X_1=1, X_2=1, \dots, X_n = 1 ] = \prod_{i=1}^n \Pr[ X_i = 1 ] = p^n.
$$

Now, unless $p=0$, this means that with some positive probability, the event $X_1=X_2=\cdots=X_n=1$ occurs. That is, no matter how large $n$ is, there is still some small but positive probability that our estimate is simply $\hat{p} = 1$.

What that means is that we can never give a 100% guarantee that our estimate is arbitrarily close to the true value of $p$.

Now, with that said, notice that as $n$ gets larger, the probability of this bad "all widgets are functional" event gets smaller and smaller. Roughly speaking, this is what we mean when we say that more data gives us a more accurate estimate: the probability that our estimate is far from the true value of $p$ gets smaller and smaller as we increase $n$.

The law of large numbers, which we will discuss soon, will let us say something both stronger and more precise than this, but the above example is a good illustration of the core idea.

## More data, better accuracy, part II

Instead of trying to do more math, let's try to code up an experiment to get a handle on this.

Let's simplify things a bit by writing a function that will generate a random copy of $S(X_1,X_2,\dots,X_n)$ given a choice of $n$ and the true value of $p$.

```{r}
simulate_S <- function( n, p ) {
  functional_widgets <- rbinom(1, size=n, prob=p);
  return(functional_widgets/n); # Our statistic is the fraction of the n widgets that are functional.
}

simulate_S(200, 0.8) # Simulate n=200 widgets with functional probability p=0.8
```

Now, we want to use this function to estimate the probability that our estimate is within $0.02$ of $p$. That is, we want to estimate
$$
\Pr\left[ S \in (p-0.02, p+0.02) \right]
= \Pr\left[ \left| S(X_1,X_2,\dots,X_n) - p \right| < 0.02 \right].
$$

We *could* explicitly compute this number. After all, we know how to compute the probability distribution of the Bernoulli and/or Binomial distributions. But instead, let's just use Monte Carlo estimation.

```{r}
# Here's a function that will take our estimate S (= phat) and check if it is within 0.02 of p or not.
check_if_S_is_good <- function( S, p ) {
  return( abs(S-p) < 0.02 )
}

# Now, let's simulate a lot of instances of our experiment
# and count up what fraction of the time our estimate is "good".
N_MC <- 2000; # Repeat the experiment 2000 times. N_MC = "number of Monte Carlo (MC) replicates"
n <- 200; p <- 0.8; # Still using n=200, p=0.8

library(tidyverse) # We're going to use some tidyverse tools to make things easier to read

# Create a data frame to store the outcome of our experiment.
# We are initially filling entries with NAs, which we will fill in as we run.
monte_carlo <- data.frame(replicate = 1:N_MC, S = rep(NA, N_MC), S_good = rep(NA, N_MC));

# Let's just check what the data frame looks like before we populate it.
head( monte_carlo )

# For each replicate, run the experiment and record results.
# We want to keep track of the value of S and whether or not S was good.
for(i in 1:N_MC){
  monte_carlo$S[i] <- simulate_S( n, p );
  monte_carlo$S_good[i] <- check_if_S_is_good( monte_carlo$S[i], p );
}

monte_carlo <- as_tibble(monte_carlo);
monte_carlo %>% summarise( `P(|S - p| < .02)` = mean(S_good) )
```

So about half of our estimates were within 0.02 of $p$.

Our experiments above suggested that we could improve this by increasing $n$, so let's try that.

```{r}
# This is the exact same setup, except we're changing n from 200 to 400.
N_MC <- 2000; n <- 400; p <- 0.8; # Still using p=0.8 and 2000 Monte Carlo trials.

# Note that we don't really have to create the data frame again.
# We could, if we wanted, just overwrite it, but this is a good habit to be in,
# to make sure we don't accidentally "reuse" old data.
monte_carlo <- data.frame(replicate = 1:N_MC, S = rep(NA, N_MC), S_good = rep(NA, N_MC));

for(i in 1:N_MC){
  monte_carlo$S[i] <- simulate_S( n, p );
  monte_carlo$S_good[i] <- check_if_S_is_good( monte_carlo$S[i], p );
}

monte_carlo <- as_tibble(monte_carlo);
monte_carlo %>% summarise( `P(|S - p| < .02)` = mean(S_good) )
```

That's an improvement! But still, about 30% of the time we're going to be more than 0.02 away from $p$...

```{r}
# Just as a check, let's plot a histogram again.
monte_carlo %>% ggplot(aes(x = S)) + geom_histogram(bins = 30) + geom_vline( xintercept=p, col='red' )
```

__Exercise:__ play around with $n$ in the above code to find how large our sample size has to be so that $\Pr[ |S-p| \le 0.02 ] \approx 0.95$.
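If it helps, here is one optional starting point for this exercise: a small helper, `estimate_coverage` (a name introduced here, not defined in the notes), that wraps the Monte Carlo loop above in a function of $n$, so that trying different sample sizes only requires changing one argument.

```{r}
# Optional helper for the exercise (a sketch; it reuses simulate_S and check_if_S_is_good from above).
estimate_coverage <- function( n, p=0.8, N_MC=2000 ) {
  S <- replicate( N_MC, simulate_S( n, p ) ); # N_MC copies of our statistic
  mean( check_if_S_is_good( S, p ) )          # Fraction of replicates with |S - p| < 0.02
}
estimate_coverage( 400 ) # Should roughly match the Monte Carlo estimate above for n=400.
```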

So we can never have a perfect guarantee that our estimate is within, say, $0.02$ of the truth.

Instead, we have to settle for guarantees like "with probability 0.95, our estimate is within $0.02$ of the truth."

This is the idea behind confidence intervals.
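As a preview, here is a minimal sketch of one common way to turn this idea into an interval, the normal-approximation confidence interval; later lectures may build confidence intervals with this or a different construction. The numbers reuse the hypothetical 882-out-of-1000 widgets from above.

```{r}
# Sketch of a normal-approximation 95% confidence interval for p (one common method;
# not necessarily the construction developed later in these notes).
n <- 1000; x <- 882;            # Hypothetical data: 882 functional widgets out of 1000 observed
phat <- x/n;                    # Point estimate of p
se <- sqrt( phat*(1-phat)/n );  # Estimated standard error of phat
c( phat - 1.96*se, phat + 1.96*se ) # Approximate 95% CI: estimate plus/minus 1.96 standard errors
```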

## What are we trying to estimate?

A lot of this discussion may feel a bit like hypothesis testing.

But this is different. We are interested in

1. Estimating $p$ and
2. Knowing how good our estimate is (i.e., how likely is it that we are within 0.02 of the truth?).

Generally, once we compute the statistic $S$, we could just report it and be done with it: "we estimate $p$ to be $0.785$," and leave it at that.

But this leaves open the question of how close our estimate is to the truth.

If we knew $p$, like in the examples above, we could say how close we are, but we don't know $p$.

So, how can we say how close we are without knowing the true value of the thing we are estimating?

Above, $p$ was defined as a parameter in a model. However, parameters can often be thought of in other ways. Here are two:

* Imagine getting an infinite amount of data. What would be the value of $S$ with an infinite amount of data?
* Imagine repeating the whole experiment lots of different times and on experiment $i$ you created statistic $S_i$. What is the average of those statistics $S_1,S_2,dots$?

For most functions of the data $S$, these two values are the same thing (though the first one might be a bit easier to think about).
However, if they are different (and sometimes they are), it is the second one that we are actually going to use.
That second value is in fact the expected value of the statistic, $\mathbb{E} S(X_1,X_2,\dots,X_n)$, which we will often shorten to just $\mathbb{E} S$, with it being understood that $S$ depends on the data $X_1,X_2,\dots,X_n$.

__Example:__ The maximum of the $X_i$ is one statistic for which those two notions are not the same. So is the minimum. Why?
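To see why, here is a small sketch using Bernoulli data with illustrative values of $n$ and $p$. With an infinite amount of data, the maximum of the $X_i$ is $1$ (as long as $p > 0$), but the average of the maximum over repeated experiments with a fixed $n$ is $1-(1-p)^n$, which is strictly less than $1$.

```{r}
# Sketch (illustrative values): the two notions disagree for the maximum.
p <- 0.1; n <- 5;
S_max <- replicate( 10000, max( rbinom(n, size=1, prob=p) ) ); # Max of n Bernoulli(p) draws, repeated
mean( S_max )   # Approximates E[max] = 1 - (1-p)^n, about 0.41 here...
1 - (1-p)^n     # ...not 1, the value the maximum would take with infinitely many observations.
```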

So, to recap, here is the problem:

* $S$ is a random variable.
* We only observe one example of it, but we want to estimate $\mathbb{E} S$.

## Point estimation: a good place to begin

So, we want an estimate of $\mathbb{E} S$. Well, what better estimate than $S$ itself?

This isn't an arbitrary decision; there are good mathematical reasons behind it.

We've mentioned the law of large numbers (LLN) a couple of times already this semester. Let's look at it a bit more closely.

The *weak law of large numbers* states that if $X_1,X_2,\dots$ are independent and identically distributed random variables with mean $\mu$, then the sample mean $n^{-1} \sum_{i=1}^n X_i$ converges in probability to $\mu$ as $n$ grows.
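In symbols (a standard statement of the weak LLN, included here for completeness):
$$
\lim_{n \to \infty} \Pr\left[ \left| n^{-1} \sum_{i=1}^n X_i - \mu \right| > \epsilon \right] = 0
\quad \text{for every } \epsilon > 0,
$$
where the $X_i$ and $\mu$ are as above.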
