Wenhao Jiang
Logistics
Part 1: Discrete Dependent Variable
1.1 Binary Response
## load data
load("data/dat.RData")
1.2 Bernoulli Distribution
• When the outcome is binary, Yi ∈ {0,1}, it is common to assume that it follows a Bernoulli distribution. A Bernoulli distribution has one parameter, which we might call π, that represents the probability of a "success." By convention, Yi = 1 is a success and Yi = 0 is a failure; so, πi = Pr[Yi = 1].
• An example of the Bernoulli distribution is the toss of a biased coin. We can plot the expected probability of getting a "head" or a "tail" given how the coin is biased (a short plotting sketch follows this list).
• To model binary outcomes, we assume that the observed outcome for each unit of observation i follows a Bernoulli distribution with a parameter πi that can be predicted from a set of predictors, Xi.
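A minimal plotting sketch for the biased-coin example, assuming a hypothetical bias of π = 0.7 (the value is illustrative, not taken from the data):

## probability of "tail" (0) and "head" (1) for a biased coin
## (pi = 0.7 is an assumed, illustrative value)
pi_biased <- 0.7
barplot(c("tail (0)" = 1 - pi_biased, "head (1)" = pi_biased),
        ylab = "Probability", ylim = c(0, 1))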
1.3 The Linear Probability Model (LPM)
• Suppose that we have a binary outcome Yi ∈ {0,1}, where we think that the success probability depends on a set of predictors Xi = {Xi1, Xi2, …, Xik}. We might write this as πi = Pr[Yi = 1|Xi] = π(Xi), where the subscript i shows that the success probability will vary across units in our sample, and where we have emphasized that π(·) is a function of a set of predictors.
• In the LPM, we assume that π(Xi) is a linear function of the predictors. That is,
π(Xi) = E[Yi|Xi] = β0 + β1Xi1 + ··· + βkXik, i = 1,2,…,n.
• Notice that the equation looks exactly the same as the linear regression model we have considered in the previous labs. The only difference is that the outcome is a binary variable.
• The Linear Probability Model can be estimated with the same method we use for a regular linear model. In R, simply use lm(). In our case, we model the outcome variable trump using all the predictors in the dataframe.
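The model-fitting call itself was not preserved in the output below; a minimal sketch, assuming the outcome column is trump and all remaining columns of dat are used as predictors:

## fit the linear probability model
## (formula assumed: trump regressed on all other columns of dat)
lpm <- lm(trump ~ ., data = dat)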
## print summary
summary(lpm)
##
## Call:
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0409 -0.1428 0.0118 0.1296 1.0514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.290119 0.097759 -2.968 0.003059 **
## pid 0.170488 0.003975 42.889 < 2e-16 ***
## log_inc 0.028229 0.009199 3.069 0.002198 **
## female -0.042548 0.017531 -2.427 0.015369 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3036 on 1217 degrees of freedom
## Multiple R-squared:  0.6245, Adjusted R-squared:  0.6232
## F-statistic:   506 on 4 and 1217 DF,  p-value: < 2.2e-16
• The interpretation of the LPM coefficients is straightforward: holding the other variables constant, a one-unit increase in Xk changes the probability that Y = 1 by βk. For example, holding the other predictors constant, a one-unit increase in pid is associated with a 0.170 increase in the predicted probability of voting for Trump.
• We can plot the LPM’s predicted probability of voting for Trump by Party ID.
## create new dataset for predictions
pred_dat <- data.frame(
  pid = 0:6,                      # pid values to get predictions for
  log_inc = median(dat$log_inc,   # fix income at its median
                   na.rm = TRUE),
  female = 0                      # fix gender at male
)

## create a new df of predicted probabilities of voting for Trump
yhat <- cbind(pid = 0:6,
              predict(lpm, newdata = pred_dat,
                      type = "response", interval = "confidence")) %>%
  as_tibble()
## plot predicted probabilities (save plot in object lpm_plot)
lpm_plot <- yhat %>%
  ggplot(aes(x = pid, y = fit)) +
  geom_line(col = "black") +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              fill = "grey", alpha = .5, col = NA) +
  scale_y_continuous(name = "Predicted Probability",
                     breaks = seq(0, 1, .25)) +
  scale_x_continuous(name = "Party Identification",
                     breaks = seq(0, 6, 1)) +
  geom_hline(yintercept = c(0, 1), linetype = 2) +
  geom_vline(xintercept = seq(0, 6, 1), linetype = 3, col = "grey") +
  theme_classic() +
  ggtitle("Probability of Voting for Trump by Party ID",
          subtitle = "Results from Linear Probability Model")

## print the plot
print(lpm_plot)
[Figure: "Probability of Voting for Trump by Party ID" (Linear Probability Model); x-axis: Party Identification, y-axis: Predicted Probability]
• There are three things to notice:
– As the name of the model suggests, the predicted probability of voting for Trump is increasing linearly with our predictor.
– We see that the confidence interval at pid = 0 (i.e., "Strong Democrats") and both the confidence interval and the fitted value at pid = 6 (i.e., "Strong Republicans") take on impossible values: a probability cannot be smaller than zero or larger than one. We can verify this directly in R (see the sketch after this list).
– The errors cannot be homoskedastic: because Yi is binary, the variance of the error term depends on Xi.
• The above issues (especially the second one) motivate a logistic regression model (a member of the class of Generalized Linear Models (GLMs)) in which the predicted probabilities are guaranteed to lie between zero and one, and in which the relationship between the predictors and the predicted probability is allowed to be non-linear.
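A quick check of the out-of-range predictions, assuming the lpm object fitted above:

## count LPM fitted values that fall outside the [0, 1] interval
sum(fitted(lpm) < 0 | fitted(lpm) > 1)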
Part 2: Logistic Regressions
2.1 Motivations and Basics
• To bound the predicted probability within 0 and 1, we turn to the Sigmoid Function
• f(x) = 1 / (1 + e^(−x))
• The Sigmoid Function has two desired properties (a short plotting sketch follows this list)
– 1. f(x) is bounded between 0 and 1
– 2. x is unbounded; it can take any real value
• The Sigmoid Function is therefore a good candidate to model the probabilities of some categorical dependent variable (e.g., voted for Trump πi = π(Xi)), given some observed characteristics Xi
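A minimal sketch that plots the sigmoid over a wide range of x to show both properties (the range of x values is arbitrary):

## plot the sigmoid function; the output stays between 0 and 1 for any x
x <- seq(-10, 10, by = 0.1)
plot(x, 1 / (1 + exp(-x)), type = "l",
     ylab = "f(x)", main = "Sigmoid function")
abline(h = c(0, 1), lty = 2)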
2.2 Properties of Sigmoid and Logit
• The sigmoid function and the logit function are inverses of each other
• Sigmoid function: y = 1 / (1 + e^(−x))
• Inverse of the sigmoid function: x = 1 / (1 + e^(−y)), which solves to y = log(x / (1 − x)), i.e., the logit function
• 1 / (1 + e^(−x)) is bounded between 0 and 1. Correspondingly, the x in log(x / (1 − x)) is bounded between 0 and 1 (a quick numeric check follows this list)
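In R, the sigmoid (inverse-logit) and logit functions are available as plogis() and qlogis(); a quick numeric check that they undo each other:

## qlogis() is the logit, plogis() is the sigmoid (inverse logit)
p <- 0.3
qlogis(p)           # log(0.3 / 0.7) = -0.8472979
plogis(qlogis(p))   # recovers 0.3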
2.3 Odds and Odds Ratios
• We call the term πi / (1 − πi) inside the log() function the "odds" (the probability of the "event" divided by the probability of no "event")
• Odds = πi / (1 − πi) = exp(β0 + β1Xi1 + ··· + βkXik)
• We introduce the notion of odds ratio to interpret βk
• Odds ratio = Odds(Xi1 + 1) / Odds(Xi1) = exp(β1) (see the derivation below)
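To see why the odds ratio reduces to exp(β1), divide the odds when Xi1 increases by one unit by the odds at Xi1, holding the other predictors fixed; everything except the β1 term cancels:

\[
\frac{\text{Odds}_{X_{i1}+1}}{\text{Odds}_{X_{i1}}}
  = \frac{\exp\big(\beta_0 + \beta_1 (X_{i1} + 1) + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}\big)}
         {\exp\big(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}\big)}
  = \exp(\beta_1)
\]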
2.4 Estimate Logistic Regression in R
• Fitting a logistic regression in R is straightforward. If we use the same predictors as those of the LPM discussed above, the code to fit the logistic regression is:
## fit the logistic regression
## (formula assumed: trump regressed on all other columns, as in the LPM)
l_reg <- glm(
  trump ~ .,                          # same outcome and predictors as the LPM
  family = binomial(link = "logit"),  # logistic regression
  data = dat                          # data
)
## check the class of the object
class(l_reg)
## [1] "glm" "lm"
## summarize results
summary(l_reg)
##
## Call:
## data = dat)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.47611 1.11850 -5.790 7.04e-09 ***
## pid 1.20627 0.06211 19.420 < 2e-16 ***
## log_inc 0.29218 0.10210 2.862 0.004215 **
## female -0.46170 0.19580 -2.358 0.018372 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1666.84  on 1221  degrees of freedom
## Residual deviance:  709.81  on 1217  degrees of freedom
## AIC: 719.81
##
## Number of Fisher Scoring iterations: 6
2.5 Model Interpretation
• In the summary for logistic regression models:
– 1. The coefficients in the Estimate column show the estimated regression coefficients, i.e., the β̂k's. For example, the coefficient of the pid variable suggests that a unit increase in pid is associated with a 1.206 increase in the logit (the log odds) of the probability of voting for Trump.
– 2. We might also interpret the coefficients in terms of odds by exponentiating them. Hence, the odds of the outcome are predicted to increase by a factor of e^β̂1 for each unit increase in pid. For example, a unit increase in pid is associated with an increase in the odds of voting for Trump by a factor of e^1.206 ≈ 3.341.
– 3. You can calculate the exponentiated regression coefficients of a logistic regression model by using the coef() function to extract the estimated coefficients from the model and then applying the exp() function:
## extract coefficients
l_coef <- coef(l_reg)

## print coefficients and their exponentiated form
rbind(coef = l_coef, exp_coef = exp(l_coef))

## coef     -6.476108455 1.206265 0.2921768 -0.4616960 -0.7601581
## exp_coef  0.001539791 3.340984 1.3393398  0.6302139  0.4675925
2.6 Predicted Probabilities
• It is always a good idea to plot the predicted probabilities (both for yourself and for your readers). In other words, we want to plot how the probability of the outcome changes when we vary a focal variable while fixing the remaining variables at certain values.
• In R, doing this is quite straightforward. We have already created a new dataset for which we want the predictions above (when plotting the predicted probabilities using the LPM). Let us use the exact same dataset again.
## predict probability of voting for Trump (using logit model)
yhat_logit <- cbind(pid = 0:6,
                    predict(l_reg,              # model object is different!
                            newdata = pred_dat, # data for prediction is the same!
                            type = "response")) %>%
  as.data.frame()
• By using the type = "response" option, we obtain the predicted probabilities. However, the standard errors of the predicted probabilities are not available.
• Standard errors are available, however, when we predict the logits by setting se.fit = TRUE and specifying type = "link" in the predict() function. It can be shown that the sampling distribution of the predicted logits follows a Normal distribution in large samples.
– As the predicted logits are Normally distributed in large samples, we can use these estimated standard errors to calculate the 95% confidence intervals of the predicted logits. These intervals have the form CI = logit(π̂i) ± 1.96 × S.E.(logit(π̂i)).
– This gives us the confidence interval for the predicted logits, but we want the 95% CIs for the predicted probabilities. Here we use the fact that the inverse-logit function is strictly increasing and simply apply it to both end-points of the confidence interval. That is, if the 95% CI for the predicted logits is (a, b), then the interval (logit⁻¹(a), logit⁻¹(b)) is the 95% confidence interval for the predicted probabilities.
• In R, we can do this as follows:
# predict the logits and standard errors
pred_logit <- predict(l_reg, newdata = pred_dat,
                      type = "link",  # predict on the logit (link) scale
                      se.fit = TRUE) %>%
  as.data.frame() %>%
  select(fit, se.fit)

# calculate 95% CI for the logits
pred_logit <- pred_logit %>%
  mutate(lwr = fit - 1.96 * se.fit,
         upr = fit + 1.96 * se.fit)

# apply the inverse-logit function to get pred. probs and CIs
pred_p <- pred_logit %>%
  mutate_at(c("fit", "lwr", "upr"), function(a) {1 / (1 + exp(-a))}) %>%
  mutate(pid = pred_dat$pid)
# plot predicted probabilities (save plot in object l_plot)
l_plot <- pred_p %>%
  ggplot(aes(x = pid, y = fit)) +
  geom_line(col = "black") +
  geom_ribbon(aes(ymin = lwr, ymax = upr),
              fill = "grey", alpha = .5, col = NA) +
  scale_y_continuous(name = "Predicted Probability",
                     breaks = seq(0, 1, .25)) +
  scale_x_continuous(name = "Party Identification",
                     breaks = seq(0, 6, 1)) +
  geom_hline(yintercept = c(0, 1), linetype = 2) +
  geom_vline(xintercept = seq(0, 6, 1), linetype = 3, col = "grey") +
  theme_classic() +
  ggtitle("Probability of Voting for Trump by Party ID",
          subtitle = "Results from Logistic Regression Model")

# print plot
print(l_plot)
[Figure: "Probability of Voting for Trump by Party ID" (Logistic Regression Model); x-axis: Party Identification, y-axis: Predicted Probability]
• Notice that all the predictions and the corresponding confidence intervals lie between zero and one, as desired. Furthermore, we see that our model predicts that the probability of voting for Trump is almost zero for Strong Democrats (pid = 0~1) and almost one for Strong Republicans (pid = 5~6).
• This is a much more intuitive presentation of your results (or the meaning of the estimated regression coefficients) than an exponentiated coefficient of 3.341. So, whenever you run these models you should try to plot the predicted probabilities.
Lastly, we can use the ggpubr::ggarrange() function to compare the LPM and the logistic regression model:
ggarrange(lpm_plot, l_plot, nrow = 1)
[Figure: LPM (left) and logistic regression (right) predicted probabilities, plotted side by side]