STAT 614 Week 13: General Strategy for Model Building
Richard Ressler 2019-11-20
1
Learning Outcomes
Develop and apply a General Strategy for variable selection in multiple linear regression.
References: Chapter 12
Seven (+1) Step General Strategy for Variable Selection
1. Identify Objectives and Questions of Interest
2. Screen the available variables: Identify candidate variables
3. Exploratory data analysis: Look for Relationships, Assumptions, Correlations
4. Transformations based on EDA.
5. Fit a rich model and look at residuals. Iterate.
6. If appropriate, use a computer-aided technique to choose a suitable subset of explanatory variables.
7. Proceed with analysis with chosen explanatory variables.
8. Communicate Results clearly
Step 1: Identify Objectives and Questions of Interest
Example 1: Interested in the association between one explanatory variable and one response.
The goal is to determine the association after adjusting for other variables.
Perform variable selection with everything except the explanatory variable of interest.
Once the model is selected, include the variable of interest to test for the association.
Example 2: Just want to fish for associations
Iterate through adding and removing variables, making transformations, and checking residuals until you develop a model with significant terms and no major issues.
p-values and confidence intervals don't have their proper interpretation. This is the same problem as with multiple comparisons: you ran many tests and looked at the data a lot to arrive at the final model.
You generally build a model and tell stories with it.
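The multiple-comparisons problem above can be made concrete with a small calculation. This is a minimal sketch, assuming the tests are independent and every null hypothesis is true; the function name `familywise_error_rate` and the 0.05 level are illustrative choices, not part of the original slides.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests
    when every null hypothesis is true (an idealized assumption)."""
    return 1.0 - (1.0 - alpha) ** num_tests

# The more tests you run while fishing, the more likely a spurious "finding".
for num_tests in (1, 5, 20, 100):
    print(num_tests, round(familywise_error_rate(num_tests), 3))
```

Real model-building involves dependent tests and informal looks at the data, so this calculation only indicates the direction of the problem, not its exact size.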
Example 3: Prediction
Include variables to maximize predictive power; don't worry about interpretation.
Step 2: Screen Available Variables
Choose a list of explanatory variables that are important to the objective.
Screen out redundant variables
Problems with Including Too Few Variables
You are only picking up marginal associations.
E.g., we already know that men make more money than women. We want to see whether men still make more money than women after we control for other variables.
Predictions are less accurate.
Too few variables: Predictions are less accurate
[Figure: Prediction Intervals with X. y plotted against x.]
[Figure: Prediction Intervals without X. y plotted against x.]
[Figure: Prediction Intervals without X. y axis only.]
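The effect of leaving out a useful variable on prediction accuracy can be sketched with a short simulation. Everything here is an assumption made for illustration: the data-generating model (y equals x plus normal noise with standard deviation 0.1), the sample size, and the seed are not taken from the slides. The sketch compares the residual standard deviation of a regression on x with that of an intercept-only model, since prediction intervals are roughly proportional to that quantity.

```python
import random
import statistics

random.seed(614)  # arbitrary seed so the sketch is reproducible
n = 200
x = [random.random() for _ in range(n)]
y = [xi + random.gauss(0, 0.1) for xi in x]  # assumed true model

# Simple linear regression of y on x by least squares.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Residual standard deviation with x in the model (n - 2 degrees of freedom).
rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
sd_with_x = (rss / (n - 2)) ** 0.5

# Without x, the best prediction is the mean of y, so the residual
# standard deviation is just the sample standard deviation of y.
sd_without_x = statistics.stdev(y)

print("residual SD with x:   ", round(sd_with_x, 3))
print("residual SD without x:", round(sd_without_x, 3))
```

A rough 95% prediction interval is the prediction plus or minus about two residual standard deviations (ignoring uncertainty in the estimated coefficients), so the interval without x is correspondingly wider.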
Problems with too many variables
It is harder to estimate more parameters, and the model tends to overfit.
Formally, the variances of the sampling distributions of the coefficients in the model get much larger.
Including highly correlated explanatory variables greatly increases the variance of the sampling distributions of the coefficient estimates.
Intuitively, we are less sure whether the association between Y and X1 is an actual association or is mediated through X2.
Predictions are less accurate.
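The variance increase from correlated explanatory variables can be quantified in the two-variable case. With explanatory variables X1 and X2 whose sample correlation is r, the sampling variance of the X1 coefficient is multiplied by 1 / (1 - r^2) compared with a fit that uses X1 alone (for the same error variance). This factor is commonly called the variance inflation factor; the slides do not name it, and the function name below is illustrative.

```python
def variance_inflation_factor(r):
    """Factor by which the sampling variance of a coefficient grows when a
    second explanatory variable with sample correlation r is added."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must be strictly between -1 and 1")
    return 1.0 / (1.0 - r ** 2)

for r in (0.0, 0.5, 0.9, 0.99):
    vif = variance_inflation_factor(r)
    # The standard error grows by the square root of the factor.
    print(r, round(vif, 2), round(vif ** 0.5, 2))
```

The factor grows without bound as r approaches 1, which is why nearly redundant variables are worth screening out in Step 2.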
Demonstration of adding a highly-correlated variate
[Figure: X1 and X2 are highly correlated. x2 plotted against x1, with legend x2_q (1st Quartile, 2nd Quartile, 3rd Quartile, 4th Quartile).]
Truth
True model: $(Y \mid X_1) = X_1 + \varepsilon$
Fit model: $(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$
Correlation between X1 and X2 is 0.9994352.
We will simulate new values of Y and plot the resulting OLS estimates.
Demonstration: Black is true $\beta_1$
[Figure, five frames: y plotted against x1, with legend x2_q (1st Quartile, 2nd Quartile, 3rd Quartile, 4th Quartile).]
Variability of $\hat{\beta}_1$
[Figure: Beta_1 estimates by Model (x1 versus x1_with_x2), 1000 iterations.]
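A simulation in the spirit of this demonstration can be sketched as follows. The original simulation code is not shown in the slides, so the sample size, noise levels, seed, and the way x2 is built from x1 are all assumptions chosen to give a correlation near the one quoted. The sketch holds x1 and x2 fixed, simulates new values of Y from the true model 1000 times, and compares the spread of the X1 coefficient estimates with and without X2 in the fitted model.

```python
import random
import statistics

random.seed(13)  # arbitrary seed so the sketch is reproducible
n, iterations, noise_sd = 100, 1000, 0.1  # assumed settings

x1 = [random.random() for _ in range(n)]
x2 = [v + random.gauss(0, 0.01) for v in x1]  # nearly a copy of x1

def centered(values):
    mean = statistics.fmean(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Centering removes the intercept from the normal equations.
c1, c2 = centered(x1), centered(x2)
s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
det = s11 * s22 - s12 ** 2

slopes_x1_only, slopes_with_x2 = [], []
for _ in range(iterations):
    y = [v + random.gauss(0, noise_sd) for v in x1]  # true model: Y = X1 + noise
    cy = centered(y)
    s1y, s2y = dot(c1, cy), dot(c2, cy)
    slopes_x1_only.append(s1y / s11)                        # fit with X1 only
    slopes_with_x2.append((s22 * s1y - s12 * s2y) / det)    # fit with X1 and X2

print("correlation(x1, x2):", round(s12 / (s11 * s22) ** 0.5, 4))
print("SD of estimates, x1 only:   ", round(statistics.stdev(slopes_x1_only), 3))
print("SD of estimates, x1 with x2:", round(statistics.stdev(slopes_with_x2), 3))
```

Both sets of estimates are centered near the true value of 1, but the estimates from the model that also includes x2 are far more spread out.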
Steps 3 through 5
3. Exploratory data analysis.
Tons of scatterplots.
Look at correlation coefficients.
4. Transformations based on EDA.
5. Fit a rich model and look at residuals.
Look for curvature, non-constant variance, and outliers.
Iterate the above steps until you don't see any issues.
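As a numeric stand-in for a residual plot, the sketch below fits a straight line to data that are actually curved and then averages the residuals over the low, middle, and high thirds of x. The quadratic data-generating model, noise level, and seed are assumptions made for illustration. A systematic sign pattern in these averages is the kind of curvature that a residual plot would reveal.

```python
import random
import statistics

random.seed(614)  # arbitrary seed so the sketch is reproducible
n = 300
x = [i / (n - 1) for i in range(n)]                 # evenly spaced, already sorted
y = [xi ** 2 + random.gauss(0, 0.05) for xi in x]   # assumed curved truth

# Fit a straight line by least squares.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Average residual in each third of the x range.
third = n // 3
low = statistics.fmean(residuals[:third])
middle = statistics.fmean(residuals[third:2 * third])
high = statistics.fmean(residuals[2 * third:])
print(round(low, 3), round(middle, 3), round(high, 3))
```

Positive averages at both ends with a negative average in the middle suggest that a transformation or a squared term is worth trying before refitting.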
Step 6
If appropriate, use a computer-aided technique to choose a suitable subset of explanatory variables.
Use an F-test if the models are nested.
step() (stepwise selection by AIC in R)
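The F-test for nested models compares how much the residual sum of squares drops when the extra terms are added against the residual variance of the full model. Below is a minimal sketch of that statistic; the function name and the numbers in the example are hypothetical and serve only to show the arithmetic.

```python
def extra_ss_f_statistic(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for a reduced model nested inside a full model.

    rss_* are residual sums of squares and df_* are residual degrees
    of freedom of each fitted model.
    """
    if df_reduced <= df_full:
        raise ValueError("the reduced model must have more residual degrees of freedom")
    extra_ss = rss_reduced - rss_full   # improvement from the extra terms
    extra_df = df_reduced - df_full     # number of extra parameters
    return (extra_ss / extra_df) / (rss_full / df_full)

# Hypothetical values, for illustration only.
f_stat = extra_ss_f_statistic(rss_reduced=120.0, df_reduced=47,
                              rss_full=100.0, df_full=45)
print(round(f_stat, 2))
```

Under the null hypothesis that the extra coefficients are all zero, the statistic is compared with an F distribution on (extra_df, df_full) degrees of freedom; the p-value is left out here because the Python standard library has no F distribution.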
Steps 7 and 8
7. Proceed with analysis with chosen explanatory variables.
8. Tell stories with the data using p-values, coefficient estimates, confidence intervals, etc.