Department of Industrial Engineering & Operations Research
IEOR160 Operations Research I
Project Due Date: 05/01/2017
Optimization is an essential part of statistics and data analysis. In this project you will use nonlinear
and integer optimization to solve a best-subset selection problem, common in regression analysis,
applied to diabetes progression.
Problem description
For 442 diabetes patients ten baseline variables were collected: age, sex, body mass index (BMI),
average blood pressure (BP), and six blood serum measurements. Additionally, for each patient, a
quantitative measure of disease progression one year after the baseline was also collected. The data
can be found at http://web.stanford.edu/~hastie/Papers/LARS/diabetes.data. Standardized
data is available at http://web.stanford.edu/~hastie/Papers/LARS/diabetes.sdata.txt.
You are asked to construct a regression model to predict the disease progression from the baseline
observations. The model should be able to accurately predict disease progression for future patients,
and also indicate which independent variables are important factors in disease progression.
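As a minimal sketch of the data handling used throughout the project, the snippet below reads a local copy of the data and separates the first 250 patients from the remaining ones. The function name `load_and_split` is hypothetical, and the file layout (whitespace-delimited, one header row, response in the last column) is an assumption that should be checked against the downloaded file.

```python
import io
import numpy as np

def load_and_split(source, n_train=250):
    """Read a whitespace-delimited table with one header row.

    Assumption: the last column is the response y and all other
    columns are the baseline variables.
    Returns (X_train, y_train, X_test, y_test).
    """
    data = np.loadtxt(source, skiprows=1)
    X, y = data[:, :-1], data[:, -1]
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

# Tiny in-memory stand-in for the real file (made-up values).
demo = io.StringIO("A B Y\n1 2 3\n4 5 6\n7 8 9\n")
X_tr, y_tr, X_te, y_te = load_and_split(demo, n_train=2)
```

With the real file, `source` would be the path of the saved copy and `n_train` would stay at its default of 250.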
Part 1
Assuming that the relationship between the dependent variable $y$ (measure of disease progression)
and the independent variables $x_1, \dots, x_{10}$ is linear, i.e.,
$$y = \sum_{j=1}^{10} \beta_j x_j + \epsilon,$$
where $\epsilon$ is a normally distributed error term, regression coefficients were fitted with the
least-squares approach (using the first 250 data points), and the following results were observed.
Independent var.   Coefficient   Std. error   t-stat   p-value
Age                      -59.6         80.4     -0.7     0.459
Sex                     -241.6         84.6     -2.9     0.005
BMI                      535.1         95.0      5.6     0.000
BP                       241.7         91.7      2.6     0.009
S1                      -844.9        627.7     -1.3     0.180
S2                       407.4        525.2      0.8     0.439
S3                      -224.3        311.0     -0.7     0.471
S4                       285.2        221.0      1.3     0.198
S5                       762.4        243.8      3.1     0.002
S6                       169.6         87.1      1.9     0.053
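The columns of the table above can be reproduced from a least-squares fit. The following is a sketch using NumPy only, run on made-up data rather than the diabetes set; the function name `ols_summary` and the optional intercept column are assumptions added for illustration.

```python
import numpy as np

def ols_summary(X, y, intercept=True):
    """Least-squares fit returning coefficients, standard errors and t-statistics."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if intercept:
        # Assumption: an intercept column is prepended to the design matrix.
        X = np.column_stack([np.ones(len(y)), X])
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)        # unbiased estimate of the error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)   # covariance matrix of the estimates
    se = np.sqrt(np.diag(cov))
    return beta, se, beta / se

# Synthetic check (made-up data, not the diabetes set).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(250, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=250)
beta, se, t = ols_summary(X_demo, y_demo)
```

A two-sided p-value would follow from the Student t distribution with n - p degrees of freedom; it is left out here to keep the sketch free of additional dependencies.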
In order to increase the predictive value and interpretability of the regression coefficients, doctors
would like to have a model that only uses four independent variables.
For questions (a)-(d), use only the first 250 patients in the dataset.
(a) Based on the p-values of the regression coefficients given above, which four independent
variables seem most important?
(b) In order to find the best combination of four independent variables, a possible method
(implemented in most statistical software) is to fit a regression for all possible combinations of four
variables, and choose the best one (e.g., with the largest $R^2$). In this case, it involves fitting
$\binom{10}{4} = 210$ regressions. Alternatively, heuristics are also used (e.g., stepwise selection). In
this question, you need to use a simple heuristic: using only the independent variables
corresponding to Sex, BMI, BP, S5 and S6, fit a regression for all possible subsets of four variables
and select the one with the best $R^2$ value. What variables are used? What are the regression
coefficients? What is $R^2$?
(c) Use Lasso with regularization parameter $\lambda \in \{200, 220, 240, 260, 280, 300, 320, 340, 360, 380, 400\}$.
Which value of $\lambda$ would you use? What are the corresponding regression coefficients? What
is the value of $R^2$?
(d) Use mixed-integer optimization to find the model with the best $R^2$ value that uses four independent
variables (you may assume that there exists an optimal solution where $|\beta_j| \le 1000$ for
$j = 1, \dots, 10$). What are the regression coefficients? What is the value of $R^2$?
(e) We now want to test the out-of-sample accuracy of the five methods used above. Using the
regressions obtained in parts (b)-(d) and the regression coefficients presented in the table, which
method gives the best prediction for the remaining patients in the dataset (i.e., patients
251 to 442)?
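One common way to pose the subset restriction of question (d) is a big-M formulation. The following is a sketch under that assumption and not necessarily the formulation intended for the project. Here $\beta_j$ is the regression coefficient of $x_j$, $x_{ij}$ and $y_i$ are the observations for patient $i$, and the binary variables $z_j$ and the constant $M$ are introduced for illustration, with $M = 1000$ taken from the bound stated in the question.

```latex
\begin{aligned}
\min_{\beta,\, z} \quad & \sum_{i=1}^{250} \Bigl( y_i - \sum_{j=1}^{10} \beta_j x_{ij} \Bigr)^2 \\
\text{s.t.} \quad & -M z_j \le \beta_j \le M z_j, && j = 1, \dots, 10, \\
& \sum_{j=1}^{10} z_j \le 4, \\
& z_j \in \{0, 1\}, && j = 1, \dots, 10.
\end{aligned}
```

Setting $z_j = 0$ forces $\beta_j = 0$, so at most four coefficients can be nonzero. Since $R^2 = 1 - \mathrm{SSE}/\mathrm{SST}$ and $\mathrm{SST}$ does not depend on $\beta$, minimizing the sum of squared errors on a fixed dataset yields the model with the largest $R^2$.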
Part 2
Now suppose the regression model includes all second order interactions of the independent variables:
$$y = \sum_{j=1}^{10} \beta_j x_j + \sum_{j=1}^{10} \sum_{k=j}^{10} x_j x_k \beta_{jk} + \epsilon.$$
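The second-order model above has 10 linear terms plus one product term for every pair of indices with j <= k, which gives 55 products and 65 columns in total. A minimal sketch of building that design matrix follows; the function name `add_second_order_terms` and the demo matrix are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement

def add_second_order_terms(X):
    """Append the products x_j * x_k for all j <= k to the columns of X."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    # Index pairs (j, k) with j <= k, zero-based, in lexicographic order.
    pairs = list(combinations_with_replacement(range(p), 2))
    products = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    return np.hstack([X, products]), pairs

# Made-up 2 x 10 matrix standing in for the baseline variables.
X_demo = np.arange(20, dtype=float).reshape(2, 10)
X_full, pairs = add_second_order_terms(X_demo)
```

Returning `pairs` alongside the matrix makes it possible to map a selected column back to the pair of original variables it came from.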
We are now interested in methods that use at most 10 independent variables (out of 65). Observe
that in this case, approaches that enumerate all possible subsets need to compute
$\binom{65}{10} \approx 1.8 \times 10^{11}$ regressions.
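To make the growth in the number of subsets concrete, the sketch below counts the subsets for both parts and shows what exhaustive enumeration looks like on a small made-up problem. The helper names `r_squared` and `best_subset` are hypothetical, and the fit has no intercept, matching the model as written in the text.

```python
from itertools import combinations
from math import comb

import numpy as np

def r_squared(X, y, beta):
    """R^2 = 1 - SSE / SST, with SST measured around the mean of y."""
    resid = y - X @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def best_subset(X, y, k):
    """Fit every size-k subset of columns; return (best R^2, columns, coefficients)."""
    best = (-np.inf, None, None)
    for cols in combinations(range(X.shape[1]), k):
        Xs = X[:, list(cols)]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        score = r_squared(Xs, y, beta)
        if score > best[0]:
            best = (score, cols, beta)
    return best

n_small = comb(10, 4)    # number of four-variable subsets out of 10
n_large = comb(65, 10)   # number of ten-variable subsets out of 65

# Made-up data in which only columns 0 and 3 matter.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=80)
score, cols, beta = best_subset(X_demo, y_demo, k=2)
```

Enumeration is practical for a few hundred subsets, but the same loop over `n_large` subsets is out of reach, which motivates Lasso and mixed-integer optimization.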
For questions (f) and (g), use only the first 250 data points.
(f) Use Lasso to find regression coefficients that satisfy the restriction of having only 10
independent variables. Which value of the regularization parameter did you use? Why? What are the
regression coefficients? What is $R^2$?
(g) Use mixed-integer optimization to find the model with the best $R^2$ value that uses 10
independent variables (you may assume that there exists an optimal solution where $|\beta_j| \le 1000$ for
$j = 1, \dots, 10$ and $|\beta_{jk}| \le 5000$ for $j = 1, \dots, 10$, $k = j, \dots, 10$). What are the regression
coefficients? What is $R^2$?
(h) What is the out-of-sample accuracy of the methods used in parts (f)-(g)?
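For the Lasso questions and the out-of-sample comparison, the following is a minimal coordinate-descent sketch on made-up data. It assumes the objective is the sum of squared errors plus `lam` times the sum of absolute coefficients, with no intercept; the scaling of the regularization parameter used in the course or in a given software package may differ. The names `lasso_cd`, `soft_threshold` and `r_squared` are hypothetical.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a toward zero by t, returning 0 when |a| <= t."""
    return np.sign(a) * max(abs(a) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Coordinate descent for  min ||y - X b||^2 + lam * sum_j |b_j|  (assumed scaling)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    resid = y.astype(float).copy()       # residual y - X beta for beta = 0
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]   # partial residual excluding column j
            rho = X[:, j] @ resid
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * beta[j]   # full residual with the updated beta_j
    return beta

def r_squared(X, y, beta):
    """R^2 on any dataset; SST is measured around the mean of the y passed in."""
    resid = y - X @ beta
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

# Made-up data: only columns 1 and 5 carry signal.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(120, 8))
y_demo = 4.0 * X_demo[:, 1] - 3.0 * X_demo[:, 5] + rng.normal(scale=0.2, size=120)
X_tr, y_tr, X_te, y_te = X_demo[:80], y_demo[:80], X_demo[80:], y_demo[80:]

for lam in (1.0, 50.0, 200.0):
    b = lasso_cd(X_tr, y_tr, lam)
    print(lam, int(np.count_nonzero(b)), round(r_squared(X_te, y_te, b), 3))
```

Scanning a grid of `lam` values and recording the number of nonzero coefficients is one way to look for a value that leaves exactly 10 variables in the model; calling `r_squared` on the held-out rows gives an out-of-sample measure for each candidate.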