Preliminary

On the Moodle page of the course, in the Workshops section you will find a directory called Datasets & Source Code. The
directory contains the HMEQ dataset (both in its original form and after the missing values for the categorical variables
REASON and JOB have been identified). It also contains a file with functions that I designed for these workshops,
my_Rfunctions.R. This file is updated weekly, so download the most recent version every week.

1. Download the datasets into the R directory you are using for the course

2. Set the working directory in RStudio to that directory

3. Source the file my_Rfunctions.R by typing:

> source("my_Rfunctions.R")

This command will load every required package. If a required package is not available it will download and install
it. If the command terminates with an error please let me know.

4. Load the dataset:

> Data <- read.csv("clean_hmeq.csv")

We next assign observations to the training and test set. To ensure reproducibility, set the seed to 1 as in previous workshops.

> set.seed(1)

> train <- sample(1:nrow(Data), floor(0.7*nrow(Data)))
> Data.train <- Data[train,]
> Data.test <- Data[-train,]

Identifying Missing Values

Missing values pose a challenging problem in real-world datasets. If the proportion of missing data is very small, then excluding the few samples with missing features may be the best strategy in order not to bias the analysis, but this is true only if the missing values are Missing Completely at Random (MCAR). While some quick fixes such as mean substitution may yield decent performance in some cases, such simple approaches introduce bias into the data. For instance, applying mean substitution leaves the mean unchanged (which is desirable) but decreases the variance, which may be undesirable. Furthermore, using mean substitution implicitly assumes that the missing values are MCAR. This may or may not be correct, but in any case you need to be aware of what you are doing and its implications.

It is important to understand that whenever we substitute (also called impute) missing values we are inserting inaccurate information into the dataset, since the values we insert are almost certain not to be identical to the real values, which are not available.

To estimate the proportion of missing values for each variable in the dataset you can use the pMiss() function (defined in the file my_Rfunctions.R). To measure the proportion of missing values for a variable, call the function with the appropriate argument:

> pMiss(Data$NINQ)

> pMiss(Data$CLNO)
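The pMiss() function is defined in my_Rfunctions.R, so its exact implementation may differ, but a minimal sketch of a function that does the same job (the fraction of NA entries in a vector) would be:

# Hypothetical sketch, not the actual function from my_Rfunctions.R
> pMiss_sketch <- function(x) { mean(is.na(x)) }
> pMiss_sketch(c(1, NA, 3, NA))   # returns 0.5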

It is clearly inefficient for us to manually call a function for each of the variables of a data frame. Automating processes
is something that computers are really good at, so we should seek a way to do this automatically. One way would be to
write a for loop.

1. First we define a numerical vector of length equal to the number of columns of Data. This vector will store the
proportion of missing values that we will compute in the next step.

# vector with numeric elements of length ncol(Data)
> v = vector(length=ncol(Data), mode = "numeric")


2. Then we use a for loop, to iterate over all the variables (columns) in the dataset and estimate the proportion of
missing values for each.

# Execute pMiss function for each column, Data[,i]
> for (i in 1:ncol(Data)) { v[i] = pMiss(Data[,i]) }
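To keep track of which proportion corresponds to which variable, you can attach the column names to the vector (optional, just for readability):

# Label each element with the corresponding variable name
> names(v) <- names(Data)
> v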

Instead of using a for loop as before it is possible to use the R function apply(). This function applies to a data frame
(first argument) a function (third argument) either along the rows (when the second argument=1) or along the columns
(when the second argument=2).

> v = apply(Data,2,pMiss)
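A small variation that is often convenient: sapply() iterates over the columns of a data frame and returns a named vector, so the variable names are attached to the output automatically.

# Same computation as apply(Data, 2, pMiss), but with names attached
> v <- sapply(Data, pMiss)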

To find the number of observations containing a missing value you can use either a for loop or apply, as follows:

1. Note that we now need to iterate over the rows of Data, and not the columns.

> count=0;

> for (i in 1:nrow(Data)) { count = count + anyNA(Data[i,]) }

2. The command below computes the number of observations with missing values by summing the output from the
apply() function.

> sum( apply(Data,1,anyNA) )

To get a proportion we only need to divide by the number of observations = number of rows

> sum( apply(Data,1,anyNA) ) / nrow(Data)
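For reference, base R also offers complete.cases(), which flags the rows that contain no missing values, so the same proportion can be obtained as:

# Proportion of rows with at least one missing value
> sum( !complete.cases(Data) ) / nrow(Data)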

Create Indicator Variable for Missing Values

As discussed in the lectures, to account for non-ignorable missing values, and in particular for the case when the
presence of a missing value is informative about the class to which an observation belongs (recall what we discussed
in previous workshops about missing values for the DEBTINC variable), it is recommended to introduce a binary variable
that indicates whether a value is missing or not. The function na_indicator() is provided for this reason. It
takes as input a data frame and returns a data frame containing all the original variables, as well as an indicator (binary)
variable for each of the variables that contain missing values. The name of each indicator variable is M_{original variable
name}.

> Data.ind <- na_indicator(Data)

Inspect the new data frame to understand what the above function did. For instance,

> str(Data.ind)
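Since na_indicator() is defined in my_Rfunctions.R, its implementation may differ from this, but a minimal sketch of a function with the behaviour described above (one binary M_ column per variable that contains missing values) could look like:

# Hypothetical sketch, not the actual function from my_Rfunctions.R
> na_indicator_sketch <- function(df) {
+   for (v in names(df)[sapply(df, anyNA)]) df[[paste0("M_", v)]] <- as.integer(is.na(df[[v]]))
+   df
+ }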

Multiple Imputation of Missing Values

The mice package provides advanced methods to impute (i.e. replace) missing values with plausible values using a rich
variety of methods. MICE is designed to accommodate the case of Missing at Random (MAR), which means that the
probability that a value is missing depends on other independent variables, and therefore that one can attempt to predict
the missing values of a variable using the other variables. It imputes data on a variable by variable basis by specifying
an imputation model per variable. The default is to use the same imputation method for all the variables with missing
values.

# Library is loaded automatically when you source the file my_Rfunctions.R
> library(mice)

# Enables a better understanding of the pattern of missing data
> md.pattern(Data)

The output of this function (see Fig. 6) is interpreted as:

3364 observations have no missing values


Figure 6: Output from md.pattern()

166 observations have missing values for MORTDUE only

3 observations have missing values for both VALUE and MORTDUE

In general, the number of missing values appears in the last column of the output, the variables for which missing
values occur are indicated with a 0 in the corresponding row, and the associated number of observations is the first
element in each row.

The final row reports the number of observations that have a missing value for each variable. In our case the table
is so large that only part of it is printed, so you can't see the last row. To overcome this, try:

# The tail() function prints the last lines of output
> tail(md.pattern(Data))

A very informative visualisation of the presence and co-occurrence of missing values is offered by the aggr() function
(which requires the VIM package).

# This should be loaded when you source the file my_Rfunctions.R
> library(VIM)

> mice_plot <- aggr(Data, numbers=TRUE, sortVars=TRUE)

Missing Value Imputation

To impute missing values use the mice() function:

# The parameters specified are the default values (except for m)
> Data.imp <- mice(Data, m=1, method = "pmm", seed = 10)

Let's explain the parameters used in mice():

m = the number of imputed datasets that will be created. Here we will use 1.

method = the imputation method used. Here we chose Predictive Mean Matching (pmm), which is the recommended (default) method in the package. There are many other available methods in the package:

# Returns list of imputation methods available
> methods(mice)

More detailed information about what each method does can be found in the documentation of this function:

> ?mice

seed = sets the random number seed to a given number to ensure reproducibility of results. That is, if you execute
the same command at any other point using the same seed you will get exactly the same predictions for the missing
values.

After an attempt to impute the missing values it is useful to inspect the distribution of missing values and compare it
with the distribution of the non-missing observations. You can construct scatterplots using xyplot() and density plots
using densityplot(), from the lattice package.

# Should be loaded after you source the file my_Rfunctions.R
> library(lattice)

In the scatterplot shown in Fig. 7 I set the type of points that I would like to use, pch=18, and also set a transparency
level, alpha=0.4, to enable me to distinguish more easily which observations are imputed (purple) and which are not
(blue). (Values of alpha close to zero make points maximally transparent, while a value of 1 makes the points solid. This
is a useful option when you make scatterplots of large datasets, in which case if you don't use transparency the entire
scatterplot is black.)

# Scatterplot of imputed data
> xyplot(Data.imp, DEBTINC ~ MORTDUE + VALUE, pch=18, alpha = 0.4, auto.key=TRUE)


Figure 7: Indicative scatterplot of observed and imputed values for DEBTINC against MORTDUE and VALUE

We want the distribution of the imputed points (purple) to be similar to that of the observed data (blue). In this case,
the imputed values can be considered plausible. The density plot in Fig. 8 is obtained through:

> densityplot(Data.imp)
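If you want to focus on a single variable rather than the full panel, the lattice methods provided by mice also accept a formula; for example, the following should plot only DEBTINC:

# Density of observed vs imputed values for a single variable
> densityplot(Data.imp, ~ DEBTINC)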

You can assess the difference between imputing missing values using a model and taking the default approach of
using the mean or the mode by comparing the distributions of the imputed values from the two approaches. It is very important to understand
that when you are imputing using the mean, the mode, or in any way that doesn't involve the other variables, you are
effectively assuming that values are Missing Completely At Random (and not Missing At Random), because in this case
there is no attempt to predict the missing values using the other variables.

# Replace missing values using the mean
> Data.imp2 <- mice(Data, m=1, method = "mean", seed = 10)

You will get a warning saying that the mean can't be used to replace categorical variables. This makes sense, and if you want to do this correctly you need to specify that these variables should be replaced using the mode. (Recall that you can set the method used to impute each variable with missing values separately.) Use the previous plots to compare the distribution of imputed values with that of the observed.

Figure 8: Distribution of imputed and existing values

As discussed in the lectures, using a constant to replace missing values is not a recommended approach. So we will use the original multiple imputation results (Data.imp) to replace missing values. This is done through the complete() function.

# Missing values have been replaced with the imputed values in the first dataset
> Data.comp <- complete(Data.imp,1)

Verify that there are no missing values in the new dataset, Data.comp:

> apply(Data.comp,2,anyNA)
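For completeness, here is one way to set a different imputation method per variable; the specific choices below (pmm for a numeric variable, polyreg for a multi-level factor) are purely illustrative:

# Dry run (no iterations) just to obtain the default method vector, one entry per column
> ini <- mice(Data, maxit = 0)
> meth <- ini$method
> meth["MORTDUE"] <- "pmm"    # numeric variable: predictive mean matching
> meth["JOB"] <- "polyreg"    # factor with several levels: polytomous regression
> Data.imp3 <- mice(Data, m = 1, method = meth, seed = 10)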

To combine missing value imputation with the creation of the missing value indicator variables, we can either repeat all
the above steps giving as input to the mice() function the data frame Data.ind (which contains the missing value
indicators), or we can simply add the indicator variables from Data.ind as additional columns to the completed data frame Data.comp. The latter
is a lot easier, and can be achieved using the function cbind(), which binds data frames column-wise.

First verify that the missing value indicator variables are stored in columns 14 to 24 in data frame Data.ind.

> str(Data.ind)

> names(Data.ind)

> names(Data.ind)[14:24]

Next bind together the data frame Data.comp (containing the dataset with imputed values) and the columns with missing
value indicators from Data.ind

> Data.all <- cbind(Data.comp, Data.ind[ ,14:24])

Make sure that everything is correct by inspecting the last data frame you created.

> str(Data.all)

> apply(Data.all, 2, anyNA)

To avoid having to repeat all the above steps (even if they are in a script) every time you want to analyse/mine this
dataset you can store the complete dataset in a new CSV file. The function write.csv() is the analogue of read.csv()
for storing data frames in comma-separated format.

> write.csv(Data.all, file = "comp_hmeq.csv", row.names = FALSE)

The option row.names=FALSE causes R not to store the row names. Since our observations don't have names, storing row
names would create an additional column containing only the row number of each observation.
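In a later session you can then pick up from this point directly, without repeating the imputation:

# Hypothetical follow-up session: reload the completed dataset
> Data <- read.csv("comp_hmeq.csv")
> str(Data)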

Logistic Regression Model and Variable Selection

From now on whenever we build logistic regression models we will be using the dataset with imputed missing values and
missing value indicator variables. I therefore overwrite the previous data frame Data with the data frame I just created.
(You can always recover the original dataset by loading the data in the file clean_hmeq.csv).
> Data <- Data.all
> Data.train <- Data[train,]
> Data.test <- Data[-train,]

Stepwise Variable Selection using Akaike's Information Criterion

Stepwise selection starts off with a model containing only the intercept. At each step the variable that most improves the chosen model selection criterion is added. After introducing the new variable the method considers which, if any, of the previously inserted variables is no longer contributing significantly to the model selection criterion. If such a variable is found it is removed from the model. Therefore each step of stepwise selection performs one step of forward selection, and one step of backward selection.

In logistic regression we will use Akaike's Information Criterion (AIC) to attempt to strike the optimal balance between model complexity and model fit (on the training set!). AIC is defined as

AIC = 2k - 2 ln(L),

where k is the number of explanatory variables (not coefficients) in the model, and L is the value of the likelihood function for the model (hence ln(L) is the log-likelihood). The best model minimises the AIC.

Stepwise selection is implemented by the function step():

# Logistic regression with no predictors, only intercept
> reg0 <- glm(BAD ~ 1, data = Data.train, family = binomial)

# Logistic regression containing all variables in Data
# (note: . is shorthand notation for all variables in the data frame except the target)
> reg1 <- glm(BAD ~ ., data = Data.train, family = binomial)

Calling the step() function requires the specification of:

1. Initial model: reg0

2. scope: a formula containing the full range of variables that are considered

3. direction: specifies whether we want forward, backward or both (which means stepwise) selection

> LRStep <- step(reg0, scope=formula(reg1), direction="both")

To see the variables in the final model, and summary information, do:

> formula(LRStep)

> summary(LRStep)
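As a quick sanity check you can report the AIC of the selected model directly; extractAIC() is the function that step() uses internally and returns the equivalent degrees of freedom together with the criterion value:

# AIC of the model chosen by stepwise selection
> AIC(LRStep)
> extractAIC(LRStep)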

The resulting model includes a large number of variables as shown in Fig. 9.

The order in which the variables were inserted into the model is reflected in the order in which the coefficients appear in
the summary table (see Fig. 9). You can verify this by inspecting the ANOVA table, which also reports how AIC changed
at each step:

> LRStep$anova

Figure 9: Summary of model selected through stepwise selection using AIC

Note from the output of the summary() function that there are five coefficients associated with the variable JOB. This is
because JOB is a nominal variable (factor) that takes 6 different values, and hence 5 binary variables are used to encode it.
For such variables it is not necessary that the p-values associated with the estimated coefficients of all the binary variables
are close to zero for the variable to be included. In other words, if one or more of the binary variables corresponding to a categorical variable are
included, then the rest will be included automatically, irrespective of whether they are significant.
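To see the encoding explicitly you can inspect the dummy variables that R creates for JOB; model.matrix() is base R, so this works regardless of which modelling function you use (if JOB was read in as character rather than factor, R converts it on the fly):

# The distinct values of JOB and the corresponding dummy (binary) columns
> unique(Data.train$JOB)
> head( model.matrix(~ JOB, data = Data.train) )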

Although in the workshop we will continue to discuss other aspects of model building, when you are doing this to
actually solve a problem this is a stage at which you need to think very carefully about what your findings imply for the
task you are trying to solve. Recall that the order in which variables are included in the model is an indication of relative
importance.

The Bayesian Information Criterion (BIC) is similar to AIC, with the only difference being in the multiplier of the term k. In
particular,

BIC = ln(n)k - 2 ln(L),

where n is the number of observations used to estimate the model. Since for moderate sample sizes ln(n) > 2, BIC
typically yields more parsimonious (i.e. smaller) models. To perform variable selection with this criterion you need to
alter only one setting in the step() function.

> LRStep2 <- step(reg0, scope=formula(reg1), direction="both", k=log(nrow(Data.train)))

Use BIC to select variables and compare the results to those obtained using AIC for model selection. Using the material from the previous workshops you should be able to assess their performance (on either the training or test set) using ROC curves as well as the area under the ROC curve (AUC) criterion. We will not repeat this here.

Cross Validation

Next we consider K-fold cross-validation (CV) for model selection. K-fold cross-validation first splits the training set into K non-overlapping and exhaustive partitions (called folds). Each partition is used as a test set for a model that is estimated using the remaining (K - 1) partitions as a training set. By repeating this process K times, each time using a different fold for assessing performance, every fold is used as a test set once, and consequently each observation is used once to assess performance and K - 1 times to estimate model parameters. An estimate of the generalisation performance is provided by the average performance over the K repetitions.

We will use the typical setting K = 10. To perform 10-fold cross-validation I need to assign each observation from the training set to one of 10 folds. To this end, I will create a vector my_folds that will contain the index of the fold to which each observation is assigned. Since I will use 10 folds the index will be an integer between 1 and 10, and these integers must be allocated randomly to the different observations. I will use the sample() function as I did when I selected the observations to be allocated to the training set. This time, however, I want to sample numbers from 1 to 10, as many times as there are observations in the training set, and because there will clearly be more than one observation allocated to the same fold I have to perform sampling with replacement.

# To ensure reproducibility
> set.seed(2)

# Sample with replacement nrow(Data.train) integers in range [1,10]
> my_folds <- sample(1:10, nrow(Data.train), replace = TRUE)

Typically the misclassification error is used as a performance measure in CV, but this is not necessary, and any performance measure that is deemed appropriate can be used instead. Because in our case the costs of different types of errors are not symmetric, and moreover one class is much more frequent than the other, using the misclassification error is not advisable. Instead we will use the Area Under the ROC curve (AUC). The function to perform cross-validation using the AUC is called crossValid() and is defined in the file my_Rfunctions.R. The crossValid() function takes the following arguments:

1. Data: a data frame containing all the variables in consideration

2. y: name of the target variable (in our case BAD). The default is to consider the first column as the target.

3. folds: allocation of rows to each of the K folds. This option is useful if you want to reproduce results by using the same cross-validation partitions. The default option is for this to be empty (so that the function specifies the partitions).

4. K: instead of specifying the folds you can specify the number of partitions to be used. (The default value is 10.)

5. nvar: maximum number of terms to include in the model. (The default behaviour is to consider all variables in Data.)

We next call the main function:

# Call cross validation with pre-specified folds
> out <- crossValid(Data.train, y="BAD", folds = my_folds)

The function prints at each stage the number of estimated coefficients, the variable added at each step, and the average AUC over the K folds (see Fig. 10). Notice that CV is much more time-consuming than the information criteria you used previously.

Figure 10: Output from crossValid()

In addition to estimating the AUC for models with different numbers of parameters, crossValid() also provides a recommendation for the best model. It is important to understand that the proposed, best, model is not the model with the highest average AUC. To improve the stability of cross-validation it has been argued that the recommended (best) model should be the model with the smallest number of predictors that achieves an average performance within one standard deviation of the model with the highest average AUC. In particular, if m* is the highest average AUC achieved by any model, and s* is the standard deviation of the AUC of the model achieving m*, then the recommended model is the one with the fewest parameters such that its average AUC is greater than or equal to m* - s*. This approach is known as the 1 Standard Deviation Rule, and is used to improve the stability of CV.

The output of the crossValid() function is a list containing the following:

1. BestModel: the best model identified using the one-standard-deviation rule just described.

2. Perf: a matrix containing the AUC on each fold for the best model at each step.

3. Models: the specification of the best model at each step.

To inspect the recommended model we can use the summary() function (see Fig. 11). Note that the order in which the variables appear is the order in which they were inserted in the model. The same holds for the output of the coef() function below.

> summary(out$BestModel)

# Coefficients of best model
> coef(out$BestModel)
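The one-standard-deviation rule described above can also be reproduced by hand from out$Perf. The sketch below assumes that Perf has one row per fold and one column per model size; check dim(out$Perf) first, and transpose if the orientation is the other way around.

# Average AUC per model size and the 1-SD threshold (assumed orientation: folds x model sizes)
> avg <- colMeans(out$Perf)
> best <- which.max(avg)
> threshold <- avg[best] - sd(out$Perf[, best])
> which(avg >= threshold)[1]   # smallest model clearing the threshold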

Rather than looking at the recommended model only, it is instructive to inspect how performance is affected as the
model size increases. To this end, use the function plotCVauc(), also contained in my_Rfunctions.R. The input for this
function must be the output of the crossValid() function.

> plotCVauc(out)

The horizontal axis of the resulting figure (see Fig. 12) is the number of predictors (independent variables) in each model
(0 refers to the model containing only the intercept). The vertical axis reports AUC: The dot for each number of predictors
represents the average AUC over the K folds, while the blue line represents one standard deviation from the mean. The
selected model is indicated with a vertical line, and as we saw previously it contains 8 predictors. The horizontal line is
the performance threshold given by the highest average AUC attainable minus one standard deviation. In this case the
highest average AUC is achieved by a model containing 19 predictors.

Fig. 12 indicates that the first few predictors increase the AUC very substantially but further additions induce very
marginal improvement. Indeed, with more than 8 or 10 predictors the difference in average AUC is extremely small.
Although barely noticeable, average performance actually deteriorates as we approach the maximum number of predictors.

Cross-validation can be sensitive to the specific partition used. You can repeat this experiment changing the allocation
of observations to folds and see what happens. You can also try different values for K and assess whether variability
increases or not.
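For example, to repeat the exercise with a different random allocation to folds, or with K = 5 instead of 10 (using the K argument of crossValid() described above), you could try:

# Different fold allocation, same K
> set.seed(3)
> my_folds2 <- sample(1:10, nrow(Data.train), replace = TRUE)
> out2 <- crossValid(Data.train, y="BAD", folds = my_folds2)

# Fewer folds: let crossValid() create the partitions itself
> out5 <- crossValid(Data.train, y="BAD", K = 5)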

Figure 11: Summary of best model from cross validation (order of coefficients reflects order of inclusion in model)

Figure 12: AUC of models selected through 10-fold cross validation.
