hw6
CS/INFO 3300; INFO 5100
Homework 6
Due 11:59pm Friday April 28
Linear Regression, and the intuitive meaning of the t-test
or
If I secretly randomized some aspect of your data set, would you notice?
Simple 2D linear regression takes, as input, an array of pairs of variables. Each pair consists
of an input x and an output y. The model says that there is a linear relationship between
the input and output: each time we increase the input by 1.0, we expect the output to
change by slope. But how can we be sure there really is a relationship? Or in other words,
how can we be sure that slope isn't 0.0? The linear regression function will always give us a
value, but can we trust it?
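As a point of reference, a least-squares fit for this kind of input can be sketched as below. This is only an illustration: the helper name fitLine and the representation of each pair as a two-element [x, y] array are assumptions, and the regression code supplied in hw6.html may be organized differently.

```javascript
// Hypothetical helper (not taken from hw6.html): ordinary least squares
// for an array of [x, y] pairs. Returns the fitted slope and intercept.
function fitLine(points) {
  const n = points.length;
  let sumX = 0;
  let sumY = 0;
  for (const [x, y] of points) {
    sumX += x;
    sumY += y;
  }
  const meanX = sumX / n;
  const meanY = sumY / n;

  // slope = covariance(x, y) / variance(x)
  let sxy = 0;
  let sxx = 0;
  for (const [x, y] of points) {
    sxy += (x - meanX) * (y - meanY);
    sxx += (x - meanX) * (x - meanX);
  }
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  return { slope, intercept };
}

// Points that lie exactly on y = 2x + 1 give slope 2 and intercept 1.
const exampleFit = fitLine([[0, 1], [1, 3], [2, 5], [3, 7]]);
console.log(exampleFit.slope, exampleFit.intercept);
```

Note that the function returns a slope for any input with at least two distinct x values, whether or not the data contain a real relationship, which is exactly why a significance test is needed.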
This question is what statisticians mean when they ask whether a value is significant. One
way to evaluate the significance of a slope parameter is to use a t-test. We calculate a
function of the absolute value of the slope, and pass that value to a piece of code that
calculates the tail probability of a Student's t distribution. The result is your p-value. I've
included code used to calculate that p-value. We talked about p-values in class. You may
have seen them before. When you run a linear regression in R or SAS, for each parameter
(i.e. slope, intercept) you get a number that is between 0 and 1. Values less than 0.05 are
supposed to be good. But what do these numbers really mean?
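For reference, the "function of the slope" used in a t-test is usually written in the standard textbook form below. The symbols are assumptions introduced here for illustration: \hat{\beta}_1 is the fitted slope, \hat{y}_i is the fitted value for point i, \bar{x} is the mean of the x values, and n is the number of points. The provided pValue() code may arrange the computation differently.

```latex
t = \frac{\hat{\beta}_1}{\operatorname{SE}(\hat{\beta}_1)},
\qquad
\operatorname{SE}(\hat{\beta}_1)
  = \sqrt{\frac{\frac{1}{n-2}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
               {\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}}
```

The p-value is then the probability that a Student's t variable with n - 2 degrees of freedom falls beyond |t|. A steep slope relative to the scatter of the points around the line gives a large |t| and therefore a small tail probability.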
P-values often seem mysterious. The way they are calculated is, as you can see from the
pValue() code, incredibly confusing. You will compare the p-value you get from this code
to the value we get using a permutation test like we did in class. You will edit code in the file
hw6.html. When you open it and hit the Run button, the script will generate 10 points
from a linear model, plot the points, and plot the linear regression line for those points. It
will also show you a p-value for the slope. Now we need to figure out what this value
means.
Remember that the linear model is testing whether there is a systematic relationship
between the input and output variable. A permutation test tells us what would happen if
there were no relationship between these variables by creating fake datasets that have the
same x and y values as the original data but where there is no relationship between the x
and y values. We then compare the model we actually got from the real data to the possible
models we could have gotten from randomly shuffled variations of the same data.
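The idea in the paragraph above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the code in hw6.html: the names shuffled, slopeOf, and permutationFraction are hypothetical, points are assumed to be [x, y] arrays, and the sketch returns the plain fraction of shuffles with a steeper slope rather than any rescaled value.

```javascript
// Hypothetical helpers for illustration only.

// Fisher-Yates shuffle that returns a new array and leaves the input untouched.
function shuffled(values, random = Math.random) {
  const out = values.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1));
    const tmp = out[i];
    out[i] = out[j];
    out[j] = tmp;
  }
  return out;
}

// Least-squares slope for an array of [x, y] pairs.
function slopeOf(points) {
  const n = points.length;
  const meanX = points.reduce((s, p) => s + p[0], 0) / n;
  const meanY = points.reduce((s, p) => s + p[1], 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const [x, y] of points) {
    sxy += (x - meanX) * (y - meanY);
    sxx += (x - meanX) * (x - meanX);
  }
  return sxy / sxx;
}

// Fraction of shuffled datasets whose |slope| exceeds the real |slope|.
function permutationFraction(points, trials, random = Math.random) {
  const realSlope = Math.abs(slopeOf(points));
  const xs = points.map((p) => p[0]);
  const ys = points.map((p) => p[1]);
  let steeper = 0;
  for (let t = 0; t < trials; t++) {
    const fakeYs = shuffled(ys, random);          // break the x-y pairing
    const fake = xs.map((x, i) => [x, fakeYs[i]]); // same x order, shuffled y
    if (Math.abs(slopeOf(fake)) > realSlope) {
      steeper++;
    }
  }
  return steeper / trials;
}
```

For data that lie exactly on a line, no shuffle can produce a steeper slope, so the fraction is 0. For data with no relationship, the real slope is just one more draw from the shuffled distribution, and the fraction can land anywhere between 0 and 1.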
a. Create a function that randomly shuffles the values of the y variable. You can start with
the provided stub of a function called permute(); you will fill in the body of this function.
In this function create a new array of x,y pairs. The x values should appear in the same
order as the x values in the input array points, but the y values should be randomly
shuffled. In other words, each y value in the input array should appear the same number of
times in the output array, but it may or may not be paired with its correct x value. (Class
notes may be of interest.) (10 pts)
b. Inside the run() function, create 200 random permutations of the real data. For each
one, compute a linear model from that permuted data. Call drawLine() with a different
color or opacity value from the real-data line. Check whether the absolute value of the
permuted-data slope is larger than the absolute value of the real-data slope. Set
steeperSlopes equal to the total number of permutations for which this condition is true.
(20 pts)
c. The code will now print half the number of permutations that had a more extreme (i.e.
steeper) slope than the real slope from the original points array. Hit Run a few times.
How does this permutation test value compare to the t-test p-value? (Use the free-response
d. Now explore this relationship more systematically. Create a second SVG element with an
x-axis for p-values and a y-axis for permutation test values. Both scales should go from
zero to one half. Add a diagonal line from (0,0) to (0.5, 0.5). (15 pts)
e. The code displays two values each time you hit Run, a p-value and a permutation test
value. Plot a circle on the second SVG element whose x position is the p-value and whose y
position is the permutation test value. These points should accumulate as you keep hitting
Run; do not erase them. (15 pts)
f. How do these two tests compare? What happens when you change the parameters
passed to generateLinear? Turn in your work using the default settings, but experiment
with using more points, a steeper slope, and greater noise. Make sure to consider the
special case where there is no relationship (slope = 0.0). What does the first plot look like
with different model settings? How do the p-value and permutation results change? (Use
the free-response
g. The value p < 0.05 is often considered special. Generate points with slope = 0.02 and
noise = 0.3. Roughly how many points do you need to generate to get p-values consistently
(i.e. more than 75% of the time) less than 0.05? Now generate with slope = 0.2 and
noise = 0.3. Roughly how many points can you generate and not get p-values consistently
less than 0.05? (10 pts)
h. How many possible permutations of the y-values exist for 10 data points? Can we really
just sample 200 of them? Change the number of permutations and describe how different
values affect the relationship between the p-value and permutation results. (10 pts)
EXTRA. Why do we need to divide the ratio of steeper slopes in half? (0 pts)