This short programming assignment is designed to help you understand the basics of power and association computations from a computational perspective. You can use any language for this project, though Python and R are recommended. You will need to submit your code along with your results file through CCLE.
Association Study at a Single SNP
Reading the input
The data for this programming assignment consists of a matrix of 2,000 individuals and 100,000 SNPs, plus one column representing each individual's status as a case or control. Unzip the file SNP_status.zip, and you'll see a space-separated text file. Each column of the input represents a single SNP, and each row represents an individual. Each element in the matrix represents the number of copies of a single SNP an individual has.
To get started, you may want to look at the documentation for the read_csv function in the Python module pandas, or the documentation of R's read.table function.
If you aren't sure where that is, literally copy and paste the module and method you're looking for into Google.
That trick increases beginning computer programmer productivity by about 50 percent.
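As a starting point, here is a minimal sketch of loading the file with pandas. It assumes the file has a header row of SNP names and that the case/control column is called "Status"; both are guesses about the data, so check the first line of your unzipped file and adjust. The function name load_snp_table is made up for this example.

```python
import io

import pandas as pd


def load_snp_table(source, status_column="Status"):
    """Read a whitespace-separated genotype table.

    `source` may be a file path or an open file-like object.
    `status_column` is an ASSUMED column name; change it to match your file.
    """
    table = pd.read_csv(source, sep=r"\s+")
    status = table[status_column]
    genotypes = table.drop(columns=[status_column])
    return genotypes, status


# Tiny in-memory stand-in for the real file (2 individuals, 2 SNPs).
sample = io.StringIO("SNP0000 SNP0001 Status\n0 2 1\n1 1 0\n")
genotypes, status = load_snp_table(sample)
print(genotypes.shape)
```

For the real data you would pass the path of the unzipped text file instead of the in-memory sample.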
Part A Computing Association Statistics
Your job will be to compute the association statistic, the non-centrality parameter, and the corresponding p-value for each SNP.
You should use the formulas that we have developed in class to compute association statistics. Use the statistic:
Once you have the association statistic, use the cumulative distribution function of the standard normal distribution to compute p-values. Use a two-tailed test for significance; report the p-values as described in Part D.
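The formula from class is not reproduced here, so the sketch below assumes the usual two-sample allele-frequency z-statistic. It may differ in detail from the class version, and you should use the class formula if it does:

$$S_A = \frac{\hat{p}^{+} - \hat{p}^{-}}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{2N^{+}} + \frac{1}{2N^{-}}\right)}}$$

Here $\hat{p}^{+}$ and $\hat{p}^{-}$ are the allele frequencies in cases and controls, $\hat{p}$ is the pooled allele frequency, and $N^{+}$ and $N^{-}$ are the numbers of case and control individuals. The two-tailed p-value is $2(1 - \Phi(|S_A|))$, which equals erfc(|S_A| / sqrt(2)). The function names are made up for this example.

```python
import math

import numpy as np


def association_statistics(genotypes, is_case):
    """Per-SNP z-statistics under the ASSUMED allele-frequency formula.

    genotypes: (individuals x SNPs) array of allele counts (0, 1 or 2).
    is_case:   boolean vector, True for cases, False for controls.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)
    n_case = int(is_case.sum())
    n_ctrl = int((~is_case).sum())
    # Each individual carries two chromosomes, so frequency = mean count / 2.
    p_case = genotypes[is_case].mean(axis=0) / 2.0
    p_ctrl = genotypes[~is_case].mean(axis=0) / 2.0
    p_all = (n_case * p_case + n_ctrl * p_ctrl) / (n_case + n_ctrl)
    variance = p_all * (1.0 - p_all) * (1.0 / (2 * n_case) + 1.0 / (2 * n_ctrl))
    # SNPs with no variation give 0/0 and come out as NaN.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (p_case - p_ctrl) / np.sqrt(variance)


def two_tailed_p(z):
    """Two-tailed standard normal p-value: 2 * (1 - Phi(|z|))."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

If SciPy is available, 2 * scipy.stats.norm.sf(abs(z)) computes the same p-value and works on whole arrays at once.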
Part B Bonferroni Correction
Bonferroni correction is a way to maintain the family-wise error rate (FWER) of a set of related tests. If we perform m tests at significance level $\alpha$ on data that conform to the null hypothesis, we would expect $\alpha m$ tests to pass that level of significance on average. When m is large, that can be a large number of false positives. To combat this, we make each test pass a more stringent threshold of $\alpha / m$.
For this assignment, use $\alpha$ = 0.05 for your two-tailed test. That is, before correction, a test is considered significant if and only if its statistic falls within the upper or lower 2.5% tail of the standard normal distribution.
Apply Bonferroni correction to your results from Part A and report the significantly associated SNPs as described in Part D.
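To make the threshold concrete: with 100,000 SNPs and a significance level of 0.05, the per-test threshold is 0.05 / 100,000 = 5e-7. A minimal sketch of the filtering step follows, assuming the p-values are held in a dictionary keyed by SNP name; the function name is made up for this example.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the names of SNPs whose p-value is below alpha / m.

    p_values: dict mapping SNP name -> raw two-tailed p-value.
    m is the number of tests, taken here to be len(p_values).
    """
    m = len(p_values)
    if m == 0:
        return []
    threshold = alpha / m
    return [name for name, p in p_values.items() if p < threshold]


example = {"SNP0000": 0.175, "SNP0001": 0.875, "SNP0002": 0.0003}
print(bonferroni_significant(example))
```

With only three tests the threshold is 0.05 / 3, so SNP0002 is the only name returned.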
Part C Q-Q plots and Inflation Statistics (Optional for Undergrads, Required for Grads)
Recall that a p-value is the probability, under the null hypothesis, that an association statistic at least as extreme as the one observed will appear in the data. Given that definition, we should expect only 30% of the individual hypotheses that we test to have p-values stronger (that is, smaller) than 0.3. If we see more than 30% of our statistics with p-values stronger than 0.3, that indicates our statistical test may be detecting many false positives.
However, it is often the case that we see tests whose p-values are significantly increased or decreased versus our expectation. One way that this increase or decrease is quantified is using a statistic called $\lambda_{gc}$, or the genomic control statistic.
$$\lambda_{gc} = \frac{\text{median empirical } \chi^2 \text{ statistic}}{\text{median value of the } \chi^2_1 \text{ distribution}}$$
Using Python or R, find the median value of the chi-square ($\chi^2$) distribution with one degree of freedom.
You can compute $\chi^2$-distributed statistics by squaring the statistics you generated in Part A.
Compute $\lambda_{gc}$ for your study, and report it as described in Part D.
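One way to carry out these three steps using only the Python standard library is sketched below. It relies on the fact that the square of a standard normal variable is chi-square distributed with one degree of freedom, so the chi-square median is the square of the normal 75th percentile. The names are made up for this example, and the input is assumed to be a plain list of Part A statistics with no missing values.

```python
from statistics import NormalDist, median

# P(Z**2 <= m) = 0.5 exactly when P(Z <= sqrt(m)) = 0.75, so the median of
# the 1-df chi-square distribution is the squared 75th percentile of Z.
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2  # about 0.455


def genomic_inflation(z_statistics):
    """Median of the squared statistics divided by the chi-square median."""
    chi2 = [z * z for z in z_statistics]
    return median(chi2) / CHI2_1DF_MEDIAN
```

A value near 1 means the statistics look like the null distribution at the median; values above 1 indicate inflation.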
There are many tutorials on how to make a Q-Q plot online; if you are new to Python or R plotting, it is recommended that you adapt code from one of these to suit this purpose. If you're stuck, search Google for "QQ Plot R" or "QQ Plot Python" and read the documentation and tutorials.
There are several ways to make a Q-Q plot; what they all have in common is that the x-axis is the expected value of a statistic for a particular quantile, and the y-axis is the experimental value of the statistic at that quantile.
Figure 1: A Q-Q plot of p-values, from http://www.gettinggeneticsdone.com/2011/04/annotated-manhattan-plots-and-qq-plots.html
Some statistics you can use include your $S_A$ statistics from Part A, which are distributed as a standard normal under the null hypothesis; the $\chi^2$ statistics you computed for $\lambda_{gc}$, which are distributed as a chi-square with one degree of freedom under the null; and the p-values, which should be distributed uniformly between 0 and 1 under the null. If you use p-values, it is generally more informative to show their negative log base 10. If you show the raw p-values, we will not be able to see the interesting very small p-values, since their numerical values will be clustered around 0.
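For the p-value version, a minimal sketch follows. It pairs the i-th smallest observed p-value with the expected uniform quantile (i - 0.5) / m, which is one common plotting position among several, and it assumes every p-value is strictly positive. The function names and the output file name are made up for this example; the plotting part needs matplotlib.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) lists of -log10 p-values for a Q-Q plot.

    Both lists run from the most significant point to the least.
    Assumes all p-values are > 0 (log10 of 0 is undefined).
    """
    m = len(p_values)
    observed = [-math.log10(p) for p in sorted(p_values)]
    # Under the null, p-values are uniform on (0, 1).
    expected = [-math.log10((i + 0.5) / m) for i in range(m)]
    return expected, observed


def plot_qq(p_values, path="qq_plot.png"):
    """Save a Q-Q scatter plot with the y = x reference line."""
    import matplotlib.pyplot as plt  # imported here so qq_points works alone

    expected, observed = qq_points(p_values)
    fig, ax = plt.subplots()
    ax.scatter(expected, observed, s=4)
    limit = max(expected)
    ax.plot([0, limit], [0, limit], color="gray")  # null expectation
    ax.set_xlabel("Expected -log10(p)")
    ax.set_ylabel("Observed -log10(p)")
    fig.savefig(path)
```

Points that rise above the reference line at the right-hand end are p-values smaller than the null predicts.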
Make an image of a Q-Q plot of your data and post it in the forums. Try to explain any trends you see.
Part D Output Format
The most important part of this assignment is that you input and output data in the proper format. The first line should be your UID; it is recommended that you also put your UID in the title of your file, but it's not required.
UID:{Your UID} email:{Your email}
Undergrad or Grad:{Grad if you're a graduate student, undergrad otherwise}
<A>
{SNPNAME}:{RAW-P-VALUE}
{SNPNAME}:{RAW-P-VALUE}
{SNPNAME}:{RAW-P-VALUE}
</A>
<B>
{SIGNIFICANT-SNP1}
{SIGNIFICANT-SNP2}
</B>
<C>
Lambda_gc:<Lambda Value>
</C>
If I were to submit an assignment, my output would look something like this:
UID:123456789 email:[email protected]
Undergrad or Grad:Grad
<A>
SNP0000:0.175
SNP0001:0.875
SNP0002:0.0003
</A>
<B>
SNP0002
</B>
<C>
Lambda_gc:1.872
</C>
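A small helper that assembles the results file in this layout might look like the sketch below. The function name, argument names, and the email address in the usage example are made up; p-values are written with Python's default float formatting, which switches to scientific notation for very small values.

```python
def format_results(uid, email, level, p_values, significant, lambda_gc):
    """Build the results file text.

    p_values:    dict mapping SNP name -> raw p-value, in output order.
    significant: iterable of SNP names that passed Bonferroni correction.
    level:       "Grad" or "Undergrad".
    """
    lines = [f"UID:{uid} email:{email}", f"Undergrad or Grad:{level}", "<A>"]
    lines += [f"{name}:{p}" for name, p in p_values.items()]
    lines += ["</A>", "<B>"]
    lines += list(significant)
    lines += ["</B>", "<C>", f"Lambda_gc:{lambda_gc}", "</C>"]
    return "\n".join(lines) + "\n"


text = format_results(
    "123456789", "student@example.com", "Grad",
    {"SNP0000": 0.175, "SNP0001": 0.875, "SNP0002": 0.0003},
    ["SNP0002"], 1.872,
)
print(text)
```

Write the returned string to your results file with an ordinary open(..., "w") call.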