Assignment 2. Due: through course web (Moodle) on 9/22/2016, before 11:55 PM. Note: See the course web page for the late turn-in policy.
Note: Collaboration policy: see Basic Information link on the course web page for more details.
Problem 1. Sequence Alignment. 15 points.
Calculate and show the Dynamic Programming matrix and an optimal alignment for the DNA sequences GCTTAGC and GCATTGC, scoring +3 for a match, -2 for a mismatch, and with a gap penalty of 3 (i.e., each gap column contributes -3). (If there are more than one optimal alignments, report any of those.)
Problem 2. Sequence Alignment. 15 points.
Find the optimal pairwise global alignment of the sequences ATGACCATGAC and ACTGATCACTGAT with the condition that the C nucleotides shown in bold font
must be aligned with each other. What is the optimal alignment score? The scoring parameters are defined as +2 for match, -1 for mismatch and -2 for gap. Show the dynamic programming matrix you used. (And please do not create a 1113 matrix!)
Problem 3. Pattern Matching. 15 points.
Build a keyword tree, with fail edges (as in the Aho-Corasick algorithm discussed in class) for the following dictionary. Mark with dashed arrows the fail edges that do not go back to the root. That is, you do not need to show fail edges going back to the root. You may, but are not required to mark the nodes that correspond to pattern matches.
Count and note down the number of fail edges your keyword tree has. (Again, only the fail edges you marked, i.e., fail edges that do not go back to the root.)
FACED CEDER DERIVE RIVAL VALENCE
Problem 4. Sequence alignment. 50 points.
Note: you may use any programming language you wish. You are not allowed to use
any existing libraries or programs that are implemented to align sequences.
Your first task is to implement a dynamic programming algorithm that finds the optimal global alignment of two input DNA sequences. The program should take as input:
a. Two Fasta format files, each with one DNA sequence. See http://en.wikipedia.org/wiki/FASTA_format for help on this format. See http://veda.cs.uiuc.edu/courses/fa16/cs466/assign/assn2.file1.fa for a sample fasta format file. The program should be able to handle sequences of length 1000 bp each.
b. A text file containing a 4 x 4 matrix of substitution scores, to be used in the alignment (see sample below). You may use the following file as a sample to develop/debug your code: http://veda.cs.uiuc.edu/courses/fa16/cs466/assign/subs.txt
Example of a nucleotide substitution matrix:
ACGT A 91 -114 -31 -123 C -114 100 -125 -31 G -31 -125 100 -114 T -123 -31 -114 91
c. A negative real number representing the gap penalty.
The program will be run from the Linux command-line as:
<programfilename> <fastafilename1> <fastafilename2> <substitutionmatrixfile> <gap-penalty>
The output of the program should be a textual display of the optimal global alignment (any one, if more than one optimal solutions exist). In particular:
a. First, there should be a statement of the form: The optimal alignment between given sequences has score X.
b. Next, the alignment should be output in the format given by the following example:
ACCC--AC--AT-A...ATAC-CGGACATATTA...AT
The entire alignment should be in two lines, one line for each input sequence.
Note that since you will be implementing a dynamic programming alignment algorithm, we expect your program to be of quadratic time complexity (i.e., two sequences of length n can be aligned in O(n2) time).
Your second task is to write a program (called create_data_and_align below) that creates a pair of related sequences and aligns them using the alignment program from the first task.
The create_data_and_align program should take as input a single integer L. It should then create two related DNA sequences as follows:
- Create a random DNA sequence of length L, with uniform nucleotide probabilities. (I will discuss in class how to create a random sequence.) For now, we will call this sequence S1. Please use the current time to seed your pseudo-random number generator.
- Make int(L/10) random changes to sequence S1 to create sequence S2. To make a random change, randomly choose a position in the sequence, then mutate it to something else with probability 12 and delete that position (the single nucleotide at that position) with probability 12.
- Make int(L/10) random changes to sequence S1 to create sequence S3.
- Write sequences S2 and S3 from the previous steps into two separate Fasta
format files.
The create_data_and_align program should then call your alignment program from the first task above, and thus produce a new text file that has the optimal alignment of sequences S2 and S3. Call your alignment program using the subs.txt scoring matrix mentioned above and gap penalty = -500.
Your third task is to run the create_data_and_align program ten times, and store the resulting alignment files in files named aln1.txt, aln2.txt, aln10.txt. Each run should use L = 100. (and a gap penalty of -500)
Your fourth task is to run the alignment program you created in the first task on the two fasta format files included in the following downloadable tarball: http://veda.cs.uiuc.edu/courses/fa16/cs466/assign/assn2-alnproblem.tar.gz
Write your alignment in a file called mel-vir-aln.txt.
Use the substitution matrix file subs.txt linked to above, and a gap penalty of -200.
What to turn in for problem 4:
Create a tarball or zip archive of a directory containing the following files, and upload along with your solution to the assignment.
a. Files that have your code. (The alignment program, the create_data_and_align program.)
b. The files aln*.txt created from your third task above.
c. The alignment file mel-vir-aln.txt from your fourth task above.
Note: the TA is not going to run your programs, and you are going to submit the code only as part of our records that you did indeed implement them.
Reviews
There are no reviews yet.