[SOLVED] CS computational biology algorithm Lecture 7:

$25

File Name: CS_computational_biology_algorithm_Lecture_7:.zip
File Size: 423.9 KB

5/5 - (1 vote)

Lecture 7:
Dynamic Programming II (Adv)
The University of Sydney
Page 1
Fast Fourier Transform

How does Google know what I want?
The University of Sydney Page 4

Edit distance
How similar are two strings? S = ocurrance
T = occurrence
o
c

u
r
r
a
n
c
e
o
c
c
u
r
r
e
n
c
e
1 mismatch, 1 gap
Modify S to get T by adding c, swapping a-e The University of Sydney
Page 5

Which edit is better?
o
c

u
r
r
a
n
c
e
o
c
c
u
r
r
e
n
c
e
1 mismatch, 1 gap
o
c

u
r
r

a
n
c
e
o
c
c
u
r
r
e

n
c
e
0 mismatches, 3 gaps
o
c
u
r
r
a
n
c
e

o
c
c
u
r
r
e
n
c
e
The University of Sydney
5 mismatches, 1 gap
Page 8

Edit Distance
Applications.
Basis for Unix diff.
Speech recognition.
Computational biology.
Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970] Gap penalty d and mismatch penalty apq.
Cost = sum of gap and mismatch penalties.
C
T
G
A
C
C
T
A
C
C
T

C
T
G
A
C
C
T
A
C
C
T
C
C
T
G
A
C
T
A
C
A
T
C
C
T
G
A
C

T
A
C
A
T
aTC + aGT + aAG+ 2aCA
The University of Sydney Page 9
2d + aCA

Sequence Alignment
Goal: Given two strings X = x1 x2 . . . xm and Y = y1 y2 . . . yn find alignment of minimum cost.
Definition: An alignment M is a set of ordered pairs xi-yj such that each item occurs in at most one pair and no crossings.
Definition: The pair xi-yj and xa-yb cross if i < a, but j > b.
Example: CTACCG vs. TACATG.
x1 x2 x3 x4 x5 x6
C
T
A
C
C

G

T
A
C
A
T
G
The University of Sydney M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6. Page 10
y1 y2 y3 y4 y5 y6

Sequence Alignment
cost(M) = Saxi,yj + Sd + Sd (xi,yj)IM i: xi unmatched j: yj unmatched
mismatch Example: CTACCG vs. TACATG.
x1 x2 x3 x4 x5 x6
C
T
A
C
C

G

T
A
C
A
T
G
The University of Sydney
Page 11
y1 y2 y3 y4 y5 y6
M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6.

Key steps: Dynamic programming
1. Define subproblems 2. Find recurrences
3. Solve the base cases
4. Transform recurrence into an efficient algorithm
The University of Sydney Page 12

Sequence Alignment: Problem Structure Step 1: Define subproblems
OPT(i, j) =
min cost of aligning strings x1 x2 . . . xi and y1 y2 yj.
x1 x2 x3 x4 x5 xi

C
T
A
C
C

G

T
A
C
A
T
G
The University of Sydney
Page 14
y1 y2 y3 y4 y5 yj

Sequence Alignment: Problem Structure
Definition: OPT(i, j) = min cost of aligning strings x1 x2 . . . xi and y1 y2 yj.
Case 1: OPT matches xi-yj.
pay mismatch for xi-yj + min cost of aligning
twostringsx1 x2 xi-1 andy1 y2 yj-1 Case 2a: OPT leaves xi unmatched.
paygapforxi andmincostofaligningx1 x2 xi-1 andy1 y2 yj Case 2b: OPT leaves yj unmatched.
paygapforyj andmincostofaligningx1 x2 xi andy1 y2 yj-1
OPT(i, j) = min
axiyj + OPT(i-1,j-1) d + OPT(i-1, j)
d + OPT(i, j-1)
The University of Sydney Page 19

Sequence Alignment: Problem Structure Step 3: Solve the base cases
OPT(0,j) = jd and OPT(i,0) = id
OPT(i,j) = jd if i=0 OPI(i, j) = OPT(i,j) = id if j=0
min{axiyj + OPT(i-1, j-1),d + OPT(i-1, j),d + OPT(i, j-1)} otherwise
The University of Sydney
Page 20

Sequence Alignment: Algorithm
Sequence-Alignment(m, n, x1x2xm, y1y2yn, d, a) { for i = 0 to m
M[0, i] = id for j = 0 to n
M[j, 0] = jd
for i = 1 to m
for j = 1 to n
M[i, j] = min{a[xi, yj] + M[i-1, j-1], d + M[i-1, j],
return M[m, n]
}
d + M[i, j-1]}
Analysis. Q(mn) time and space.
English words or sentences: m, n 100.
Computational biology: m = n = 100,000. 10 billions ops OK, but 10GB array?
The University of Sydney
Page 21

Sequence Alignment: Algorithm
Sequence-Alignment(m, n, x1x2xm, y1y2yn, d, a) { for i = 0 to m
M[0, i] = id for j = 0 to n
M[j, 0] = jd
for i = 1 to m
for j = 1 to n
M[i, j] = min{a[xi, yj] + M[i-1, j-1], d + M[i-1, j],
return M[m, n]
}
d + M[i, j-1]}
To get the alignment itself we can trace back through the array M.
The University of Sydney Page 22

Sequence Alignment: Linear Space
Question: Can we avoid using quadratic space?
Easy. Optimal value in O(m + n) space and O(mn) time.
Compute OPT(i, j) from OPT(i-1, j), OPT(i-1, j-1) and OPT(i, j-1). BUT! No longer a simple way to recover alignment itself.
Theorem: [Hirschberg 1975] Optimal alignment in O(m + n) space and O(mn) time.
Clever combination of divide-and-conquer and dynamic programming. Inspired by idea of Savitch from complexity theory.
The University of Sydney Page 25

Sequence Alignment: Linear Space
Edit distance graph.
m n grid graph GXY (as shown in the figure)
Horizontal/vertical edges have cost d (gap)
Diagonal edges from (i-1, j-1) to (i,j) have cost axiyj. (match)
e y1 y2 y3 y4 y5 y6 0-0
axiyj d
d i-j
e
x1 x2
x3
m-n
The University of Sydney
Page 26

Sequence Alignment: Linear Space
Edit distance graph.
Let f(i, j) be cheapest path from (0,0) to (i, j). Observation: f(i, j) = OPT(i, j).
e y1 y2 y3 y4 y5 y6 0-0
axiyj d
d i-j
e
x1 x2
x3
m-n
The University of Sydney
Page 27

Sequence Alignment: Linear Space
min{axiyj + OPT(i-1, j-1),d + OPT(i-1, j),d + OPT(i, j-1)}
e
x1 x2
x3
e y1 y2 y3 y4 y5 y6 0-0
axiyj d
d i-j
m-n
The University of Sydney
Page 28

Sequence Alignment: Linear Space
Edit distance graph.
Let f(i, j) be cheapest path from (0,0) to (i, j).
Can compute f(m,n) in O(mn) time and O(mn) space.
j
y4
i-j
e y1 y2 y3 0-0
x1 x2
y5 y6
e
x3
m-n
The University of Sydney
Page 29

Sequence Alignment: Linear Space
Edit distance graph.
Let f(i, j) be cheapest path from (0,0) to (i, j).
If only interested in the value of the optimal alignment we do it in O(mn)
time and O(m + n) space.
e y1 y2 y3 0-0
x1 x2
j
y4
i-j
y5 y6
e
x3
m-n
The University of Sydney
Page 30

Sequence Alignment: Linear Space
Edit distance graph.
Let f(i, j) be cheapest path from (0,0) to (i, j).
This corresponds to optimal alignment of X[1..i] and Y[1..j]
For any column j, can compute f(, j) in O(mn)j time and O(m + n) space
y4
i-j
e y1 y2 y3 0-0
x1 x2
y5 y6
e
x3
m-n
The University of Sydney
Page 31

Sequence Alignment: Linear Space
Edit distance graph.
Let f(i, j) be cheapest path from (0,0) to (i, j).
If only interested in the value of the optimal alignment we do it in O(mn) time and O(m + n) space.
Space-Efficient-Alignment(X,Y) {
array B[0..m,0..1]
for i = 0 to m
B[i,0] = id for j = 1 to n B[0,1] = jd
#Collapse A into an m 2 array] # B[i,0] = A[i,j-1]
# B[i,1] = A[i,j]
#corresponds to A[0,j]
for i = 1 to m
B[i,1] = min(a[xi, yj] + B[i-1,0],
d + B[i-1,1],d + B[i,0])
endFor
Move column i of B to column 0 #(B[i,0]=B[i,1]) endFor
}
The University of Sydney Page 32

Sequence Alignment: Linear Space
Edit distance graph.
Let f(i, j) be length of cheapest path from (0,0) to (i, j).
This corresponds to cost of optimal alignment of X[1..i] and Y[1..j] Observation: f(i, j) = OPT(i, j).
e y1 y2 y3 y4 y5 y6 0-0
e
x1 x2
d d
i-j
x3
m-n
The University of Sydney
Page 33

Sequence Alignment: Linear Space
Edit distance graph.
Let g(i, j) be cheapest path from (i, j) to (m, n).
This corresponds to optimal alignment of X[m..i+1] and Y[m..j+1]
Can compute by reversing the edge orientations and inverting the roles of (0, 0) and (m, n)
0-0
x1
x2 x3
ePage 34
i-j d
d
The University of Sydney m-n y1 y2 y3 y4 y5 y6 e

Sequence Alignment: Linear Space
Edit distance graph.
Let g(i, j) be cheapest path from (i, j) to (m, n).
For any column j, can compute g(, j) for all j in O(mn) time and O(m +
n) space.
e y1 y2 0-0
x1 x2
j
y3
i-j
y4 y5 y6
e
x3
m-n
The University of Sydney
Page 35

Sequence Alignment: Linear Space
Observation 1: The cost of the cheapest path that uses (i, j) is f(i, j) + g(i, j).
e y1 y2 y3 y4 y5 y6 0-0
x1 i-j x2
e
x3
m-n
The University of Sydney
Page 36

Sequence Alignment: Linear Space
Observation 2: Let q be an index that minimizes f(q, n/2) + g(q, n/2). Then, the cheapest path from (0, 0)
e y1 y2 0-0
x1 x2
y4 y5 y6
to (m, n) uses (q, n/2).
n/2
y3
i-j
e
q
x3
m-n
The University of Sydney
Page 37

Sequence Alignment: Linear Space
Divide: Find index q that minimizes f(q, n/2) + g(q, n/2) using DP. Can be done in O(nm) time and O(n + m) space
Align xq and yn/2.
Conquer: recursively compute optimal alignment in each piece.
e y1 y2 0-0
x1
e
y3
q,n/2
n/2
m-n
y4 y5 y6
q
x2
x3
The University of Sydney
Page 38

Pseudocode
Divide-and-Conquer alignment(X,Y) { If |X|2 or |Y|2 then
OptimalAlignment(X,Y) #Alg using quadratic space f(,n/2)=Space-Efficient-Alignment(X,Y[1..n/2]) g(,n/2)=Backward-S-E-Alignment(X,Y[n/2..n])
Let q be the index minimizing f(q,n/2)+g(q,n/2)
A = Divide-and-Conquer alignment(X[1..q],Y[1..n/2])
B = Divide-and-Conquer alignment(X[m..q+1],Y[n..n/2+1]) return concatenation of A and B
}
n/2
q
The University of Sydney
Page 39

Sequence Alignment: Running Time Analysis Warmup
Theorem: Let T(m, n) = max running time of algorithm on strings of length at most m and n. T(m, n) = O(mn log n).
T(m, n) 2T(m, n/2) + O(mn) T(m, n) = O(mn logn)
Remark: Analysis is not tight because two subproblems are of size (q, n/2) and (m q, n/2).
The University of Sydney
Page 40

Sequence Alignment: Running Time Analysis
Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.
Proof: (by induction on n)
O(mn)timetocomputef(,n/2)andg(,n/2)andfindindexq. T(q, n/2) + T(m q, n/2) time for two recursive calls.
The University of Sydney
Page 41

Sequence Alignment: Running Time Analysis
Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.
Proof: (by induction on n)
O(mn)timetocomputef(,n/2)andg(,n/2)andfindindexq. T(q, n/2) + T(m q, n/2) time for two recursive calls.
For some constant c we have:
T(m,2) cm
T(2, n) cn
T(m, n) cmn+T(q, n/2)+T(m-q, n/2)
The University of Sydney
Page 42

Sequence Alignment: Running Time Analysis
Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.
Proof: (by induction on n)
O(mn)timetocomputef(,n/2)andg(,n/2)andfindindexq. T(q, n/2) + T(m q, n/2) time for two recursive calls.
For some constant c we have:
Basecases:m2orn2.
Inductive hypothesis: T(m, n) 2cmn for m and n with mn < mn. T(m,2) cmT(2, n) cnT(m, n) cmn+T(q, n/2)+T(m-q, n/2) T(m,n) T(q,n/2)+T(m-q,n/2)+cmn 2cqn/2+2c(m-q)n/2+cmn= cqn+cmn-cqn+cmn = 2cmnPage 43The University of Sydney Sequence Alignment: Running Time AnalysisTheorem: An optimal alignment can be computed in O(mn) time using O(m+n) space.The University of SydneyPage 44 Sequence Alignment: History [m=n] Needleman and Wunsch 1970 O(n3) Sankoff 1972 O(n2)[see also Vintsyuk68 for speech processingWagner and Fisher74 for string matching] Still an active research area (experimental research) Chakraborty and Angana13 (claimed 54-90% speedup)The University of SydneyPage 45 Generalising the algorithmProblem:Nature often inserts or removes entire substrings of nucleotides (creating long gaps), rather than editing just one position at a time.The penalty for a gap of length 10 should not be 10 times the penalty for a gap of length 1, but something significantly smaller. Can we modify the scoring function in which the penalty for a gap of length k is:d0+d1k ?The University of SydneyPage 46 Dynamic Programming Summary 1Ddynamicprogramming Weightedintervalscheduling SegmentedLeastSquares Maximum-sumcontiguoussubarray Longestincreasingsubsequence 2Ddynamicprogramming Knapsack Sequencealignment Dynamicprogrammingoverintervals RNASecondaryStructure Dynamicprogrammingoversubsets TSP k-path PlaylistThe University of SydneyPage 47

Reviews

There are no reviews yet.

Only logged in customers who have purchased this product may leave a review.

Shopping Cart
[SOLVED] CS computational biology algorithm Lecture 7:
$25