String Algorithms
Jaehyun Park CS 97SI
Stanford University June 30, 2015
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Outline
String Matching Problem 2
String Matching Problem
Given a text T and a pattern P, find all occurrences of P within T
Notations:
nandm: lengthsofP andT
: set of alphabets (of constant size) Pi: ith letter of P (1-indexed)
a, b, c: single letters in
x, y, z: strings
String Matching Problem 3
Example
T = AGCATGCTGCAGTCATGCTTAGGCTA P = GCT
P appears three times in T
A naive method takes O(mn) time
Initiate string comparison at every starting point Each comparison takes O(m) time
We can do much better!
String Matching Problem 4
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Outline
Hash Table 5
Hash Function
A function that takes a string and outputs a number A good hash function has few collisions
i.e., If x = y, H(x) = H(y) with high probability
An easy and powerful hash function is a polynomial mod some prime p
Consider each letter as a number (ASCII value is fine)
H(x1 xk) = x1ak1 +x2ak2 ++xk1a+xk (mod p) How do we find H(x2 xk+1) from H(x1 xk)?
Hash Table 6
Hash Table
Main idea: preprocess T to speedup queries Hash every substring of length k
k is a small constant
For each query P , hash the first k letters of P to retrieve all the occurrences of it within T
Dont forget to check collisions!
Hash Table 7
Pros:
Easy to implement
Hash Table
Significant speedup in practice
Cons:
Doesnt help the asymptotic efficiency
Can still take (nm) time if hashing is terrible or data is difficult
A lot of memory consumption
Hash Table 8
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Knuth-Morris-Pratt (KMP) Algorithm 9
Knuth-Morris-Pratt (KMP) Matcher
A linear time (!) algorithm that solves the string matching problem by preprocessing P in (m) time
Main idea is to skip some comparisons by using the previous comparison result
Uses an auxiliary array that is defined as the following:
[i] is the largest integer smaller than i such that P1 . . . P[i] is
a suffix of P1 Pi
Its better to see an example than the definition
Knuth-Morris-Pratt (KMP) Algorithm 10
Table Example (from CLRS)
[i]isthelargestintegersmallerthanisuchthatP1P[i] is a suffix of P1 Pi
e.g., [6] = 4 since abab is a suffix of ababab
e.g., [9] = 0 since no prefix of length 8 ends with c
Lets see why this is useful
Knuth-Morris-Pratt (KMP) Algorithm 11
Using the Table
T = ABC ABCDAB ABCDABCDABDE
P = ABCDABD
= (0,0,0,0,1,2,0)
Start matching at the first position of T:
Mismatchatthe4thletterofP!
Knuth-Morris-Pratt (KMP) Algorithm 12
Using the Table
Wematchedk=3letterssofar,and[k]=0
Thus, there is no point in starting the comparison at T2, T3
(crucial observation)
ShiftP byk[k]=3letters
Mismatch at T4 again!
Knuth-Morris-Pratt (KMP) Algorithm 13
Mismatch at T11!
Using the Table
We matched k = 0 letters so far
ShiftP byk[k]=1letter(wedefine[0]=1)
Knuth-Morris-Pratt (KMP) Algorithm 14
Using the Table
[6]=2meansP1P2 isasuffixofP1P6
ShiftP by6[6]=4letters
Again, no point in shifting P by 1, 2, or 3 letters Knuth-Morris-Pratt (KMP) Algorithm 15
Mismatch at T11 again!
Currently 2 letters are matched ShiftP by2[2]=2letters
Using the Table
Knuth-Morris-Pratt (KMP) Algorithm 16
Mismatch at T11 yet again!
Currently no letters are matched ShiftP by0[0]=1letter
Using the Table
Knuth-Morris-Pratt (KMP) Algorithm 17
Mismatch at T18
Currently 6 letters are matched ShiftP by6[6]=4letters
Using the Table
Knuth-Morris-Pratt (KMP) Algorithm 18
Using the Table Finally, there it is!
Currently all 7 letters are matched
After recording this match (at T16 . . . T22, we shift P again in order to find other matches
Shiftby7[7]=7letters
Knuth-Morris-Pratt (KMP) Algorithm 19
Computing
Observation 1: if P1 P[i] is a suffix of P1 Pi, then P1 P[i]1 is a suffix of P1 Pi1
Well, obviously
Observation 2: all the prefixes of P that are a suffix of P1 . . . Pi can be obtained by recursively applying to i
e.g., P1 P[i], P1 ,P[[i]], P1 ,P[[[i]]] are all suffixes of P1 Pi
Knuth-Morris-Pratt (KMP) Algorithm 20
Computing
A non-obvious conclusion:
First, lets write (k)[i] as [] applied k times to i
e.g., (2)[i] = [[i]]
[i] is equal to (k)[i 1] + 1, where k is the smallest integer
that satisfies P(k)[i1]+1 = Pi
Ifthereisnosuchk,[i]=0
Intuition: we look at all the prefixes of P that are suffixes of P1 . . . Pi1, and find the longest one whose next letter matches Pi
Knuth-Morris-Pratt (KMP) Algorithm 21
Implementation
pi[0] = -1;
int k = -1;
for(int i = 1; i <= m; i++) {while(k >= 0 && P[k+1] != P[i])
k = pi[k];
pi[i] = ++k; }
Knuth-Morris-Pratt (KMP) Algorithm 22
Pattern Matching Implementation
int k = 0;
for(int i = 1; i <= n; i++) {while(k >= 0 && P[k+1] != T[i])
k = pi[k];
k++;
if(k == m) {
// P matches T[i-m+1..i]
k = pi[k]; }
}
Knuth-Morris-Pratt (KMP) Algorithm 23
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Suffix Trie 24
Suffix Trie
Suffix trie of a string T is a rooted tree that stores all the suffixes (thus all the substrings)
Each node corresponds to some substring of T
Each edge is associated with an alphabet
For each node that corresponds to ax, there is a special pointer called suffix link that leads to the node corresponding to x
Surprisingly easy to implement!
Suffix Trie 25
Suffix Trie
26
Example
(Figure modified from Ukkonens original paper)
Incremental Construction
Given the suffix tree for T1 Tn
Then we append Tn+1 = a to T , creating necessary nodes
StartatnodeucorrespondingtoT1Tn Create an a-transition to a new node v
Take the suffix link at u to go to u, corresponding to T2 Tn
Create an a-transition to a new node v Create a suffix link from v to v
Suffix Trie 27
Suffix Trie
28
Incremental Construction
Repeat the previous process:
Take the suffix link at the current node
Make a new a-transition there
Create the suffix link from the previous node
Stop if the node already has an a-transition
Because from this point, all nodes that are reachable via suffix links already have an a-transition
Suffix Trie
29
Construction Example
Given the suffix trie for aba We want to add a new letter c
Suffix Trie
30
Construction Example
Suffix Trie
31
Construction Example
Suffix Trie
32
Construction Example
Suffix Trie
33
Construction Example
Suffix Trie
34
Construction Example
Construction Example
Construction time is linear in the tree size But the tree size can be quadratic in n
e.g., T = aaabbb
Suffix Trie 35
Construction Example
To find P, start at the root and keep following edges labeled with P1, P2, etc.
Got stuck? Then P doesnt exist in T
Suffix Trie 36
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Suffix Array 37
Suffix Array
38
Suffix Array
Suffix Array
39
If you want to see how it works, read the paper on the course website http://cs97si.stanford.edu/suffix-array.pdf
Suffix Array
Memory usage is O(n)
Has the same computational power as suffix trie Can be constructed in O(n) time (!)
But its hard to implement
ThereisanapproachableO(nlog2n)algorithm
Notes on String Problems
Always be aware of the null-terminators
Simple hash works so well in many problems
If a problem involves rotations of some string, consider concatenating it with itself and see if it helps
Stanford team notebook has implementations of suffix arrays and the KMP matcher
Suffix Array 40
Reviews
There are no reviews yet.