String Algorithms
Jaehyun Park CS 97SI
Stanford University June 30, 2015
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Outline
String Matching Problem 2
String Matching Problem
◮ Given a text T and a pattern P, find all occurrences of P within T
◮ Notations:
– nandm: lengthsofP andT
– Σ: set of alphabets (of constant size) – Pi: ith letter of P (1-indexed)
– a, b, c: single letters in Σ
– x, y, z: strings
String Matching Problem 3
Example
◮ T = AGCATGCTGCAGTCATGCTTAGGCTA ◮ P = GCT
◮ P appears three times in T
◮ A naive method takes O(mn) time
– Initiate string comparison at every starting point – Each comparison takes O(m) time
◮ We can do much better!
String Matching Problem 4
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Outline
Hash Table 5
Hash Function
◮ A function that takes a string and outputs a number ◮ A good hash function has few collisions
– i.e., If x ̸= y, H(x) ̸= H(y) with high probability
◮ An easy and powerful hash function is a polynomial mod some prime p
– Consider each letter as a number (ASCII value is fine)
– H(x1 …xk) = x1ak−1 +x2ak−2 +···+xk−1a+xk (mod p) – How do we find H(x2 …xk+1) from H(x1 …xk)?
Hash Table 6
Hash Table
◮ Main idea: preprocess T to speedup queries – Hash every substring of length k
– k is a small constant
◮ For each query P , hash the first k letters of P to retrieve all the occurrences of it within T
◮ Don’t forget to check collisions!
Hash Table 7
◮ Pros:
– Easy to implement
Hash Table
– Significant speedup in practice
◮ Cons:
– Doesn’t help the asymptotic efficiency
◮ Can still take Θ(nm) time if hashing is terrible or data is difficult
– A lot of memory consumption
Hash Table 8
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Knuth-Morris-Pratt (KMP) Algorithm 9
Knuth-Morris-Pratt (KMP) Matcher
◮ A linear time (!) algorithm that solves the string matching problem by preprocessing P in Θ(m) time
– Main idea is to skip some comparisons by using the previous comparison result
◮ Uses an auxiliary array π that is defined as the following:
– π[i] is the largest integer smaller than i such that P1 . . . Pπ[i] is
a suffix of P1 …Pi
◮ … It’s better to see an example than the definition
Knuth-Morris-Pratt (KMP) Algorithm 10
π Table Example (from CLRS)
◮ π[i]isthelargestintegersmallerthanisuchthatP1…Pπ[i] is a suffix of P1 …Pi
– e.g., π[6] = 4 since abab is a suffix of ababab
– e.g., π[9] = 0 since no prefix of length ≤ 8 ends with c
◮ Let’s see why this is useful
Knuth-Morris-Pratt (KMP) Algorithm 11
Using the π Table
◮ T = ABC ABCDAB ABCDABCDABDE
◮ P = ABCDABD
◮ π = (0,0,0,0,1,2,0)
◮ Start matching at the first position of T:
◮ Mismatchatthe4thletterofP!
Knuth-Morris-Pratt (KMP) Algorithm 12
Using the π Table
◮ Wematchedk=3letterssofar,andπ[k]=0
– Thus, there is no point in starting the comparison at T2, T3
(crucial observation)
◮ ShiftP byk−π[k]=3letters
◮ Mismatch at T4 again!
Knuth-Morris-Pratt (KMP) Algorithm 13
◮ Mismatch at T11!
Using the π Table
◮ We matched k = 0 letters so far
◮ ShiftP byk−π[k]=1letter(wedefineπ[0]=−1)
Knuth-Morris-Pratt (KMP) Algorithm 14
Using the π Table
◮ π[6]=2meansP1P2 isasuffixofP1…P6
◮ ShiftP by6−π[6]=4letters
◮ Again, no point in shifting P by 1, 2, or 3 letters Knuth-Morris-Pratt (KMP) Algorithm 15
◮ Mismatch at T11 again!
◮ Currently 2 letters are matched ◮ ShiftP by2−π[2]=2letters
Using the π Table
Knuth-Morris-Pratt (KMP) Algorithm 16
◮ Mismatch at T11 yet again!
◮ Currently no letters are matched ◮ ShiftP by0−π[0]=1letter
Using the π Table
Knuth-Morris-Pratt (KMP) Algorithm 17
◮ Mismatch at T18
◮ Currently 6 letters are matched ◮ ShiftP by6−π[6]=4letters
Using the π Table
Knuth-Morris-Pratt (KMP) Algorithm 18
Using the π Table ◮ Finally, there it is!
◮ Currently all 7 letters are matched
◮ After recording this match (at T16 . . . T22, we shift P again in order to find other matches
– Shiftby7−π[7]=7letters
Knuth-Morris-Pratt (KMP) Algorithm 19
Computing π
◮ Observation 1: if P1 …Pπ[i] is a suffix of P1 …Pi, then P1 …Pπ[i]−1 is a suffix of P1 …Pi−1
– Well, obviously…
◮ Observation 2: all the prefixes of P that are a suffix of P1 . . . Pi can be obtained by recursively applying π to i
– e.g., P1 …Pπ[i], P1 …,Pπ[π[i]], P1 …,Pπ[π[π[i]]] are all suffixes of P1 …Pi
Knuth-Morris-Pratt (KMP) Algorithm 20
Computing π
◮ A non-obvious conclusion:
– First, let’s write π(k)[i] as π[·] applied k times to i
– e.g., π(2)[i] = π[π[i]]
– π[i] is equal to π(k)[i − 1] + 1, where k is the smallest integer
that satisfies Pπ(k)[i−1]+1 = Pi
◮ Ifthereisnosuchk,π[i]=0
◮ Intuition: we look at all the prefixes of P that are suffixes of P1 . . . Pi−1, and find the longest one whose next letter matches Pi
Knuth-Morris-Pratt (KMP) Algorithm 21
Implementation
pi[0] = -1;
int k = -1;
for(int i = 1; i <= m; i++) {while(k >= 0 && P[k+1] != P[i])
k = pi[k];
pi[i] = ++k; }
Knuth-Morris-Pratt (KMP) Algorithm 22
Pattern Matching Implementation
int k = 0;
for(int i = 1; i <= n; i++) {while(k >= 0 && P[k+1] != T[i])
k = pi[k];
k++;
if(k == m) {
// P matches T[i-m+1..i]
k = pi[k]; }
}
Knuth-Morris-Pratt (KMP) Algorithm 23
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Suffix Trie 24
Suffix Trie
◮ Suffix trie of a string T is a rooted tree that stores all the suffixes (thus all the substrings)
◮ Each node corresponds to some substring of T
◮ Each edge is associated with an alphabet
◮ For each node that corresponds to ax, there is a special pointer called suffix link that leads to the node corresponding to x
◮ Surprisingly easy to implement!
Suffix Trie 25
Suffix Trie
26
Example
(Figure modified from Ukkonen’s original paper)
Incremental Construction
◮ Given the suffix tree for T1 …Tn
– Then we append Tn+1 = a to T , creating necessary nodes
◮ StartatnodeucorrespondingtoT1…Tn – Create an a-transition to a new node v
◮ Take the suffix link at u to go to u′, corresponding to T2 …Tn
– Create an a-transition to a new node v′ – Create a suffix link from v to v′
Suffix Trie 27
Suffix Trie
28
Incremental Construction
◮ Repeat the previous process:
– Take the suffix link at the current node
– Make a new a-transition there
– Create the suffix link from the previous node
◮ Stop if the node already has an a-transition
– Because from this point, all nodes that are reachable via suffix links already have an a-transition
Suffix Trie
29
Construction Example
Given the suffix trie for aba We want to add a new letter c
Suffix Trie
30
Construction Example
Suffix Trie
31
Construction Example
Suffix Trie
32
Construction Example
Suffix Trie
33
Construction Example
Suffix Trie
34
Construction Example
Construction Example
◮ Construction time is linear in the tree size ◮ But the tree size can be quadratic in n
– e.g., T = aa…abb…b
Suffix Trie 35
Construction Example
◮ To find P, start at the root and keep following edges labeled with P1, P2, etc.
◮ Got stuck? Then P doesn’t exist in T
Suffix Trie 36
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Suffix Array 37
Suffix Array
38
Suffix Array
Suffix Array
39
– –
If you want to see how it works, read the paper on the course website http://cs97si.stanford.edu/suffix-array.pdf
Suffix Array
◮ Memory usage is O(n)
◮ Has the same computational power as suffix trie ◮ Can be constructed in O(n) time (!)
– But it’s hard to implement
◮ ThereisanapproachableO(nlog2n)algorithm
Notes on String Problems
◮ Always be aware of the null-terminators
◮ Simple hash works so well in many problems
◮ If a problem involves rotations of some string, consider concatenating it with itself and see if it helps
◮ Stanford team notebook has implementations of suffix arrays and the KMP matcher
Suffix Array 40
Reviews
There are no reviews yet.