String Algorithms
Jaehyun Park CS 97SI
Stanford University June 30, 2015 
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Outline
String Matching Problem 2 
String Matching Problem
 Given a text T and a pattern P, find all occurrences of P within T
 Notations:
 nandm: lengthsofP andT
 : set of alphabets (of constant size)  Pi: ith letter of P (1-indexed)
 a, b, c: single letters in 
 x, y, z: strings
String Matching Problem 3 
Example
 T = AGCATGCTGCAGTCATGCTTAGGCTA  P = GCT
 P appears three times in T
 A naive method takes O(mn) time
 Initiate string comparison at every starting point  Each comparison takes O(m) time
 We can do much better!
String Matching Problem 4 
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Outline
Hash Table 5 
Hash Function
 A function that takes a string and outputs a number  A good hash function has few collisions
 i.e., If x = y, H(x) = H(y) with high probability
 An easy and powerful hash function is a polynomial mod some prime p
 Consider each letter as a number (ASCII value is fine)
 H(x1 xk) = x1ak1 +x2ak2 ++xk1a+xk (mod p)  How do we find H(x2 xk+1) from H(x1 xk)?
Hash Table 6 
Hash Table
 Main idea: preprocess T to speedup queries  Hash every substring of length k
 k is a small constant
 For each query P , hash the first k letters of P to retrieve all the occurrences of it within T
 Dont forget to check collisions!
Hash Table 7 
 Pros:
 Easy to implement
Hash Table
 Significant speedup in practice
 Cons:
 Doesnt help the asymptotic efficiency
 Can still take (nm) time if hashing is terrible or data is difficult
 A lot of memory consumption
Hash Table 8 
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Knuth-Morris-Pratt (KMP) Algorithm 9 
Knuth-Morris-Pratt (KMP) Matcher
 A linear time (!) algorithm that solves the string matching problem by preprocessing P in (m) time
 Main idea is to skip some comparisons by using the previous comparison result
 Uses an auxiliary array  that is defined as the following:
 [i] is the largest integer smaller than i such that P1 . . . P[i] is
a suffix of P1 Pi
  Its better to see an example than the definition
Knuth-Morris-Pratt (KMP) Algorithm 10 
 Table Example (from CLRS)
 [i]isthelargestintegersmallerthanisuchthatP1P[i] is a suffix of P1 Pi
 e.g., [6] = 4 since abab is a suffix of ababab
 e.g., [9] = 0 since no prefix of length  8 ends with c
 Lets see why this is useful
Knuth-Morris-Pratt (KMP) Algorithm 11 
Using the  Table
 T = ABC ABCDAB ABCDABCDABDE
 P = ABCDABD
  = (0,0,0,0,1,2,0)
 Start matching at the first position of T:
 Mismatchatthe4thletterofP!
Knuth-Morris-Pratt (KMP) Algorithm 12 
Using the  Table
 Wematchedk=3letterssofar,and[k]=0
 Thus, there is no point in starting the comparison at T2, T3
(crucial observation)
 ShiftP byk[k]=3letters
 Mismatch at T4 again!
Knuth-Morris-Pratt (KMP) Algorithm 13 
 Mismatch at T11!
Using the  Table
 We matched k = 0 letters so far
 ShiftP byk[k]=1letter(wedefine[0]=1)
Knuth-Morris-Pratt (KMP) Algorithm 14 
Using the  Table
 [6]=2meansP1P2 isasuffixofP1P6
 ShiftP by6[6]=4letters
 Again, no point in shifting P by 1, 2, or 3 letters Knuth-Morris-Pratt (KMP) Algorithm 15 
 Mismatch at T11 again!
 Currently 2 letters are matched  ShiftP by2[2]=2letters
Using the  Table
Knuth-Morris-Pratt (KMP) Algorithm 16 
 Mismatch at T11 yet again!
 Currently no letters are matched  ShiftP by0[0]=1letter
Using the  Table
Knuth-Morris-Pratt (KMP) Algorithm 17 
 Mismatch at T18
 Currently 6 letters are matched  ShiftP by6[6]=4letters
Using the  Table
Knuth-Morris-Pratt (KMP) Algorithm 18 
Using the  Table  Finally, there it is!
 Currently all 7 letters are matched
 After recording this match (at T16 . . . T22, we shift P again in order to find other matches
 Shiftby7[7]=7letters
Knuth-Morris-Pratt (KMP) Algorithm 19 
Computing 
 Observation 1: if P1 P[i] is a suffix of P1 Pi, then P1 P[i]1 is a suffix of P1 Pi1
 Well, obviously
 Observation 2: all the prefixes of P that are a suffix of P1 . . . Pi can be obtained by recursively applying  to i
 e.g., P1 P[i], P1 ,P[[i]], P1 ,P[[[i]]] are all suffixes of P1 Pi
Knuth-Morris-Pratt (KMP) Algorithm 20 
Computing 
 A non-obvious conclusion:
 First, lets write (k)[i] as [] applied k times to i
 e.g., (2)[i] = [[i]]
 [i] is equal to (k)[i  1] + 1, where k is the smallest integer
that satisfies P(k)[i1]+1 = Pi
 Ifthereisnosuchk,[i]=0
 Intuition: we look at all the prefixes of P that are suffixes of P1 . . . Pi1, and find the longest one whose next letter matches Pi
Knuth-Morris-Pratt (KMP) Algorithm 21 
Implementation
pi[0] = -1;
int k = -1;
for(int i = 1; i <= m; i++) {while(k >= 0 && P[k+1] != P[i])
k = pi[k];
pi[i] = ++k; }
Knuth-Morris-Pratt (KMP) Algorithm 22 
Pattern Matching Implementation
int k = 0;
for(int i = 1; i <= n; i++) {while(k >= 0 && P[k+1] != T[i])
k = pi[k];
k++;
if(k == m) {
// P matches T[i-m+1..i]
k = pi[k]; }
}
Knuth-Morris-Pratt (KMP) Algorithm 23 
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Suffix Trie 24 
Suffix Trie
 Suffix trie of a string T is a rooted tree that stores all the suffixes (thus all the substrings)
 Each node corresponds to some substring of T
 Each edge is associated with an alphabet
 For each node that corresponds to ax, there is a special pointer called suffix link that leads to the node corresponding to x
 Surprisingly easy to implement!
Suffix Trie 25 
Suffix Trie
26
Example
(Figure modified from Ukkonens original paper) 
Incremental Construction
 Given the suffix tree for T1 Tn
 Then we append Tn+1 = a to T , creating necessary nodes
 StartatnodeucorrespondingtoT1Tn  Create an a-transition to a new node v
 Take the suffix link at u to go to u, corresponding to T2 Tn
 Create an a-transition to a new node v  Create a suffix link from v to v
Suffix Trie 27 
Suffix Trie
28
Incremental Construction
 Repeat the previous process:
 Take the suffix link at the current node
 Make a new a-transition there
 Create the suffix link from the previous node
 Stop if the node already has an a-transition
 Because from this point, all nodes that are reachable via suffix links already have an a-transition 
Suffix Trie
29
Construction Example
Given the suffix trie for aba We want to add a new letter c 
Suffix Trie
30
Construction Example 
Suffix Trie
31
Construction Example 
Suffix Trie
32
Construction Example 
Suffix Trie
33
Construction Example 
Suffix Trie
34
Construction Example 
Construction Example
 Construction time is linear in the tree size  But the tree size can be quadratic in n
 e.g., T = aaabbb
Suffix Trie 35 
Construction Example
 To find P, start at the root and keep following edges labeled with P1, P2, etc.
 Got stuck? Then P doesnt exist in T
Suffix Trie 36 
Outline
String Matching Problem
Hash Table
Knuth-Morris-Pratt (KMP) Algorithm Suffix Trie
Suffix Array
Suffix Array 37 
Suffix Array
38
Suffix Array 
Suffix Array
39
If you want to see how it works, read the paper on the course website http://cs97si.stanford.edu/suffix-array.pdf
Suffix Array
 Memory usage is O(n)
 Has the same computational power as suffix trie  Can be constructed in O(n) time (!)
 But its hard to implement
 ThereisanapproachableO(nlog2n)algorithm 
Notes on String Problems
 Always be aware of the null-terminators
 Simple hash works so well in many problems
 If a problem involves rotations of some string, consider concatenating it with itself and see if it helps
 Stanford team notebook has implementations of suffix arrays and the KMP matcher
Suffix Array 40 

![[SOLVED]  algorithm String Algorithms](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip.jpg)

![[Solved] Modularized Body Mass Index (BMI) Program in Python](https://assignmentchef.com/wp-content/uploads/2022/08/downloadzip-1200x1200.jpg)
 
 
 
Reviews
There are no reviews yet.