Algorithms
Topics:
String Searching

# String Searching

String Searching by KMP algorithm (Knuth Morris Pratt algorithm)

Motivation Problem: Given $2$ strings $P$(pattern) and $T$(text), find the number of occurrences of $P$ in $T$.

Basic / Brute Force solution:

One obvious and easy to code solution that comes to mind is this: For each index of $T$, take it as a starting point and find if $T_{i,i+1,...,i+|P|-1}$ is equal to $P$.

for i = 0 to length(T)-length(P):

Found = true

for j = i to i + length(P) - 1:

if P[j-i] not equal to T[j]:

Found = False

if Found = True



This brute force takes $O(|P| \cdot |T|)$ time in the worst case, which is obviously too slow for large strings.

Knuth Morris Pratt Algorithm:

Suppose for each index $i$ of some string $Z$, the longest suffix in $Z_{0,1,...,i}$ that is also a prefix of $Z_{0,1,...,i}$, be known. Formally, a length $F_i$ is known such that $Z_{0,1,...,F_i-1}$ = $Z_{i-F_i+1,...,i}$. Let these lengths be stored in array $F$. The suffix needs to be proper(whole string is not a proper suffix).

Then the solution to the motivation problem can be found as follows:

Define a string $V = P + '#' + T$, where $'#'$ is a delimiter that is not present in either of $P$ or $T$. Now, if the above information is known, all occurrences of $P$ in $T$ can be found as follows: If at some index $i$, $F_i = |P|$, then there is an occurrence of Pattern $P$ at position $i-|P|+1$. All such indices from $|P|+1$ [0 based indexing, the index just after '#'], need to be checked.

The main part of KMP algorithm calculates the array $F$, which is also called the prefix function. If calculation of $F$ or the prefix function can be done efficiently, then we have an efficient solution to the motivation problem. KMP algorithm finds the prefix function in $O(length of String)$ time.

To find the prefix function, best possible use of previous values of array $F$ is made, so that calculations aren't done again and again. Suppose all $F_i$ have been calculated, and now $F_{i+1}$ is to be calculated. It is to be noted that, value of $F_{i+1}$ can be at most 1 greater than $F_i$. Here is a proof by contradiction:

Suppose $F_{i+1} > F_i + 1$. Now, if the $(i+1)^{th}$ character is removed, we obtain a suffix ending at index $i$ that is of length $F_{i+1} - 1$, which is greater than $F_i$. This is a contradiction, hence proved.

Observe that if $Z_{i+1} = Z_{F_i}$, then the value of $F_{i+1} = F_i + 1$. If not, a smaller suffix ending at index $i$ is to be found, that is also a prefix of $Z_{0,1...,i}$. Let the length of such a suffix be $j$, then if $Z_{i+1} = Z_{j}$ then $F_{i+1} = j + 1$. If again the equality doesn't hold true, smaller and smaller suffixes that end at index $i$, which are also prefixes of $Z_{0,1...,i}$ need to be found.

The only thing remaining is, how to find the length of next smaller suffix ending at index $i$, that is also a prefix? This is also pretty simple. Observe that due to the property of $F$, the segment $Z_{0,1,...,F_i-1}$ is equal to the segment $Z_{i-F_i+1,...,i}$. So to find the next smaller suffix ending at index $i$, the longest suffix ending at $F_i - 1$ can be found which is $F_{F_i-1}$, and this suffix will be the next smaller suffix ending at index $i$. If this suffix also doesn't satisfy our criteria, then smaller suffixes can be found with the same process, here it will be $F_{F_{F_i-1} - 1}$. Note that, if at some point the length becomes $0$, the process is stopped.

This completes $KMP$ algorithm. Below is the code:

vector<int> prefix_function (string Z) {

int n = (int) Z.length();

vector<int> F (n);

F=0;

for (int i=1; i<n; ++i) {

int j = F[i-1];

while (j > 0 && Z[i] != Z[j])

j = F[j-1];

if (Z[i] == Z[j])  ++j;

F[i] = j;

}

return F;

}


Finding the $F$ array for "ABCABC"

Initially, $F_0 = 0$.

Index $1 \rightarrow F_{0} = 0$, $j$ does not go into while loop and $Z_j \neq Z_i$, therefore value of $F_i = 0$.

Index $2 \rightarrow F_{1} = 0$, $j$ does not go into while loop and $Z_j \neq Z_i$, therefore value of $F_i = 0$.

Index $3 \rightarrow F_{2} = 0$, $j$ does not go into while loop and $Z_j = Z_i$, therefore value of $F_i = 1$.

Index $4 \rightarrow F_{3} = 1$, $j$ satisfies while loop condition but $Z_j = Z_i$, hence does not go into while loop, therefore value of $F_i = 2$.

Index $5 \rightarrow F_{4} = 2$, $j$ satisfies while loop condition but $Z_j = Z_i$, therefore value of $F_i = 3$.

Contributed by: Rishi Vikram