Suffix Arrays

Motivation Problem: Given a string $$S$$, find the longest substring that occurs at least $$M$$ times.

Brute Force method: For every substring $$X$$ of $$S$$, one can find all the occurrences of $$X$$ in $$S$$ with $$KMP$$. There are $$O(N^2)$$ substrings and each $$KMP$$ run takes $$O(N)$$ time, so the total time for this brute force method is $$O(N^3)$$.
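As a reference point, the brute force idea can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and Python's built-in `str.find` stands in for $$KMP$$ when counting occurrences, so the running time is not guaranteed to match the bound above.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, overlapping ones included."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)  # move one step so overlaps are seen
    return count


def longest_repeated_brute(s, m):
    """Longest substring of s that occurs at least m times ('' if none)."""
    n = len(s)
    # Try lengths from longest to shortest, so the first hit is the answer.
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            candidate = s[start:start + length]
            if count_occurrences(s, candidate) >= m:
                return candidate
    return ""
```

Such a slow version is mostly useful for checking faster solutions on small inputs.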

A faster solution using hashing: We can binary search on the length of the substring; this works because if some substring of length $$X$$ occurs at least $$M$$ times, so does its prefix of length $$X-1$$. For a candidate length $$X$$ in the binary search, the hashes of all substrings of length $$X$$ can be computed in $$O(N)$$ total time with a rolling hash. While doing this, the hashes can be stored in a dictionary, and once all substrings of length $$X$$ are processed, we check whether the most frequent hash has frequency greater than or equal to $$M$$. This takes $$O(N (log(N))^2)$$ time, where one log factor comes from the binary search and the other from maintaining the dictionary (map in C++).
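A minimal Python sketch of the hashing approach follows. The modulus `MOD`, the base `BASE` and both function names are assumptions chosen for this example, not part of the tutorial. Two caveats: equal hashes are treated as equal substrings, so a hash collision could in principle give a wrong answer, and Python's `Counter` is a hash table rather than an ordered map, so the extra log factor of a C++ map does not appear here.

```python
from collections import Counter

MOD = (1 << 61) - 1   # assumed large prime modulus
BASE = 911382323      # assumed base for the polynomial hash


def has_repeat(s, length, m):
    """True if some length-`length` substring hash occurs at least m times."""
    n = len(s)
    if length < 1 or length > n:
        return False
    high = pow(BASE, length - 1, MOD)  # weight of the character leaving the window
    h = 0
    for ch in s[:length]:
        h = (h * BASE + ord(ch)) % MOD
    counts = Counter([h])
    for i in range(length, n):
        # Slide the window: drop s[i - length], append s[i].
        h = ((h - ord(s[i - length]) * high) * BASE + ord(s[i])) % MOD
        counts[h] += 1
    return max(counts.values()) >= m


def longest_repeated_length(s, m):
    """Length of the longest substring occurring at least m times (0 if none)."""
    lo, hi = 0, len(s)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(s, mid, m):
            lo = mid        # length mid is feasible, try longer
        else:
            hi = mid - 1    # length mid is infeasible, try shorter
    return lo
```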

A solution using Suffix Array:

A Suffix Array is a sorted array of the suffixes of a string. Only the starting indices of the suffixes are stored in the array, not the suffixes themselves. For example, the Suffix Array of "banana" (with 0-based indices) would look like this:

$$5 \rightarrow $$ $$a$$

$$3 \rightarrow $$ $$ana$$

$$1 \rightarrow $$ $$anana$$

$$0 \rightarrow $$ $$banana$$

$$4 \rightarrow $$ $$na$$

$$2 \rightarrow $$ $$nana$$

One naive way to make the suffix array would be to store all suffixes in an array and sort them. If we use an $$O(N log(N))$$ comparison based sorting algorithm, then the total time to make the suffix array would be $$O(N^2 log N)$$, because string comparison takes $$O(N)$$ time. This is too slow for large strings.
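The naive construction is short enough to write out directly. The sketch below assumes 0-based indices, as in the "banana" example, and the function name is made up for illustration.

```python
def naive_suffix_array(s):
    """Sort suffix start indices by comparing the suffixes themselves."""
    # Each key s[i:] is a full copy of the suffix, which is what makes this slow.
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For "banana" this gives `[5, 3, 1, 0, 4, 2]`, the same order as the listing above.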

Below is shown an $$O(N (log N)^2)$$ algorithm that constructs the suffix array. There is an $$O(N log N)$$ algorithm and even an $$O(N)$$ algorithm to construct suffix array, but in a programming contest environment, it is much easier to implement an $$O(N (log N)^2)$$ algorithm. Also the difference between an $$O(N (log N)^2)$$ and $$O(N log N)$$ algorithm is scarcely noticeable for strings up to length $$10^5$$.

The algorithm is based on keeping the ranks of suffixes when the suffixes are sorted by their first $$2^k$$ characters in the $$k^{th}$$ step. Therefore we will execute $$O(log N)$$ steps to completely build the suffix array.

Clearly, the comparison of $$2$$ suffixes is the step to optimise: it has to be done in better than $$O(N)$$ time. In fact it can be done in $$O(1)$$ time, by exploiting the fact that the strings being sorted are all suffixes of the same string.

Suppose that an order relation between the suffixes has been obtained when they are sorted by their first $$2^k$$ characters, that is, $$k$$ steps of the algorithm have been done. To obtain the order relation in the $$(k+1)^{th}$$ step, we want to reuse the order relations from the previous step as much as possible. Suppose that in the $$(k+1)^{th}$$ step we need to compare the $$2$$ suffixes starting at indices $$i$$ and $$j$$. Let us denote the rank of the $$y^{th}$$ suffix after $$x$$ steps by $$P_{xy}$$.

Observation: A string of length $$2^{k+1}$$ can be broken down into $$2$$ strings of length $$2^k$$. If $$P_{ki} < P_{kj}$$, then $$P_{(k+1)i} < P_{(k+1)j}$$, and likewise if $$P_{ki} > P_{kj}$$, then $$P_{(k+1)i} > P_{(k+1)j}$$. If $$P_{ki} = P_{kj}$$, then the first $$2^k$$ characters of the suffixes starting at indices $$i$$ and $$j$$ are the same, so the relation between $$P_{(k+1)i}$$ and $$P_{(k+1)j}$$ is decided by the next $$2^k$$ characters, that is, by comparing $$P_{k(i+2^k)}$$ and $$P_{k(j+2^k)}$$. If $$P_{k(i+2^k)}$$ and $$P_{k(j+2^k)}$$ are also equal, then we assign the same rank to both suffixes.

Therefore at step $$(k+1)$$, to compare $$2$$ suffixes in $$O(1)$$ time, a tuple of $$2$$ integers can be stored for each suffix. Let us name the suffix $$suf$$ and its index be $$i$$. First integer of tuple that will be stored for $$suf$$ would be $$P_{ki}$$, that is the rank of $$suf$$ when it was sorted by first $$2^k$$ characters. Second integer of tuple that will be stored would be $$P_{k(i+2^k)}$$, that is the rank of suffix starting at index $$(i+2^k)$$, when it was sorted by the first $$2^k$$ characters. This tuple is enough to compare $$2$$ suffixes in $$O(1)$$ time as shown above.

It is possible that $$(i+2^k)$$ exceeds the string length. In that case a negative number can be assigned to the second integer of the tuple of $$suf$$, so that lexicographic order is maintained. The importance of the negative number can be understood as follows: take $$2$$ suffixes that have the same rank according to their first $$2^k$$ characters, where the first suffix has length at least $$2^{k+1}$$ and the second has length exactly $$2^k$$, so that nothing follows its first half. The second suffix is then a prefix of the first, so it must come before the first in lexicographical order. A negative second integer is smaller than every real rank, which places it there.
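To make the tuple idea concrete, here is a small Python sketch of the pair stored for one suffix. It assumes 0-based indices, takes the ranks after the previous step as a plain list, and uses `-1` as the negative marker; `pair_key` and its parameter names are hypothetical.

```python
def pair_key(rank, i, half):
    """Tuple for suffix i, given ranks by the first `half` characters."""
    n = len(rank)
    # The second half starts at i + half; -1 marks a suffix that ends before it.
    second = rank[i + half] if i + half < n else -1
    return (rank[i], second)
```

For "banana", ranking by the first character with `ord(c) - ord('a')` gives `[1, 0, 13, 0, 13, 0]`. With `half = 1`, suffixes 1 and 3 (both starting with "an") get the same tuple `(0, 13)`, while suffix 5 (just "a") gets `(0, -1)` and therefore sorts before them. Since Python compares tuples lexicographically, sorting indices by `pair_key` orders the suffixes by their first two characters.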

Here is some pseudo code to construct suffix array.

SA = [] // Suffix Array

P = [][] // P[i][j] denotes rank of suffix at position 'j' when all suffixes are sorted by their first '2^i' characters

str = [] // initial string, 1 based indexing

POWER = [] //array of powers of 2, POWER[i] denotes 2^i

tuple {
    first, second, index;
}

L = [] // Array of Tuples

N = length of str

for i = 1 to N:
    P[0][i] = str[i] - 'a' // Give initial rank when suffixes are sorted by their first 2^0 = 1 character.

step = 1

for i = 1; POWER[i-1]<N; i++, step++:
    for j = 1 to N:
        L[j].index = j
        L[j].first = P[i-1][j]
        L[j].second = (j+POWER[i-1]<=N ? P[i-1][j+POWER[i-1]] : -1)

    sort(L)

    for j = 1 to N:
        P[i][L[j].index] = ((j>1 and L[j].first==L[j-1].first and L[j].second==L[j-1].second) ? P[i][L[j-1].index] : j) 
        /*Assign same rank to suffixes which have same number in the first and second fields of their respective tuples.*/

step = step - 1

Now at the $$step^{th}$$ row of matrix $$P$$, we have the ranks of all suffixes. Now we can get the suffix array very easily in $$O(N)$$.

for i = 1 to N:
    SA[P[step][i]] = i

Note: Care must be taken when string length is $$1$$, in that case if the string is "c", then it will get a rank of ('c'-'a') that is $$2$$ because we will not enter the for loop. In this case you can manually put the rank as $$1$$, that is P[0][1]=1 instead of P[0][1]=str[1]-'a'.
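The construction can also be sketched as runnable Python. This version is an illustration under a few assumptions that differ from the pseudo code: indices are 0-based, the initial ranks are raw character codes from `ord`, and the final suffix array is obtained by sorting indices by their last rank instead of by direct assignment. The last choice costs an extra $$O(N log N)$$ but avoids the special case for strings of length $$1$$. The name `build_suffix_array` is made up for this example.

```python
def build_suffix_array(s):
    """Return (P, sa); P[k][i] is the rank of suffix i by its first 2**k characters."""
    n = len(s)
    if n == 0:
        return [[]], []
    P = [[ord(c) for c in s]]   # step 0: rank by the first character
    half = 1                    # number of characters the latest ranks cover
    while half < n:
        prev = P[-1]
        # (first, second) tuple per suffix; -1 when the second half is past the end.
        keys = [(prev[i], prev[i + half] if i + half < n else -1) for i in range(n)]
        order = sorted(range(n), key=lambda i: keys[i])
        cur = [0] * n
        for pos, i in enumerate(order):
            same = pos > 0 and keys[i] == keys[order[pos - 1]]
            # Equal tuples share a rank; otherwise the sorted position is the rank.
            cur[i] = cur[order[pos - 1]] if same else pos
        P.append(cur)
        half *= 2
    sa = sorted(range(n), key=lambda i: P[-1][i])
    return P, sa
```

For "banana" the ranks after the first doubling step are `[3, 1, 4, 1, 4, 0]`, and the resulting suffix array is `[5, 3, 1, 0, 4, 2]`.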

Often it is required to find the Longest Common Prefix (LCP) of $$2$$ suffixes. This can be done easily in $$O(log N)$$ time by using the array $$P$$. The following fact is used to find the $$LCP$$ of $$2$$ suffixes starting at indices $$i$$ and $$j$$: If P[x][i]==P[x][j], then first $$2^x$$ characters starting at indices $$i$$ and $$j$$ are same. Below is the pseudo code:

LCP(i,j): //returns the length of LCP of suffixes starting at indices i and j
    if i==j:
        return N-i+1
    return_value=0
    for x = step down to 0:
        if i<=N and j<=N and P[x][i]==P[x][j]: // skip once either suffix has been used up
            return_value = return_value + POWER[x]
            i = i + POWER[x]
            j = j + POWER[x]
    return return_value
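Below is a hedged Python sketch of the same query. It assumes 0-based indices and a rank table `P` in which row `k` holds the ranks by the first $$2^k$$ characters; the helper `build_ranks` is a compact, hypothetical way to produce such a table so that the example runs on its own. The check `i < n and j < n` guards against reading past the end of the string after a jump.

```python
def build_ranks(s):
    """Rank table P: P[k][i] ranks suffix i by its first 2**k characters."""
    n = len(s)
    P = [[ord(c) for c in s]]
    half = 1
    while half < n:
        prev = P[-1]
        keys = [(prev[i], prev[i + half] if i + half < n else -1) for i in range(n)]
        order = sorted(range(n), key=lambda i: keys[i])
        cur = [0] * n
        for pos, i in enumerate(order):
            same = pos > 0 and keys[i] == keys[order[pos - 1]]
            cur[i] = cur[order[pos - 1]] if same else pos
        P.append(cur)
        half *= 2
    return P


def lcp(P, i, j):
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n = len(P[0])
    if i == j:
        return n - i
    length = 0
    for k in range(len(P) - 1, -1, -1):       # largest block size first
        if i < n and j < n and P[k][i] == P[k][j]:
            length += 1 << k                  # the next 2**k characters match
            i += 1 << k
            j += 1 << k
    return length
```

For "banana", `lcp(P, 1, 3)` compares "anana" with "ana" and returns 3.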

Now let us return to the original problem: finding the longest substring that occurs at least $$M$$ times.

First build the suffix array of string $$S$$. If, in the sorted array of suffixes, the $$LCP$$ of $$2$$ suffixes is $$K$$, then every suffix between these $$2$$ suffixes has the same prefix of length $$K$$. Let the positions of these $$2$$ suffixes in the sorted array be $$i$$ and $$j$$ $$(i<j)$$; then that substring of length $$K$$ occurs at least $$(j-i+1)$$ times in $$S$$.

To solve the motivation problem, one can iterate over the positions $$i$$ of the sorted array from $$1$$ to $$(N-M+1)$$, and find the $$LCP$$ of the suffix at position $$i$$ and the suffix at position $$(i+M-1)$$. This common prefix occurs at least $$M$$ times, and the answer is the maximum of all these $$LCPs$$. Building the suffix array takes $$O(N (log N)^2)$$ time and each of the $$LCP$$ queries takes $$O(log N)$$, so the overall time complexity is $$O(N (log N)^2)$$.

Pseudo code:

build Suffix Array

ans = 0

for i = 1 to (N-M+1):
    ans=max(ans, LCP(SA[i],SA[i+M-1]))
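The window scan itself can be tried out with the short Python sketch below. To keep it self-contained, it deliberately swaps in two simple stand-ins: the sorted order comes from Python's built-in sort on the suffixes, and the $$LCP$$ is found by comparing characters directly. It therefore does not reach the $$O(N (log N)^2)$$ bound; the rank matrix $$P$$ and the $$LCP$$ routine described earlier would replace these helpers in a real solution. Indices are 0-based and all names are made up for illustration.

```python
def common_prefix_length(a, b):
    """Number of leading characters that a and b share."""
    length = 0
    for x, y in zip(a, b):
        if x != y:
            break
        length += 1
    return length


def longest_substring_with_m_occurrences(s, m):
    """Longest substring of s occurring at least m times ('' if none)."""
    n = len(s)
    if m < 1 or m > n:
        return ""
    # Stand-in for the suffix array: sort start indices by the suffix text.
    sa = sorted(range(n), key=lambda i: s[i:])
    best_len, best_start = 0, 0
    for r in range(n - m + 1):
        i, j = sa[r], sa[r + m - 1]   # first and last suffix of a window of m
        length = common_prefix_length(s[i:], s[j:])
        if length > best_len:
            best_len, best_start = length, i
    return s[best_start:best_start + best_len]
```

For "banana" and $$M = 2$$ the window over "ana" and "anana" gives the answer "ana".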

Contributed by: Rishi Vikram
