What's the significance of suffixes being sorted in suffix array?

Question

I know that the definition of a suffix array is that it's a sorted array of all the suffixes of a string. But I am trying to understand what the significance of this sorting operation is. Suppose we create an array of all the suffixes of the string, choose not to sort it, and go ahead with the construction of the LCP array. What do we lose in this situation when we try to solve common problems such as Longest Palindromic Substring or Longest Repeated Substring?

I am still in the preliminary phases of understanding this data structure and if the question looks like a result of lack of my basic understanding, I offer my apologies. — discoverAnkit, Jun 14 '14 at 11:34
If you don't sort it, you can't implement any of the algorithms — Niklas B., Jun 14 '14 at 11:37
@Niklas B. Most humbly, the LCP array can still be constructed, right? The basic function of the LCP array is to store the lengths of the longest common prefixes between pairs of consecutive suffixes (which may or may not be sorted). My question is not specific to any construction algorithm for suffix arrays. — discoverAnkit, Jun 14 '14 at 11:47
If you have all the suffixes in no particular order, that's the same as just having the original string (which implicitly contains all its suffixes). The whole point is to have the order. — harold, Jun 14 '14 at 11:49
@Niklas B. So what shall I deduce? The suffixes in the suffix array should be sorted so that LCP construction can be done efficiently? — discoverAnkit, Jun 14 '14 at 11:56
No, but even if you had any algorithm in mind that does not need the sorted order and would work with just the LCP between adjacent neighbors, it would not be efficient. "Suppose we create an array of all the suffixes of the string and choose not to sort it and go ahead with the construction of LCP array" is not a valid course of action because it is slow. — Niklas B., Jun 14 '14 at 12:01
@NiklasB.: In fact the LCP on the unsorted suffixes could be built in O(n) overall. If s[i] != s[i-1] then LCP[i] = 0, otherwise it will be the number of repetitions k (k >= 1) of the character at s[i], and all such entries LCP[i+j] for 0 <= j < k can be "backfilled" in linear time when the first different character is seen at s[i+k]. This doesn't change the fact that such an LCP table would be completely unhelpful, AFAICT ;) — j_random_hacker, Jun 14 '14 at 14:32
@j_random_hacker Yeah, that sounds a bit like the Z algorithm. It's a lot more general though, so I wonder whether it's really that easy? Anyway, the LCP array is not the thing to worry about here — Niklas B., Jun 14 '14 at 15:05

score 8 · Accepted Answer · answered Jun 14 '14 at 17:22

There are two main reasons why you would want to have all the suffixes sorted inside of a suffix array.

First, if S and T are strings, we know the following:

T is a substring of S if and only if it is a prefix of a suffix of S.

For example, if S is "avoidance" and T is "ida," then T is a substring of S because it's a prefix of the suffix "idance." Therefore, applications that require quick queries about substrings of S can be rephrased in terms of searching for prefixes of suffixes of S.

Given this, if you're interested in searching for prefixes of suffixes of S, it makes sense to store those suffixes in a data structure that allows for quick searching. If we put the suffixes in an array and keep them sorted, a binary search can efficiently locate where any given prefix would have to appear. Therefore, having a suffix array be an array of all the suffixes of S stored in sorted order enables quick searches for prefixes of suffixes and therefore for substrings of S.
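To make the search idea concrete, here is a minimal Python sketch. The helper names `build_suffix_array` and `contains` are made up for illustration, and the construction is the naive "sort all suffixes" approach rather than an efficient algorithm; only the binary search is the point.

```python
def build_suffix_array(s):
    # Naive construction (illustration only): sort the start positions
    # by the suffix that begins at each position.
    return sorted(range(len(s)), key=lambda i: s[i:])


def contains(s, sa, t):
    # Binary search for the first suffix whose length-|t| prefix is >= t.
    # This relies on the suffixes being in sorted order.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(t)] < t:
            lo = mid + 1
        else:
            hi = mid
    # t is a substring exactly when it is a prefix of that suffix.
    return lo < len(sa) and s.startswith(t, sa[lo])


s = "avoidance"
sa = build_suffix_array(s)
print(contains(s, sa, "ida"))  # True: "ida" is a prefix of the suffix "idance"
print(contains(s, sa, "ada"))  # False
```

Each query does O(log n) comparisons of at most |T| characters. Without the sorted order the binary search has nothing to rely on, and you are back to scanning every suffix.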

As to your second question about LCP arrays - could you compute them if the suffixes weren't sorted and what would you lose if you did? - you absolutely can compute them for any array, even a non-sorted array of suffixes, so there's no fundamental reason why you couldn't do this. However, the LCP array of the sorted suffix array has a bunch of nice properties that an LCP array of an unsorted suffix array doesn't have. For example, the LCP array in a suffix array can be used to determine the depths of internal nodes in the corresponding suffix tree, or to compute longest common extensions, etc.

One hugely important property of sorted suffix arrays and LCP is that if you compute the LCP of each pair of adjacent suffixes, you can compute the LCP of an arbitrary pair of suffixes by performing a range minimum query over the LCP array. The reason this works is that, in sorted order, every suffix lying between two given suffixes starts with their common prefix, so the LCP of the two suffixes equals the smallest adjacent LCP value between them. This doesn't work in the case where the array is unsorted (I'll mention this at the very end again).
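As a rough illustration of the range-minimum property, the following Python sketch checks it on a small example. All function names here are hypothetical, and the range minimum is computed with a plain `min` over a slice rather than a real RMQ structure, so this shows the property, not the efficient query.

```python
def lcp_len(a, b):
    # Length of the longest common prefix of two strings.
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def sorted_suffixes(s):
    # Naive: materialize and sort every suffix (illustration only).
    return sorted(s[i:] for i in range(len(s)))


def adjacent_lcp(suffixes):
    # lcp[k] is the LCP of suffixes[k] and suffixes[k + 1].
    return [lcp_len(suffixes[k], suffixes[k + 1])
            for k in range(len(suffixes) - 1)]


def lcp_by_range_min(lcp, i, j):
    # LCP of the suffixes at positions i < j, read off the adjacent values.
    # Only valid when the suffixes are in sorted order.
    return min(lcp[i:j])


suffixes = sorted_suffixes("abracadabra$")
lcp = adjacent_lcp(suffixes)
# The range minimum agrees with a direct comparison for every pair.
ok = all(lcp_by_range_min(lcp, i, j) == lcp_len(suffixes[i], suffixes[j])
         for i in range(len(suffixes)) for j in range(i + 1, len(suffixes)))
print(ok)  # True
```

Running the same check on the suffixes in their original text order fails for pairs such as "abracadabra$" and "abra$", whose true LCP is 4 while every adjacent value between them is 0.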

To see specifically where things break down, let's take the longest repeated substring problem. The normal linear-time algorithm for this using suffix arrays is the following:

1. Construct a suffix array for the string T.
2. Construct the LCP array for that suffix array.
3. Iterate across the LCP array and find the suffix whose LCP value is maximum.

It's important to think about why this last step works. Consider any substring that appears at least twice, call it S. Because any substring is a prefix of a suffix, this means that strings Sα and Sβ must be suffixes of the string T for some distinct α and β. If you store the suffix array in sorted order, then all strings beginning with the prefix S will appear consecutively in the suffix array (do you see why?). Therefore, if S is the longest repeated substring, then the first suffix starting with S will have an LCP of length |S| with the next suffix.
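The three steps above can be sketched in Python as follows. This is a simplified illustration, not the linear-time version: it builds the suffix array by naive sorting and computes each adjacent LCP by direct comparison, and the function name is made up for this example.

```python
def longest_repeated_substring(t):
    # Step 1: suffix array via naive sorting of suffix start positions.
    sa = sorted(range(len(t)), key=lambda i: t[i:])

    best_len, best_start = 0, 0
    # Steps 2 and 3: compute the LCP of each adjacent pair in sorted
    # order and keep the maximum.
    for k in range(len(sa) - 1):
        a, b = sa[k], sa[k + 1]
        n = 0
        while a + n < len(t) and b + n < len(t) and t[a + n] == t[b + n]:
            n += 1
        if n > best_len:
            best_len, best_start = n, a
    return t[best_start:best_start + best_len]


print(longest_repeated_substring("abracadabra"))  # abra
print(longest_repeated_substring("banana"))       # ana
```

Only adjacent pairs are ever compared, which is exactly what the sorted order makes sufficient.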

Now, consider what happens if you do this without sorting the array. In that case, if S is the longest repeated substring, the strings Sα and Sβ will still be suffixes of string T. However, they won't necessarily be consecutive in the array, so a linear scan over the adjacent LCP values won't necessarily find them. For example, consider the string

abracadabra

The unsorted suffix array (with the end marker $ appended) is

abracadabra$
bracadabra$
racadabra$
acadabra$
cadabra$
adabra$
dabra$
abra$
bra$
ra$
a$
$

After annotating with LCP information, we get

0 abracadabra$
0 bracadabra$
0 racadabra$
0 acadabra$
0 cadabra$
0 adabra$
0 dabra$
0 abra$
0 bra$
0 ra$
0 a$
  $

So you can see that this algorithm won't find "abra", because the two suffixes that start with "abra" aren't adjacent in the array. You could still conceivably figure out that it was "abra" by trying all pairs, but that's not efficient for large strings.
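A short Python check of this example, under the same assumptions as the listing above (suffixes of "abracadabra$", naive comparison). The helper names are made up for illustration.

```python
def lcp_len(a, b):
    # Length of the longest common prefix of two strings.
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def max_adjacent_lcp(suffixes):
    # Largest LCP between neighbouring entries of the array.
    return max(lcp_len(x, y) for x, y in zip(suffixes, suffixes[1:]))


text = "abracadabra$"
unsorted = [text[i:] for i in range(len(text))]  # text order, as listed above

print(max_adjacent_lcp(unsorted))          # 0: the adjacent scan finds nothing
print(max_adjacent_lcp(sorted(unsorted)))  # 4: "abra" shows up once sorted

# Without sorting, recovering the answer means comparing every pair.
all_pairs = max(lcp_len(unsorted[i], unsorted[j])
                for i in range(len(unsorted))
                for j in range(i + 1, len(unsorted)))
print(all_pairs)                           # 4
```

The all-pairs fallback makes a quadratic number of comparisons, which is the inefficiency the sorted order avoids.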

I mentioned earlier that LCP information about adjacent pairs of strings in sorted suffix arrays can be used to compute LCP information about arbitrary pairs of strings in sorted suffix arrays. This isn't true if the strings are unsorted; above, you can see that the strings all have adjacent pairwise LCP of 0 even though some of the strings certainly do have nonzero common prefix.

Hope this helps!

Thank you so much for the reply. I am new to this data structure so can you please tell me which algorithm should I start with for construction of suffix array and lcp array? An algo with time complexity O(n log n) will do for me cause right now I am not looking for a very complex algorithm. Thanks :) — discoverAnkit, Jun 15 '14 at 07:15
@ankitG I described one [as an answer to this question](http://stackoverflow.com/questions/21220150/rank-the-suffix-of-a-list) — Niklas B., Jun 15 '14 at 09:42
