Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

Question

I don't want a direct solution to the problem that's the source of this question but it's this one link:

So I take in the strings and add them to a suffix array which is implemented as a sorted set internally, what I obtain then is a lexicographically sorted list of the two given strings.

S1 = "banana"
S2 = "panama"

SuffixArray.add S1, S2

To make searching for the k-th smallest substring efficient I preprocess this sorted set to add in information about the longest common prefix between a suffix and it's predecessor as well as keeping tabs on a cumulative substrings count. So I know that for a given k greater than the cumulative substrings count of the last item, it's an invalid query.

This works really well for small inputs as well as random large inputs of the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass the 4 out of 7 cases and was pretty surprised I didn't get them all.

So I went searching for the bottleneck and it hit me. Given large number of inputs like these

anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb

The queries for k-th smallest substrings are still fast as expected but not the way I preprocess the sorted set... The way I calculate the longest common prefix between the elements of the set is not efficient and linear O(m), like this, I did the most naïve thing expecting it to be good enough:

m = anananan
n = anananana

Start at 0 and find the point where `m[i] != n[i]`

It is like this because a suffix and his predecessor might no be related (i.e. coming from different input strings) and so I thought I couldn't help but using brute force.

Here is the question then and where I ended up reducing the problem as. Given a list of lexicographically sorted suffix like in the manner I described above (made up of multiple strings):

What is an efficient way of computing the longest common prefix array?.

The subquestion would then be, am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.

Foot note, I do not want to be shown implemented algorithm and I don't mind to be told to go read so and so book or resource on the subject as that is what I do anyway while attempting these challenges.

Accepted answer will be something that guides me on the right path or in the case that that fails; something that teaches me how to solve these types of problem in a broader sense, a book or something

score 2 · Accepted Answer · answered Jan 11 '13 at 20:02

2

READING

I would recommend this tutorial pdf from Stanford.

This tutorial explains a simple O(nlog^2n) algorithm with O(nlogn) space to compute suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(logn).

HINTS

If you wish to try to develop the algorithm yourself, the key is to sort the strings based on their 2^k long prefixes.

From the tutorial:

Let's denote by A(i,k) be the subsequence of A of length 2^k starting at position i. The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).

and

Using matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k had been found. We only have left to update i and j, increasing them both by 2^k and check again if there are any more common prefixes.

answered Jan 11 '13 at 20:02

Peter de Rivaz

33,126
4
46
75

This pdf is really great, the link was visited :D. Can you comment on the requirement of the suffixes coming from different source strings and if it does indeed cover that case instead of (quoted from the pdf) "Given two suffixes of a string A...". Can't wrap my head around that for some reason. – Hugo Jan 11 '13 at 20:29
1

A simple way of handling different source strings is explained in the description of problem 6. Basically add a unique end character to each string, concatenate them into a single long string, and use the standard suffix array algorithm. I think this should work in your case, although it may well be tricky to get the details correct! – Peter de Rivaz Jan 11 '13 at 20:45
This is an amazing prospect and one that I pondered today while thinking about it, but I had convinced myself that it would not work. I'll do a pen and paper observation to see what It would look like. – Hugo Jan 11 '13 at 20:53
Can't express how grateful I am to report that I got 7/7 test cases passing. You were right concatenating the strings with a unique character (I took $) gave me a way to generate the longest common prefix array efficiently. It was tricky after that getting the details correct but possible and in constant time. – Hugo Jan 12 '13 at 03:42
Actually you have invented an even better way than I was suggesting. I had meant you to use a different unique character for each string (i.e. #,$,@) - well done for getting it to work with a single unique character! – Peter de Rivaz Jan 12 '13 at 09:15
Well it couldn't be helped since I would received 1 < n < 50 strings as input and would have needed to invent a new one for each. I won't disclose how I have solved it here but if you want to see the reasoning I can shoot you an email. Thanks again Peter! – Hugo Jan 12 '13 at 17:08

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

1 Answers1

READING

HINTS