3

Given a suffix array, a TopCoder task from SRM 630 asks to find the minium number of distinct characters in the string that could form a string with the given suffix array. The full problem statement can be found on the TopCoder website.

The best solution I found is right here: https://github.com/ftiasch/acm-icpc/blob/6db1ed02a727611830b974a1d4de38bab8f390f9/topcoder/single-round-match/single-round-match-630/SuffixArrayDiv1.java

Here is the algorithm written by ftiasch:

public int minimalCharacters(int[] array) {
    int n = array.length;
    int[] position = new int[n + 1];
    for (int i = 0; i < n; ++i) {
        position[array[i]] = i;
    }
    position[n] = -1;
    int[] minimum = new int[n + 1];
    for (int i = n - 1; i >= 0; --i) {
        minimum[i] = Integer.MAX_VALUE;
        for (int j = i + 1; j <= n; ++j) {
            boolean valid = true;
            for (int x = i; x < j; ++x) {
                for (int y = x + 1; y < j; ++y) {
                    valid &= position[array[x] + 1] < position[array[y] + 1];
                }
            }
            if (valid && minimum[j] < Integer.MAX_VALUE) {
                minimum[i] = Math.min(minimum[i], minimum[j] + 1);
            }
        }
    }
    return minimum[0];
}

I understand that this is a dynamic programming algorithm but how does it work? I really need a hand understanding it.

EDIT

Here is what ftiasch wrote me back:

hi Ariel,

First of all, thanks to your compliment. Frankly speaking, my solution is not the best solution to the problem. The optimal one runs in O(n) time but mine takes O(n^4). I just picked this idea during the contest because n is relatively small.

Keep in mind that same characters become continuous in the SA. Since the problem asked for the least number of characters, so I decided to use dynamic programming to partition the SA into consecutive segments so that each segments start with the same character.

Which condition is necessary for S[SA[i]] == S[SA[j]] assumed that i < j? The lexicographic comparison suggests that suffix(SA[i] + 1) should be smaller than suffix(SA[j] + 1). We can easily find that the condition is also sufficient.

Write to me if you have any other question. :)

EDIT1

We finally managed to make it work, thanks to David. Here is the linear time algorithm in java from David's Python version:

public int minimalCharacters(int[] array) {
    int n = array.length, i;
    if (n == 0)
        return 0;
    int[] array1 = new int[n + 1];
    for (i = 0; i < n; i++)
        array1[1 + i] = array[i];
    int[] position = new int[n + 1];
    for (i = 0; i < n + 1; i++)
        position[array1[i]] = i;
    int k = 1;
    for (i = n; i > 1; i--) {
        if (position[array1[i] + 1] <= position[array1[i - 1] + 1])
            k++;
    }
    return k;
}
Ariel
  • 1,222
  • 2
  • 14
  • 25
  • Needs a better question title. – ggovan Sep 04 '14 at 23:36
  • Ya like "Need someone to explain step by step and line by line how this code works". – takendarkk Sep 05 '14 at 00:10
  • 2
    @Takendarkk This seems to be more of a conceptional question: Why does this algorithm solve the problem at hand? To me it's not immediately obvious what the idea behind it is even though I understand exactly what each line does individually – Niklas B. Sep 05 '14 at 00:13
  • @NiklasB. That's exactly my thought. Even though I understand every single line of code, I do not understand how can that algorithm solve the problem. – Ariel Sep 05 '14 at 08:31

1 Answers1

4

Here’s a quadratic-time algorithm. The suffix array specifies for each pair of suffixes how they compare lexicographically (and the empty suffix always is less than all of them). Let s be the unknown string and suppose that we’re comparing suffix s[i...] with suffix s[j...]. If s[i] != s[j], then the comparison of s[i] and s[j] settles it. Otherwise, the result is the same as if we compare s[i+1...] and s[j+1...].

Suppose that we wish to ensure that s[i...] < s[j...]. Clearly we need s[i] <= s[j]. In fact, unless s[i+1...] < s[j+1...], we need the strict inequality s[i] < s[j], as otherwise the tiebreaker will go the wrong way. Otherwise, s[i] == s[j] will suffice regardless of the rest of the string. Gather up all of the inequalities as arcs in a directed graph with vertices corresponding to positions in s. This graph is necessarily acyclic by the total order on suffixes. Make each arc length 1 if the inequality is strict and length 0 otherwise. Output the length of the longest path, plus one (or zero if the graph is empty).

At least this many distinct letters are needed, by the corresponding chain of inequalities. What’s perhaps less clear is that this many distinct letters suffices, but if we determine the label of each vertex/position in s by the length of the longest path starting there, then the head and tail of each arc are labeled appropriately.

To get down to linear time, we can exploit the structure of the graph. It’s straightforward (though not trivial; the graph is metric after all) to show that the path visiting all vertices of the graph is the longest, so we merely have to compute its length.

Below are a transliterated version of the sample code (minChars1), an implementation straight from the description above (minChars2, now stripped of all comprehension usage), a brute force solution (minChars3), and the linear-time solution (minChars4). import itertools

def minChars1(array):
    n = len(array)
    position = [-1] * (n + 1)
    for i in range(n):
        position[array[i]] = i
    infinity = n + 1
    minimum = [infinity] * (n + 1)
    minimum[n] = 0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            valid = True
            for x in range(i, j):
                for y in range(x + 1, j):
                    valid = valid and position[array[x] + 1] < position[array[y] + 1]
            if valid and minimum[j] < infinity:
                minimum[i] = min(minimum[i], minimum[j] + 1)
    return minimum[0]


def lengthOfLongestPath(graph, memo, i):
    if i not in memo:
        result = 0
        for w, j in graph[i]:
            result = max(result, w + lengthOfLongestPath(graph, memo, j))
        memo[i] = result
    return memo[i]


def minChars2(array):
    n = len(array)
    position = [-1] * (n + 1)
    for i in range(n):
        position[array[i]] = i
    graph = {}
    for i in range(n):
        graph[i] = []
        for j in range(n):
            if position[i] > position[j]:
                w = 0 if position[i + 1] > position[j + 1] else 1
                graph[i].append((w, j))
    memo = {None: -1}
    for i in range(n):
        lengthOfLongestPath(graph, memo, i)
    return max(memo.values()) + 1


def minChars3(array):
    n = len(array)
    position = [None] * n
    for i in range(n):
        position[array[i]] = i
    for k in range(n):
        for s in itertools.product(range(k), repeat=n):
            valid = True
            for i in range(n):
                for j in range(n):
                    valid = valid and (s[i:] < s[j:]) == (position[i] < position[j])
            if valid:
                return k
    return n


def minChars4(array):
    n = len(array)
    if n == 0:
        return 0
    array1 = [n] * (n + 1)
    for i in range(n):
        array1[1 + i] = array[i]
    position = [None] * (n + 1)
    for i in range(n + 1):
        position[array1[i]] = i
    k = 1
    for i in range(n, 1, -1):
        if position[array1[i] + 1] <= position[array1[i - 1] + 1]:
            k += 1
    return k


def test():
    for n in range(7):
        for array in itertools.permutations(range(n)):
            assert minChars1(array) == minChars2(array) == minChars3(array) == minChars4(array)


test()
David Eisenstat
  • 64,237
  • 7
  • 60
  • 120
  • I know how to calculate the path with maximum weight but the way you build the graph is not so clear. Can you please rewrite this LOC in a programming language, preferably java? `graph = {i: [(int(position[i + 1] > position[j + 1]), j) for j in range(n) if position[i] < position[j]] for i in range(n)}` – Ariel Sep 05 '14 at 09:51
  • 1
    @Ariel I replaced the comprehension with a loop and, while testing the new code, found that both previous versions were incorrect. I added a brute force version to compare against. – David Eisenstat Sep 05 '14 at 15:03
  • 1
    @Ariel I added a description and code for the linear-time solution. – David Eisenstat Sep 05 '14 at 15:48
  • Do you use one based index arrays in your algorithms? – Ariel Sep 05 '14 at 21:02
  • @Ariel Nope, it's all zero-based. I shift everything up so that rank 0 can be the empty suffix. – David Eisenstat Sep 05 '14 at 21:03
  • Try your algorithm with `int[] SA = { 0, 1, 2, 3 }` and `int expected = 2`. The result is `1` and should be `2`. – Ariel Sep 05 '14 at 21:47
  • @Ariel The Python versions all return 2 for me. – David Eisenstat Sep 05 '14 at 21:49
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/60730/discussion-between-ariel-and-david-eisenstat). – Ariel Sep 05 '14 at 22:05
  • 1
    @Ariel It turns out that I was interpreting the input as the inverse permutation. Now everything is fixed, and all four versions agree on all small inputs. – David Eisenstat Sep 05 '14 at 23:30