Knowing the internal nodes of a suffix tree is helpful, since they let you solve problems like finding the longest repeated substring.

Suffix trees are hard to construct on the spot (think of a whiteboard interview), so people have told me to look into suffix arrays.

I have a two-part question:

1. Can you create a suffix array without building a suffix tree first? From what I have seen, most implementations build the tree and then traverse it to create the suffix array.

2. Given a suffix array, how can you identify the internal nodes?

ApathyBear

1 Answer

(In my opinion this would be an exceptionally hard question for a whiteboard interview...)

To answer part 1: yes, it is possible (and usual) to construct the suffix array directly.
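As a whiteboard-sized illustration of "directly", here is a minimal sketch that just sorts the suffix start positions. It is not taken from the links in this answer; the function name `build_suffix_array_naive` is made up for the example, and the approach is the naive one (O(n log n) comparisons, each costing up to O(n) characters), so it only suits short strings.

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical helper: returns the start indices of all suffixes of s,
// ordered so that the suffixes are in lexicographic order.
std::vector<int> build_suffix_array_naive(const std::string& s) {
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        // Compare suffix s[a..] with suffix s[b..] without copying either one.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

For "banana" the sorted suffixes are a, ana, anana, banana, na, nana, so the result is 5 3 1 0 4 2. No tree is built at any point.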

This link to stanford.edu gives a short O(n log^2 n) algorithm that is simple to implement:

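#include <algorithm>
#include <cstdio>
#include <cstring>
using namespace std;

#define MAXN 65536
#define MAXLG 17

char A[MAXN];
struct entry {
  int nr[2], p;  // nr: ranks of the two halves being combined; p: suffix start index
} L[MAXN];
// P[k][i] is the rank of suffix i when sorted by its first 2^k characters.
// One extra row: stp can reach MAXLG when N is close to MAXN.
int P[MAXLG + 1][MAXN], N, i, stp, cnt;

bool cmp(const entry &a, const entry &b)
{
  return a.nr[0] == b.nr[0] ? a.nr[1] < b.nr[1] : a.nr[0] < b.nr[0];
}

int main(void)
{
  // gets() is no longer part of the standard library; fgets() bounds the read.
  if (!fgets(A, MAXN, stdin)) return 1;
  A[strcspn(A, "\n")] = '\0';  // drop the trailing newline, if any
  N = strlen(A);
  for (i = 0; i < N; i++)
    P[0][i] = A[i] - 'a';
  for (stp = 1, cnt = 1; cnt >> 1 < N; stp++, cnt <<= 1) {
    for (i = 0; i < N; i++) {
      L[i].nr[0] = P[stp - 1][i];
      L[i].nr[1] = i + cnt < N ? P[stp - 1][i + cnt] : -1;
      L[i].p = i;
    }
    sort(L, L + N, cmp);
    // Suffixes with equal rank pairs share a rank; otherwise the rank is the sorted position.
    for (i = 0; i < N; i++)
      P[stp][L[i].p] = i > 0 && L[i].nr[0] == L[i - 1].nr[0] && L[i].nr[1] == L[i - 1].nr[1]
                           ? P[stp][L[i - 1].p] : i;
  }
  // After the last pass, L[i].p is the start of the i-th smallest suffix.
  for (i = 0; i < N; i++)
    printf("%d ", L[i].p);
  printf("\n");
  return 0;
}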

This PDF also discusses how to use suffix arrays in practical examples.

Alternatively, this 2005 paper "Linear Work Suffix Array Construction" gives an O(n) approach for constructing suffix arrays in 50 lines of code.

In my experiments on a string of length 100k, I found a suffix tree (using Ukkonen's O(n) algorithm) to take 16 seconds, the O(n log^2 n) suffix array to take 2.4 seconds, and the O(n) suffix array to take 0.5 seconds.
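On part 2, here is a rough sketch rather than anything taken from the links above. The usual companion to a suffix array is the LCP array, where `lcp[i]` is the length of the longest common prefix of the suffixes at `sa[i-1]` and `sa[i]`. Assuming the string ends in a unique end marker, every positive `lcp[i]` is the string depth of an internal node of the suffix tree (the lowest common ancestor of two neighbouring leaves), so the deepest internal node, and hence the longest repeated substring, is the maximum LCP value. The function names below are hypothetical, and the LCP computation is the straightforward quadratic one, not a linear-time algorithm.

```cpp
#include <cassert>
#include <string>
#include <vector>

// lcp[i] = longest common prefix length of the suffixes starting at
// sa[i-1] and sa[i]; lcp[0] is 0 by convention. Naive: O(n^2) worst case.
std::vector<int> lcp_array(const std::string& s, const std::vector<int>& sa) {
    assert(sa.size() == s.size());  // sa must be a suffix array of s
    std::vector<int> lcp(sa.size(), 0);
    for (std::size_t i = 1; i < sa.size(); ++i) {
        std::size_t a = sa[i - 1], b = sa[i], k = 0;
        while (a + k < s.size() && b + k < s.size() && s[a + k] == s[b + k]) ++k;
        lcp[i] = static_cast<int>(k);
    }
    return lcp;
}

// The largest lcp value is the depth of the deepest internal node,
// i.e. the length of the longest substring that occurs at least twice.
std::string longest_repeated_substring(const std::string& s,
                                       const std::vector<int>& sa) {
    std::vector<int> lcp = lcp_array(s, sa);
    if (lcp.empty()) return "";
    std::size_t best = 0;
    for (std::size_t i = 1; i < lcp.size(); ++i)
        if (lcp[i] > lcp[best]) best = i;
    return s.substr(sa[best], lcp[best]);
}
```

For "banana" with suffix array 5 3 1 0 4 2, the LCP array is 0 1 3 0 0 2 and the longest repeated substring is "ana". The positive values 1, 3 and 2 are the depths of the internal nodes for "a", "ana" and "na".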

Peter de Rivaz
  • Great, great links. I am going to enjoy reading this (it was just what I was looking for). I agree, this would be a bit ridiculous to expect on a white board interview. How did you find these? Have you attended Stanford? – ApathyBear Nov 30 '15 at 22:54
  • No, I've never been to Stanford. I just had a problem that I needed to use suffix trees to solve and used Google as far as I remember. (I'm afraid I kept notes on the algorithms I used and their speeds, but not on how I found them in the first place.) – Peter de Rivaz Dec 01 '15 at 09:20