Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called the alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T := abc. Then abc, bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct suffixes (as many as there are characters in it), because every position in T is the start of exactly one suffix.
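
The definition and remark above can be checked with a few lines of Python. This is only an illustrative sketch; the function name suffixes is an assumption introduced here, not part of any library.

```python
def suffixes(t: str) -> list:
    """Return all non-empty suffixes of t, ordered by starting position."""
    # One suffix per starting position i, each running to the end of t.
    return [t[i:] for i in range(len(t))]


print(suffixes("abc"))          # ['abc', 'bc', 'c']
print(len(suffixes("abcabx")))  # 6, one suffix per character
```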

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T:=abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x

Implementation

The suffixes themselves are usually not stored explicitly in memory. Instead, the suffix array is represented as a list of integers, each giving the starting position of one suffix.

a b c a b x
0 1 2 3 4 5

Example: Given T as defined above, with its positions numbered from 0 to 5, the suffix array is represented as the list [0,3,1,4,2,5].
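
As a rough illustration of the integer representation, the following Python sketch builds the list by sorting starting positions according to the suffix that begins at each one. The name build_suffix_array is hypothetical, and this naive approach is far slower than the dedicated construction algorithms used for very large strings. It relies on Python's default string comparison, which matches the alphabetic ordering assumed above for lowercase letters.

```python
def build_suffix_array(t: str) -> list:
    """Naive construction: sort starting positions by the suffix at each one."""
    # Each comparison may scan a whole suffix, so this is only suitable
    # for small inputs and for checking faster implementations.
    return sorted(range(len(t)), key=lambda i: t[i:])


t = "abcabx"
sa = build_suffix_array(t)
print(sa)  # [0, 3, 1, 4, 2, 5]

# The sorted suffixes can be recovered from the positions on demand.
for i in sa:
    print(i, t[i:])
```

Storing positions instead of the suffixes keeps memory use proportional to n, whereas the suffixes written out in full take space quadratic in n.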

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings and text compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementations
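
To make the full-text search use case mentioned above concrete, here is a minimal Python sketch that finds every occurrence of a pattern by binary search over the suffix array. The function names are assumptions introduced for this example, and the naive construction is included only to keep the snippet self-contained.

```python
def build_suffix_array(t: str) -> list:
    """Naive construction, for illustration only."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def find_occurrences(t: str, sa: list, p: str) -> list:
    """Return the sorted starting positions of pattern p in text t."""
    m = len(p)
    # Lower bound: first suffix whose first m characters are >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose first m characters are > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes in sa[start:lo] begin with p.
    return sorted(sa[start:lo])


t = "abcabx"
sa = build_suffix_array(t)
print(find_occurrences(t, sa, "ab"))  # [0, 3]
```

Because the suffixes are sorted, all suffixes that start with the pattern form one contiguous block of the array, so two binary searches are enough to locate it.
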
154 questions
7 votes, 1 answer

Using suffix array algorithm for Burrows Wheeler transform

I've successfully implemented a BWT stage (using regular string sorting) for a compression testbed I'm writing. I can apply the BWT and then inverse BWT transform and the output matches the input. Now I wanted to speed up creation of the BW index…
Bim
7 votes, 3 answers

How to Modify a Suffix Array to search multiple strings?

I've recently been updating my knowledge of algorithms and have been reading up on suffix arrays. Every text I've read has defined them as an array of suffixes over a single search string, but some articles have mentioned it's 'trivial' to generalize…
swestrup
6 votes, 1 answer

Find longest common substring of multiple strings using factor oracle enhanced with LRS array

Can we use a factor-oracle with suffix link (paper here) to compute the longest common substring of multiple strings? Here, substring means any contiguous part of the original string. For example, "abc" is a substring of "ffabcgg", while "abg" is not. I've…
Ray
5 votes, 4 answers

Suffix array beginning using scala

Today I am trying to create suffix arrays using Scala. I was able to do it with many lines of code, but then I heard that it can be done in only a few lines by using zipping and sorting. The problem I have at the moment is with the…
Duzzz
5 votes, 2 answers

Efficient suffix array algorithm in c#

Does anyone have any suggestions about where I can find a C# implementation for suffix arrays? I'd prefer not to reinvent the wheel...
code4life
4 votes, 2 answers

Good pedagogical resources on suffix arrays

I simply cannot find any good pedagogical resource explaining suffix arrays. Even the "bible" doesn't cover it. Where can I find a clear and thorough explanation of suffix arrays and their uses? (A video course would be ideal, because I'm lazy.)
Randomblue
4 votes, 1 answer

How to find some string matching a given suffix array?

I have a suffix array. How can I get a string whose suffix array is equal to the given array? For example, suppose I have this array: [7, 6, 4, 2, 1, 5, 3]. Then the string banana$ is good for me, since get_suffix_array(banana$) == [7, 6, 4, 2, 1, 5,…
David
4 votes, 1 answer

How does this code for obtaining LCP from a Suffix Array work?

Can someone explain how this code for constructing the LCP from a suffix array works? suffixArr[] is an array such that suffixArr[i] holds the value of the index in the string for the suffix with rank i. void LCPconstruct() { int…
Dhruv Mullick
4 votes, 2 answers

Suffix Array Implementation Bugs

I've coded a Suffix Array implementation and discovered an issue in it. Concretely, I've output the first few suffix array ranks RA[0..7] of this string (length = 10^5) and had the following…
Gary Ye
4 votes, 2 answers

Suffix array DC3 algorithm

I am going over the DC3 algorithm, the linear time algorithm for construction of suffix arrays. I am unable to understand a technique in the paper which can be found here. I am unable to understand how the renaming, mentioned on page 6 of the paper,…
Fluvid
4 votes, 1 answer

Finding a set of repeated, non-overlapping substrings of two input strings using suffix arrays

Input: two strings A and B. Output: a set of repeated, non-overlapping substrings. I have to find all the repeated strings, each of which has to occur in both(!) strings at least once. So for instance, let A = "xyabcxeeeyabczeee" and B =…
kafka
4 votes, 4 answers

Minimum Lexicographic Rotation Using Suffix Array

Consider a string of length n (1 <= n <= 100000). Determine its minimum lexicographic rotation. For example, the rotations of the string “alabala” are: alabala labalaa abalaal balaala alaalab laalaba …
Ritesh Kumar Gupta
4 votes, 1 answer

Udi Manber and Gene Myers Method

I have a suffix array SA, and an array L that stores the length of the LCP (longest common prefix) between two consecutive suffixes, i.e. L[i]=LCP(SA[i-1],SA[i]) where 1<=i<=|SA|. It's also described here. How should I use this array L to find the…
Ritesh Kumar Gupta
3 votes, 3 answers

Complete Suffix Array

A suffix array will index all the suffixes for a given list of strings, but what if you're trying to index all the possible unique substrings? I'm a bit new at this, so here's an example of what I mean: Given the string abcd, a suffix array indexes…
Arjun
3 votes, 2 answers

Longest common substring via suffix array: do we really need unique sentinels?

I am reading about LCP arrays and their use, in conjunction with suffix arrays, in solving the "Longest common substring" problem. This video states that the sentinels used to separate individual strings must be unique, and not be contained in any…
Wad