Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called the alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T := abc. Then abc, bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct suffixes (as many as there are characters in it), because any character is the beginning of exactly one suffix.

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T:=abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x
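
The listing above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `sorted_suffixes` is made up for illustration, and it relies on Python's built-in string comparison, which matches the 'natural' alphabetic ordering for lowercase ASCII letters.

```python
def sorted_suffixes(text):
    """Return all non-empty suffixes of text in lexicographic order."""
    # One suffix per starting position 0 .. len(text) - 1.
    suffixes = [text[i:] for i in range(len(text))]
    return sorted(suffixes)


for suffix in sorted_suffixes("abcabx"):
    print(suffix)
```

Storing every suffix explicitly takes quadratic space, so this form is only useful for small examples.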

Implementation

The suffix array is usually not explicitly stored in memory. Instead it is represented as a list of integers, each representing the starting position of a suffix.

T:         a b c a b x
position:  0 1 2 3 4 5

Example: Given T as defined above, and assuming its positions are numbered from 0 to 5, the suffix array is represented as the list [0,3,1,4,2,5].
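
As a rough illustration of the integer representation, the following Python sketch sorts the starting positions by the suffix each one denotes. The name `suffix_array` is an assumption made for this example. The approach is deliberately naive: comparing whole suffixes makes it far slower than the specialised construction algorithms that questions under this tag usually discuss.

```python
def suffix_array(text):
    """Return the starting positions of all suffixes of text, in sorted suffix order."""
    # Naive construction: sort positions using the suffix starting there as the key.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = suffix_array("abcabx")
print(sa)  # [0, 3, 1, 4, 2, 5]
```

Only the list of integers needs to be kept; each suffix can be recovered on demand as `text[i:]`.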

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings and text-compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementations
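
To give a feel for the full-text search use case mentioned in the list above, here is a hedged Python sketch that finds all occurrences of a pattern with two binary searches over the suffix array. The function names `suffix_array` and `find_occurrences` are assumptions made for this example, and the construction step is the naive one, not an efficient algorithm.

```python
def suffix_array(text):
    """Naive construction: sort starting positions by the suffix they denote."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern in text, given text's suffix array sa."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with pattern.
    return sorted(sa[start:lo])


text = "abcabx"
print(find_occurrences(text, suffix_array(text), "ab"))  # [0, 3]
```

Because the suffixes are sorted, all suffixes that start with the pattern are adjacent in the array, which is why two binary searches are enough to delimit them.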
154 questions
0
votes
0 answers

fast regex matching in list

Regular Expressions: Search in list I want to have a fast way to match a regex in a list. The above solution has a linear complexity wrt the length of the list. Given things like suffix-array or suffix-tree for pure string search, is there something…
user1424739
  • 11,937
  • 17
  • 63
  • 152
0
votes
1 answer

Longest common substring via suffix array: uses of sentinel

I am reading about the (apparently) well known problem of the longest common substring in a series of strings, and have been following these two videos which talk about how to solve the problem using suffix arrays: (note that this question doesn't…
Wad
  • 1,454
  • 1
  • 16
  • 33
0
votes
1 answer

Find all occurrences using binary search in a suffix array

I was wondering if there is an implemented way to get all the occurrences of a given substring and the suffix array. I was testing a function which I found in here: https://hg.python.org/cpython/file/2.7/Lib/bisect.py with some modifications. What…
Joe Smith
  • 83
  • 1
  • 6
0
votes
1 answer

compressed suffix array in python

Is there an implementation of the compressed suffix array Psi in Python? I understand how suffix arrays work and how to get Psi given a suffix array, but is there a way to get this using Python? I was searching if there was some library or…
Steve Jade
  • 131
  • 3
  • 12
0
votes
1 answer

Why do I get an EXC_BAD_ACCESS error when I try to create two suffix arrays?

My task involves finding the longest common substring in two txt files using suffix arrays. I have done the following: #include #include #include #include int main() { char* charArrayA =…
Sebastian
  • 11
  • 1
0
votes
1 answer

Why does the Suffix Array use less space than the Suffix Tree?

I'm researching about Suffix Array and Suffix Tree for my project. In several papers such as : "Suffix arrays: A new method for on-line string searches" by Manber and Myers - 1993. "Simple Linear Work Suffix Array Construction" by Juha Karkkainen…
Tín Tr.
  • 319
  • 4
  • 14
0
votes
1 answer

Longest Suffix-Prefix Overlap Algorithm

I have been tasked with identifying an efficient algorithm [O(n*log(n))] that, given a set of k Strings S = {s-1, s-2, s-3, ..., s-k}, will identify the longest substring T for each pair of strings (s-i, s-j), such that T is a suffix of s-i and a…
0
votes
1 answer

What is the advantage of Suffix tree over suffix array?

I have been studying tries, suffix arrays and suffix trees. I know these data structures can be used for fast lookup and for many more applications. Now my question is: if the suffix array is space-efficient and easy to implement, then what are the…
0
votes
0 answers

Decide if any suffix of a string is a prefix of another string

How can I check if any suffix of a string is a prefix of another string, preferably in constant time, if I already built a suffix array of the first string? Example: string 1: abc string 2: bcda bc is a substring of first string, and a prefix…
0
votes
0 answers

sum of LCP of all pairs of substrings of a given string

How can one find the sum of the lengths of the longest common prefixes of all pairs of substrings of a given string? For example, the answer for the string "aba" is 8. |s| <= 1e5.
0
votes
0 answers

Suffix array & Binary Search

I have been following a tutorial I found. It is however in C++ and I'm using Java so there might have been a few things lost in translation. I've tried both googling and searching here and while there seem to be plenty of asked questions I still…
Nick Z
  • 1
0
votes
2 answers

Longest repeated substring with at least k occurrences correctness

The algorithm for finding the longest repeated substring is formulated as follows: 1) build the suffix tree; 2) find the deepest internal node with at least k leaf children. But I cannot understand why this works, so basically what makes this…
0
votes
0 answers

Suffix Array and Suffix Tree

What are the shortcomings of a suffix tree compared with a suffix array, and vice versa? Why do we need two different structures? Is there any instance where a suffix array fails but a suffix tree succeeds, and vice versa?
user6250837
  • 458
  • 2
  • 21
0
votes
2 answers

How do we Construct LCP-LR array from LCP array?

To find the number of occurrences of a given string P (length m) in a text T (length N), we must use binary search against the suffix array of T. The issue with using standard binary search (without the LCP information) is that in each of the…
pavaniiitn
  • 311
  • 1
  • 4
  • 18
0
votes
0 answers

Suffix Array sort function

Here I am trying to use the built-in sort function. 's' is the array which stores the indices. I am trying to sort this array according to the suffix strings. If I use the qsort() function it works fine. But as soon as sort() is used, it simply sorts the s array,…