Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called the alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T := abc. Then abc, bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct suffixes (as many as there are characters in it), because any character is the beginning of exactly one suffix.
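The definition and remark above can be checked with a few lines of Python. This is only an illustrative sketch; the function name suffixes is made up for this example and positions are counted from 0.

```python
def suffixes(t: str) -> list[str]:
    """Return all non-empty suffixes of t, one per starting position."""
    # The suffix starting at position i runs to the end of the string.
    return [t[i:] for i in range(len(t))]


print(suffixes("abc"))          # ['abc', 'bc', 'c']
print(len(suffixes("abcabx")))  # 6, one suffix per character
```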

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T:=abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x
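As a rough sketch, the sorted list above can be reproduced by sorting the suffixes directly. The helper name sorted_suffixes is an assumption made for this example, and the code relies on Python's built-in string comparison, which agrees with the 'natural' ordering for lowercase ASCII letters.

```python
def sorted_suffixes(t: str) -> list[str]:
    """Return all suffixes of t in lexicographic order (naive approach)."""
    # Python compares strings character by character (by code point),
    # which matches a < b < ... < z for lowercase ASCII input.
    return sorted(t[i:] for i in range(len(t)))


for suffix in sorted_suffixes("abcabx"):
    print(suffix)
```

This naive version materializes every suffix, so it is only suitable for short strings.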

Implementation

The suffixes themselves are usually not stored explicitly in memory. Instead, the suffix array is represented as a list of integers, each giving the starting position of one suffix.

a b c a b x
0 1 2 3 4 5

Example: Given T as defined above, and assuming its positions are numbered from 0 to 5, the suffix array is represented as the list [0,3,1,4,2,5].
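A minimal sketch of the integer representation, assuming 0-based positions: instead of sorting the suffix strings, sort the starting positions and compare them by the suffix that begins there. The function name suffix_array is chosen for this example only, and this is a naive construction, not one of the efficient algorithms discussed under this tag.

```python
def suffix_array(t: str) -> list[int]:
    """Return the starting positions of the suffixes of t in sorted order."""
    # Sort positions 0..n-1, comparing each by the suffix that starts there.
    # Each comparison may scan up to n characters, so this is far from optimal.
    return sorted(range(len(t)), key=lambda i: t[i:])


print(suffix_array("abcabx"))  # [0, 3, 1, 4, 2, 5]
```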

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings and text compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementations
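To illustrate the full-text search use mentioned in the list above, here is a hedged sketch of substring search by binary search over a suffix array. The function names build_suffix_array and find_occurrences are hypothetical, and the naive construction is used only to keep the example self-contained.

```python
def build_suffix_array(t: str) -> list[int]:
    """Naive construction: sort starting positions by their suffixes."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def find_occurrences(t: str, sa: list[int], p: str) -> list[int]:
    """Return all start positions of pattern p in t, in increasing order."""
    m = len(p)
    # Lower bound: first suffix whose first m characters are >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose first m characters are > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    # All suffixes in sa[start:lo] begin with p.
    return sorted(sa[start:lo])


text = "abcabx"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ab"))  # [0, 3]
```

Because the suffixes are sorted, all suffixes that start with the pattern form one contiguous range of the array, which is why two binary searches are enough.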
154 questions
1
vote
0 answers

String Indexing and Suffix Trees

I have to build some kind of a "string catalogue" out of large PDF documents for faster string/substring searches. The mechanism should work like this: A PDF scanner scans the PDF document for strings and invokes a callback-method in my catalogue to…
Hasib Samad
  • 1,081
  • 1
  • 20
  • 39
1
vote
2 answers

Why does this example use null padding in string comparisons? “Programming Pearls”: Strings of Pearls

In "Programming Pearls": Strings of Pearls, section 15.3 (Generating Text), the author introduces how to generate random text from an input document. In the source code, there are some things that I don't understand. for (i = 0; i < k; i++) …
Fihop
  • 3,127
  • 9
  • 42
  • 65
0
votes
1 answer

Practical implementation of suffix array

Looking for a practical implementation of suffix arrays, I came across this paper. It outlines a O(n (log n * log n)) approach, where n is the length of the string. While there are faster algorithms available, IMO, none is suitable in a programming…
Abhijit Sarkar
  • 21,927
  • 20
  • 110
  • 219
0
votes
0 answers

How to solve longest already present substring in O(n)?

Given a string a I need to find for every position i in a the length of the longest substring b such that it starts in position i and was already present in a, which means that there exists i'
quicker
  • 1
  • 1
0
votes
0 answers

DC3/Skew Suffix Array Algorithm doesn't work for specific cases

When applying the DC3/Skew algorithm to the string yabadabado, I can't quite get it to sort correctly. This issue happens in other cases, but this is a short example to show it. This first table is for reference: These are the triples of R12 We…
Brother58697
  • 2,290
  • 2
  • 4
  • 12
0
votes
0 answers

Segmentation Fault in Suffix Array Implementation

When I run the following code, written to implement the suffix array algorithm, I get a segmentation fault. Can anyone please help me solve it? I think the issue is with the while loop, since cout works properly in all other places. Following is the…
Shakya Peiris
  • 504
  • 5
  • 11
0
votes
1 answer

What is the Big O notation of a program?

I am trying to determine the algorithmic complexity of this program: import java.io.BufferedReader; import java.io.IOException; import java.io.InputStreamReader; public class SuffixArray { private String[] text; private int length; …
0
votes
3 answers

I want to define (create) and use different variable (using suffix) through loop (specially for loop)

I want to create multiple variable through for loop to further use and compare in the program. Here is the code - for i in range(0,len(Header_list)): (f'len_{i} = {len(Header_list[i]) + 2 }') print(len_0);print(f'len{i}') for company in…
0
votes
1 answer

linear time algorithm for finding most frequent m-letter substring in a string

Suppose we have a n letter string and we are searching for most repeated m letter substring (1=
sonia
  • 25
  • 4
0
votes
0 answers

Most repeated substring in a string in C

Due to our problem, I need to find the most repeated substring in a string. The way I followed for this was as follows: I found the suffixes of the string, then I found the prefixes of the suffixes and assigned them to a matrix array. Finally, I…
Yodax93
  • 21
  • 2
0
votes
1 answer

Which character to append to string in suffix array?

I was solving the problem https://www.spoj.com/problems/BEADS/ at SPOJ. I have stated the relevant information below: Problem Statement: The description of the necklace is a string A = a1a2 ... am specifying sizes of the particular beads, where…
Bully Maguire
  • 211
  • 3
  • 15
0
votes
1 answer

question of skew suffix array algorithm presented in Simple Linear Work Suffix Array Construction

At the end of the paper 'Simple Linear Work Suffix Array Construction', source code is attached. I cannot understand this part: // generate positions of mod 1 and mod 2 suffixes // the "+(n0-n1)" adds a dummy mod 1 suffix if n%3 == 1 for (int i=0, j=0;…
yewei
  • 241
  • 2
  • 9
0
votes
0 answers

Time Complexity of Code (Recursive function for calculating suffix array of a string in sorted (ascending) order)

Algorithm:- s --> given string whose suffix strings are to be sorted (let s = 'ababba') add a special symbol '$' at end of 's', '$' is smallest character. (s = 'ababba$' Represent all suffix strings by integers representing the beginning index of…
user12619063
0
votes
0 answers

what is the time complexity of my suffix array generation code

I am trying to create a suffix array using Python. My program is not passing all of the test cases and shows a timeout error. I think its time complexity is O(N log N log N). Here is my code: def criteria1(s): return s[0] def criteria2(s): …
Mr Sukhe
  • 67
  • 9
0
votes
1 answer

Suffix Array sentinel character lexicographical order

This question is based on this answer by jogojapan. In that answer, he notes that for some suffix tree/suffix array algorithms, just having a unique sentinel character $ is sufficient, while others require $ to either lexicographically compare…
helloworld922
  • 10,801
  • 5
  • 48
  • 85