Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called the alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T := abc. Then abc, bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct non-empty suffixes (as many as there are characters in it), because every position in T is the beginning of exactly one suffix.

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T := abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < ... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x
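The sorted list above can be reproduced with a few lines of Python. This is only a minimal sketch for illustration: the function name `sorted_suffixes` is made up here, and it relies on Python's built-in string comparison, which matches the 'natural' ordering for lowercase ASCII letters.

```python
def sorted_suffixes(text):
    """Return all non-empty suffixes of text in lexicographic order."""
    # One suffix per starting position: text[0:], text[1:], ..., text[n-1:]
    suffixes = [text[i:] for i in range(len(text))]
    # Python compares strings character by character (by code point),
    # which agrees with a < b < ... < z for lowercase ASCII input.
    return sorted(suffixes)


for suffix in sorted_suffixes("abcabx"):
    print(suffix)
```

Storing every suffix as its own string takes quadratic space, so this form is useful only for understanding the definition on short inputs.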

Implementation

The suffixes themselves are usually not stored explicitly in memory. Instead, the suffix array is represented as a list of integers, each giving the starting position of one suffix in T.

a b c a b x
0 1 2 3 4 5

Example: Given T as defined above, and assuming its positions are numbered from 0 to 5, the suffix array is represented as the list [0,3,1,4,2,5].
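As a rough sketch of the integer representation, the starting positions can be sorted by the suffix each one begins. The name `build_suffix_array` is hypothetical, and this is the naive approach rather than one of the efficient construction algorithms: each comparison may scan a whole suffix, and the slicing makes temporary copies.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of text,
    ordered by the lexicographic order of those suffixes."""
    # Sort the positions 0..n-1, using the suffix starting at each
    # position as the sort key. Only the integers are kept.
    return sorted(range(len(text)), key=lambda i: text[i:])


print(build_suffix_array("abcabx"))
```

For T = abcabx this yields the same list as in the example, with positions numbered from 0.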

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings, and text compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementations
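
To illustrate the full-text search use case named in the list above: because the suffixes are sorted, all suffixes that start with a given pattern sit next to each other in the suffix array, so they can be located with two binary searches. The following is a minimal sketch under that assumption. The function names are hypothetical, and the suffix array is built naively here just to keep the example self-contained.

```python
def build_suffix_array(text):
    """Naive construction: sort starting positions by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return all starting positions of pattern in text, in ascending order."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in suffix_array[first:lo] starts with pattern.
    return sorted(suffix_array[first:lo])


text = "abcabx"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ab"))
```

Each binary search makes O(log n) comparisons of at most m characters, where n is the text length and m the pattern length; the construction step is what dominates in this naive version.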
154 questions
2
votes
1 answer

Understanding the algorithm for pattern matching using an LCP array

Foreword: My question is mainly an algorithmic question, so even if you are not familiar with suffix and LCP arrays you can probably help me. In this paper it is described how to efficiently use suffix and LCP arrays for string pattern matching. I…
Paddre
  • 798
  • 1
  • 9
  • 19
2
votes
2 answers

What is a Suffix Automaton?

Can someone please explain to me what exactly is a suffix automaton, and how it works and differs from suffix trees and suffix arrays? I have already tried searching on the web but was not able to come across any clear comprehensive explanation. I…
KayEs
  • 135
  • 8
2
votes
1 answer

Longest Common Substring using Suffix Automata

I used to calculate longest common Substring using dynamic programming O(m * n), suffix tree O(m + n), suffix array O(nlog^2 n) according to my need. Recently I have learnt Suffix Automaton which performs in O(n) which is very impressive. I can…
Kaidul
  • 15,409
  • 15
  • 81
  • 150
2
votes
0 answers

Latest research on suffix arrays vs suffix trees

I've been trying to ascertain whether suffix trees or suffix arrays (including their variants) are more space efficient (amongst other properties as given below), but I seem to be coming up with different viewpoints depending on where I look. This…
2
votes
1 answer

suffix array using manber myers algorithm

I have tried going through the theory in the paper http://webglimpse.net/pubs/suffix.pdf But I am kind of lost when they say Let Ai be the first suffix in the first bucket (i.e., Pos[0] = i), and consider Ai-h (if i-h < 0, then we ignore Ai and take…
self_noted
  • 119
  • 1
  • 9
2
votes
0 answers

Can we use suffix tree to count numbers of distinct subsequence?

Can we use suffix tree to count numbers of distinct subsequence (rather than substring)? Definition: A subsequence of a string is a new string which is formed from the original string by deleting some of the characters without disturbing the…
Eric H.
  • 341
  • 2
  • 7
  • 18
2
votes
1 answer

Longest Common Prefixes

Suppose I constructed a suffix array, i.e. an array of integers giving the starting positions of all suffixes of a string in lexicographical order. Example: For a string str=abcabbca, the suffix array is: suffixArray[] = [7 3 0 4 5 1 6…
Ritesh Kumar Gupta
  • 5,055
  • 7
  • 45
  • 71
1
vote
2 answers

string pattern match,the suffix array can solve this or have more solution?

I have a string that is randomly generated from a special set of characters (B,C,D,F,X,Z), for example to generate the following string list: B D Z Z Z C D C Z B D C B Z Z Z D X D B Z F Z B D C C Z B D C F Z .......... I also have a pattern list, that is to match…
zhengchun
  • 1,261
  • 13
  • 19
1
vote
2 answers

String similarity in c

For two strings A and B, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is…
agasthyan
  • 725
  • 3
  • 8
  • 17
1
vote
1 answer

Efficient way to collect all unique substrings of a string

I need to identify all substrings in a string with a minimum size and repeats. The caveat is that I don't want substrings returned that are themselves substrings of other returned substrings. In other words the set of substrings needs to be a…
Eric
  • 1,381
  • 9
  • 24
1
vote
1 answer

Suffix Array, n^2*log(n) faster than n*log^2(n) even for large inputs?

I learned this theory in class and decided to implement everything to make sure I understood it all; but the theoretical results don't align with the memory and time usages I obtain. I'm pretty sure this is not a shortcoming of theoretical…
1
vote
2 answers

Implementations for Pattern/String mining using Suffix Arrays/Trees

I am trying to solve a pattern mining problem for strings and I think that suffix trees or arrays might be a good option to solve this problem. I will quickly outline the problem: I have a set strings of different lengths (quotation are just to mark…
Pearson
  • 109
  • 1
  • 1
  • 9
1
vote
2 answers

How to find the longest common substring of n strings using suffix array?

I could do longest common substring using two strings each time. But consider 3 strings below: ABZDCC ABZDEC EFGHIC Here we see that the lcs of the first two strings is ABZD. But when this will be compared to the third string, the length of lcs…
1
vote
2 answers

Naive suffix array optimisation c++

Do you have an idea how to optimize the following function while still using std::sort(). It sorts the suffixes of a text to create a suffix array. I think the problem is in the compare function as not much more can be done for the rest. compare is…
5kobrat
  • 45
  • 9
1
vote
1 answer

Find maximum binary number with rotations, exceed time limit

Passed some test cases, but after submission, the time limit exceeded. How to optimize solution to reduce time complexity? A large binary number is represented by a string A of size N and comprises of 0s and 1s. You must perform a cyclic shift on…
pepega
  • 25
  • 8