Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called the alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T:=abc, then abc, bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct non-empty suffixes (as many as there are characters in it), because every position is the beginning of exactly one suffix.
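
As an illustration of the definition and the remark above, here is a minimal Python sketch. The function name suffixes is an assumption made for this example, and positions are numbered from 0, as is conventional in Python.

```python
def suffixes(t):
    """Return all non-empty suffixes of t, longest first (hypothetical helper)."""
    # The suffix starting at position i runs to the end of the string.
    return [t[i:] for i in range(len(t))]


print(suffixes("abc"))  # ['abc', 'bc', 'c']
```

A string of length n yields n suffixes here, one per starting position.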

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T:=abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x

Implementation

The suffixes themselves are usually not stored explicitly in memory. Instead, the suffix array is represented as a list of integers, each giving the starting position of a suffix in T.

a b c a b x
0 1 2 3 4 5

Example: Given T as defined above, and assuming its positions are numbered from 0 to 5, the suffix array is represented as the list [0,3,1,4,2,5].
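
The integer representation can be reproduced with a short Python sketch. This is a naive construction meant only to illustrate the idea, not one of the efficient algorithms the tag is concerned with; the name build_suffix_array is an assumption made for this example.

```python
def build_suffix_array(t):
    """Naive construction: sort the starting positions by the suffix they denote."""
    # Hypothetical helper for illustration. Every sort key is a full copy of a
    # suffix, so time and memory grow quickly; use this on short strings only.
    return sorted(range(len(t)), key=lambda i: t[i:])


sa = build_suffix_array("abcabx")
print(sa)  # [0, 3, 1, 4, 2, 5]
```

Reading the suffixes back in this order (T[0:], T[3:], T[1:], ...) gives the sorted list shown in the earlier example.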

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings and text compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementations
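
For the full-text search use case mentioned in the list above, the following Python sketch shows one common approach: because the suffixes are sorted, all suffixes that begin with a given pattern form one contiguous block of the suffix array, and the block boundaries can be found by binary search. The function names and the naive construction are assumptions made for this example, not a reference implementation.

```python
def build_suffix_array(text):
    # Naive construction for illustration only (hypothetical helper).
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern in text, using suffix array sa."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


text = "abcabx"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ab"))  # [0, 3]
```

Each binary search step compares at most len(pattern) characters, so a query needs roughly len(pattern) * log(len(text)) character comparisons once the suffix array exists.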
154 questions
0
votes
1 answer

Suffix array/suffix tree with numbers

Can suffix trees or suffix arrays be used effectively with numbers? For example: Can they be used with the array [1,2,3,4,5,3,9,8,5,3,9,8,6,4,5,3,9,11,9,8,7,11] to extract all possible non-overlapping repeating substrings of all sizes from the…
0
votes
1 answer

Generalized Suffix Tree Java Implementation For Large Datasets

I have a collection of around 50 million strings, each around 100 characters long. I am looking for a very efficient (in running time and memory usage) generalized suffix tree implementation. I have tried https://github.com/npgall/concurrent-trees but…
0
votes
2 answers

Finding the Kth lexicographically smallest substring of a given string when duplicate substrings are allowed

I want to find the lexicographically Kth smallest substring of a given string when duplicate substrings are allowed. Suppose we are given a string abc, then its substrings in lexicographical order are {a,ab,abc,b,bc,c}. Now suppose we are given K = 3…
maddman
0
votes
3 answers

Nested loops in Ruby

I am trying to count the number of similar prefix beginnings to a string in Ruby, e.g. input "ababaa" should output 11: ababaa = 6, babaa = 0, abaa = 3, baa = 0, aa = 1, a = 1. I have got as far as the code below, using a nested loop to go…
Barris
0
votes
0 answers

Suffix array O(NlogN) implementation

I'm looking into the specific O(NlogN) implementation of a suffix array found at this link: https://sites.google.com/site/indy256/algo/suffix_array I'm able to understand the core concepts, but understanding the implementation in its entirety is a…
Gokul M
0
votes
0 answers

Implementing Suffix Array using bucket sort in Java

Let's say we have a list of suffixes for the string banana: banana, anana, nana, ana, na, a. I was able to put each suffix in a bucket, so that suffixes that start with 'a' are in one bucket and suffixes that start with 'b' are in another bucket. Same…
Ankit Pandoh
0
votes
1 answer

Count the number of occurrences of each substring?

Given a string S, I want to calculate the number of substrings which occur n times (1 <= n <= s.length()). I have done it with a rolling hash, and it can be done using a suffix tree. How can it be solved using a suffix array in complexity O( n^2 )…
rishabh
0
votes
2 answers

How to construct a suffix trie for all the substrings of a string?

I want to build a space-efficient suffix trie for all the substrings of a string. Suppose the length of the string is 5000, then the number of substrings would be approx 25*10^6, and for every node I am storing an array of links of size 26, so total…
0
votes
1 answer

How to implement a suffix array in C

However, it's easy to get the implementation using C++, as there is a built-in sort() function in the algorithm header file. I have gone through both the naive method and the O(nlogn) methods of forming the array. In both cases the sort() function is used…
Kalu
0
votes
1 answer

Searching suffixes using a suffix array

I have constructed a suffix array which is implemented by an ArrayList. I want to use this list to search for a suffix in the suffix array. For this I have sorted the list and used binary search, but the "search" function keeps returning -1. I don't know…
Aneesh K
0
votes
1 answer

Binary search on suffix array

My code calculates the starting position of the interval correctly but not the end position: int left; int bot = 0; int top = textLength; while(bot != top) { int mid = (bot+top)/2; …
tenacious
0
votes
0 answers

Suffix Array Construction Algorithm

Can somebody explain to me how the Suffix Array Construction Algorithm given on the e-maxx.ru page works? I am unable to understand its code. An explanation with a small example would be quite effective. Link: http://e-maxx.ru/algo/suffix_array
0
votes
3 answers

How to sort a suffix array?

I am looking at example suffix arrays and longest common prefixes, but I do not understand the criteria for how the suffix array is sorted. I am looking at the example on Wikipedia where they use banana$. Can someone please explain how a suffix…
user137717
0
votes
1 answer

Fast way to find substring in text using suffix array and lcp

I'm trying to find words which contain a substring (given as input) in a huge text. The text looks like this: *america*python*erica*escape*.. Example: Input: "rica" => Output: america,erica I use a suffix array. My pseudocode (Python-like)…
user3620512
0
votes
1 answer

Number of distinct substrings with given prefix and suffix

Suppose I am given a string S. I need to find the number of distinct substrings of S that have S1 as a prefix and S2 as a suffix. The lengths of S, S1 and S2 can be very large, that is, O(10^5). E.g., suppose S is "abcdcd", S1 is "ab" and S2…