Questions tagged [suffix-array]

A suffix array is a data structure that represents the lexicographically sorted list of all suffixes of a string (in the computer-science, not the linguistics, sense of the word suffix). It is the basis for many high-performance algorithms performed on very large strings, for example full-text search or compression.

Formal definitions

String: A string is an ordered sequence of symbols, each taken from a pre-defined, finite set. That set is called the alphabet, or character set. The symbols are often referred to as characters.

Suffix: Given a string T of length n, a suffix of T is defined as a substring that starts at any position of T and ends at position n (the end of T).

Example: Let T := abc. Then abc, bc and c are suffixes of T, but a and ab are not.

Remark: Any string T of length n has exactly n distinct non-empty suffixes (as many as there are characters in it), because each position in T is the beginning of exactly one suffix, and suffixes starting at different positions have different lengths.
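The definition and the remark can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `suffixes` is made up for this example and does not come from any library.

```python
def suffixes(t):
    """Return all non-empty suffixes of t, ordered by starting position."""
    return [t[i:] for i in range(len(t))]

print(suffixes("abc"))                 # ['abc', 'bc', 'c']
print(len(set(suffixes("abcabx"))))    # 6, one distinct suffix per position
```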

Suffix array: Given a string T of length n, and a linear ordering on the alphabet, the suffix array of T is the lexicographically sorted list of all suffixes of T.

Example: Let T:=abcabx and assume the 'natural' alphabetic ordering, i.e. a < b < c < d... < x < y < z. Then the suffix array of T is as follows.

abcabx
abx
bcabx
bx
cabx
x
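The list above can be reproduced with a naive sketch that materializes every suffix and sorts them. The function name `sorted_suffixes` is hypothetical. Python compares strings by code point, which coincides with the 'natural' ordering a < b < ... < z as long as the text consists of lowercase ASCII letters.

```python
def sorted_suffixes(t):
    """Return all suffixes of t in lexicographic order (naive, for illustration)."""
    return sorted(t[i:] for i in range(len(t)))

for s in sorted_suffixes("abcabx"):
    print(s)
```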

Implementation

The sorted suffixes themselves are usually not stored explicitly in memory. Instead, the suffix array is represented as a list of integers, each giving the starting position of one suffix.

a b c a b x
0 1 2 3 4 5

Example: Given T as defined above, with its positions numbered from 0 to 5, the suffix array is represented as the list [0, 3, 1, 4, 2, 5].
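A minimal sketch of this integer representation, assuming 0-based positions, sorts the starting positions by the suffix that begins at each of them. The name `suffix_array` is chosen here for illustration. This naive approach still builds each suffix temporarily while sorting and is far slower than the dedicated construction algorithms, so it is only suitable for small inputs.

```python
def suffix_array(t):
    """Return the starting positions of the suffixes of t in sorted order."""
    # Naive construction: compare whole suffixes while sorting the positions.
    return sorted(range(len(t)), key=lambda i: t[i:])

print(suffix_array("abcabx"))  # [0, 3, 1, 4, 2, 5]
```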

The suffix-array tag

Many of the questions tagged suffix-array are related to one of the topics below.

  • How to construct suffix arrays efficiently
  • How to store, and possibly compress, them efficiently
  • How to make use of them for various purposes, such as full-text search, detection of regularities in strings, and text compression
  • How they are used in various fields of application, in particular bioinformatics, genetics and natural language processing
  • What existing and/or ready-to-use implementations of any of the above are known
  • Worst-case, average-case and empirical comparisons of time and space requirements of existing algorithms and implementations
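
As an illustration of the full-text search use case in the list above, the following sketch finds all occurrences of a pattern by binary search over the suffix array. Because the suffixes are sorted, all suffixes that start with the pattern form one contiguous block. The function names `suffix_array` and `find_occurrences` are assumptions made for this example, and the construction is the naive one, so treat this as a sketch rather than a tuned implementation.

```python
def suffix_array(t):
    """Naive suffix array: starting positions sorted by their suffix."""
    return sorted(range(len(t)), key=lambda i: t[i:])

def find_occurrences(t, sa, p):
    """Return the sorted starting positions of pattern p in text t."""
    m = len(p)
    # Lower bound: first suffix whose first m characters are >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose first m characters are > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "abcabx"
sa = suffix_array(text)
print(find_occurrences(text, sa, "ab"))  # [0, 3]
```

Each binary search performs O(log n) comparisons of at most m characters, so a lookup costs O(m log n) once the suffix array has been built.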
154 questions
3
votes
1 answer

Accessing the first Character of a String with no Characters

I am implementing a suffix trie in C++. The implementation of the Trie constructor can be seen below. #include #include #include "Trie.hpp" using namespace std; Trie::Trie(string T){ T += "#"; …
Luke Collins
  • 1,433
  • 3
  • 18
  • 36
3
votes
1 answer

Understanding implementation of DC3/Skew algorithm to create Suffix Array linear time

I am trying to understand the implementation of the linear-time suffix array construction algorithm by Kärkkäinen and Sanders. Details of the algorithm can be found here. I managed to understand the overall concept but am failing to match it with the provided implementation…
Devendra Wani
  • 121
  • 2
  • 9
3
votes
1 answer

Minimum number of distinct characters to result in a given suffix array

Given a suffix array, a TopCoder task from SRM 630 asks to find the minimum number of distinct characters in the string that could form a string with the given suffix array. The full problem statement can be found on the TopCoder website. The best…
Ariel
  • 1,222
  • 2
  • 14
  • 25
3
votes
1 answer

What's the significance of suffixes being sorted in suffix array?

I know that the definition of a suffix array itself is that it's a sorted array of all the suffixes of a string. But I am trying to understand what the significance of this sorting operation is here. Suppose we create an array of all the suffixes of the…
discoverAnkit
  • 1,141
  • 1
  • 11
  • 25
3
votes
2 answers

Implementation of suffix array in c++

#include #include #include #include using namespace std; struct xx { string x; short int d; int lcp; }; bool compare(const xx a,const xx b) { return a.x
Rajesh M
  • 634
  • 11
  • 31
3
votes
1 answer

What is considered the best Java Suffix Tree implementation?

I need a suffix tree Java Implementation. After some googling I concluded that the libdivsufsort C implementation is the best one around. Is there a Java implementation of the same (or almost as good) quality and that is preferably open source. The…
Koen Peters
  • 12,798
  • 6
  • 36
  • 59
3
votes
1 answer

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

I don't want a direct solution to the problem that's the source of this question but it's this one link: So I take in the strings and add them to a suffix array which is implemented as a sorted set internally, what I obtain then is a…
Hugo
  • 2,913
  • 1
  • 20
  • 21
2
votes
2 answers

Compressing Suffix Arrays in Java

I have created a suffix array using the Princeton implementation. However, my basic text document is very, very large and the resulting suffix array is over 500mb in size. Is there a way to compress the suffix array? Thanks!
Bob Trop
  • 33
  • 5
2
votes
2 answers

Where would a suffix array be preferable to a suffix tree?

Two closely-related data structures are the suffix tree and suffix array. From what I've read, the suffix tree is faster, more powerful, more flexible, and more memory-efficient than a suffix array. However, in this earlier question, one of the…
templatetypedef
  • 362,284
  • 104
  • 897
  • 1,065
2
votes
4 answers

function for suffix array python

I want to write a function that outputs a suffix array. This is what I have so far: def suffixArray(s): sa = [] for i in range(len(s)): suffix= sorted([s[i:]]) sa = [len(s)-len(suffix[i:]) return list(sa) This outputs an…
2
votes
2 answers

Finding a substring while allowing for mismatches with Ruby

I was reading about a suffix array approach to look for substrings within strings see (http://www.codeodor.com/index.cfm/2007/12/24/The-Suffix-Array/1845) e.g. sa = SuffixArray.new("abracadabra") puts sa.find_substring("aca") where SuffixArray is…
eastafri
  • 2,186
  • 2
  • 23
  • 34
2
votes
1 answer

How can I find the occurrence number of each suffix in a string?

I want to find how many times each suffix of a string occurs in the original string in O(nlogn) or O(n) time. For example, for string aba, suffix a appears twice, ba appears once, aba appears once.
newbie
  • 238
  • 3
  • 15
2
votes
1 answer

C: strcmp error while finding longest repeated substring

I'm trying to create a program which returns the longest repeated substring (LRS). I've almost got the solution, but for some reason my strcmp gives an error while it is finding the LRS. Could someone explain to me why there is an error and how I solve…
SpartanHero
  • 107
  • 8
2
votes
1 answer

Suffix Array Construction O(N LogN) - Competitive Programming 3 Steven Halim

I am reading the book Competitive Programming 3 by Steven Halim and Felix Halim. I'm reading the chapter on strings and trying to understand the suffix array construction algorithm. I don't understand the radix sort part. (Although, I understand…
user1423561
  • 327
  • 1
  • 3
  • 18
2
votes
2 answers

Making LCP from Suffix Array

I am learning about suffix arrays and successfully learnt how to make a suffix array in O(n log n log n) time from this tutorial. Now I am wondering how I would create the LCP array from my suffix array in O(n log n) time or better. Obviously I know the O(n*n)…
user4996457
  • 161
  • 1
  • 7