2

Can someone please explain to me what exactly is a suffix automaton, and how it works and differs from suffix trees and suffix arrays? I have already tried searching on the web but was not able to come across any clear comprehensive explanation.

I found the following link closest to what I wanted but it is in Russian and the translation to English was difficult to understand.

http://e-maxx.ru/algo/suffix_automata

twasbrillig
KayEs

2 Answers

4

A suffix automaton is a finite-state machine that recognizes all the suffixes of a string. There are many resources on finite-state machines you can read; Wikipedia is a good place to start.

Suffix trees and suffix arrays are data structures containing all the suffixes of a string. There are multiple algorithms to build these structures and to use them to perform string operations efficiently.
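To make the contrast concrete, here is a minimal sketch of a suffix array, assuming the usual definition (the starting indices of all suffixes, in lexicographic order). The function name `naive_suffix_array` is just illustrative, and the direct sort is far slower than the specialised construction algorithms; it only shows what the structure contains.

```python
def naive_suffix_array(s):
    """Return the starting indices of all suffixes of s in lexicographic order."""
    # Sorting the suffixes directly is quadratic or worse in the worst case;
    # this is an illustration of the contents, not an efficient construction.
    return sorted(range(len(s)), key=lambda i: s[i:])


text = "banana"
order = naive_suffix_array(text)
for i in order:
    print(i, text[i:])
```

A suffix array stores the suffixes explicitly as sorted positions, whereas a suffix automaton stores them implicitly as paths through a graph.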

Vinicius Braz Pinto
3

Suffix Automaton:

A suffix automaton (also known as a directed acyclic word graph) is a powerful data structure that makes it possible to solve many string problems.

For example, using a suffix automaton you can search for all occurrences of one string in another, or count the number of distinct substrings of a given string; both tasks can be solved in linear time.

On an intuitive level, a suffix automaton can be understood as compressed information about all the substrings of a given string. An impressive fact is that the suffix automaton stores all this information so concisely that, for a string of length n, it requires only O(n) memory. Moreover, it can also be built in O(n) time (if we consider the alphabet size k to be constant; otherwise in O(n log k)).

Historically, the linear size of the suffix automaton was first established in 1983 by Blumer et al., and in 1985-1986 the first linear-time construction algorithms were presented (Crochemore; Blumer et al.). For more detail, see the references at the end of the linked article.

In the English-language literature this structure is called a "suffix automaton" (plural: "suffix automata"), and the directed acyclic graph of words is called a "directed acyclic word graph" (or simply "DAWG").
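As a rough illustration of the size and construction claims, below is a minimal Python sketch of the commonly used online construction, which adds one character at a time. The names `build_suffix_automaton`, `nxt`, `link` and `length` are my own, and the suffix links are an auxiliary device of this particular construction that the text above does not describe. Transitions are kept in dictionaries, so the running time is only expected to be linear for a constant-size alphabet.

```python
def build_suffix_automaton(s):
    """Build a suffix automaton of s; state 0 is the initial state."""
    nxt = [{}]     # nxt[v][c] = state reached from v by the transition labelled c
    link = [-1]    # suffix link of each state (auxiliary, used only while building)
    length = [0]   # length of the longest string that leads to each state
    last = 0       # state that corresponds to the whole prefix processed so far
    for ch in s:
        cur = len(nxt)
        nxt.append({})
        link.append(-1)
        length.append(length[last] + 1)
        p = last
        # Add the new transition along the suffix-link chain where it is missing.
        while p != -1 and ch not in nxt[p]:
            nxt[p][ch] = cur
            p = link[p]
        if p == -1:
            link[cur] = 0
        else:
            q = nxt[p][ch]
            if length[p] + 1 == length[q]:
                link[cur] = q
            else:
                # Split q: the clone keeps q's transitions but a shorter length.
                clone = len(nxt)
                nxt.append(dict(nxt[q]))
                link.append(link[q])
                length.append(length[p] + 1)
                while p != -1 and nxt[p].get(ch) == q:
                    nxt[p][ch] = clone
                    p = link[p]
                link[q] = clone
                link[cur] = clone
        last = cur
    return nxt, link, length


def count_distinct_substrings(s):
    """Count distinct non-empty substrings using the automaton of s."""
    nxt, link, length = build_suffix_automaton(s)
    # Every non-initial state v accounts for length[v] - length[link[v]] substrings.
    return sum(length[v] - length[link[v]] for v in range(1, len(nxt)))


print(count_distinct_substrings("abb"))  # 5: a, b, ab, bb, abb
```

The counting function is one way to approach the distinct-substring task mentioned earlier: it does a single pass over the states after the automaton has been built.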

The definition of the suffix automaton:

Definition. The suffix automaton for a given string s is the minimal deterministic finite automaton that accepts all suffixes of the string s.

We will explain this definition.

  • A suffix automaton is a directed acyclic graph in which the vertices are called states and the arcs of the graph are the transitions between these states.
  • One of the states, t_0, is called the initial state, and it must be the source of the graph (i.e. all other states are reachable from it).
  • Each transition in the automaton is an arc labelled with some character. All transitions leaving any one state must have different labels. (On the other hand, a state need not have transitions for every character.)
  • One or more of the states are marked as terminal states. If we go from the initial state t_0 along any path to any terminal state and write down the labels of all the arcs traversed, we get a string that must be one of the suffixes of the string s.
  • The suffix automaton contains the minimum number of vertices among all automata that satisfy the above conditions. (Minimality of the number of transitions is not required separately, because minimality of the number of states already implies that the automaton cannot contain "extra" paths; otherwise the previous property would be violated.)
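To see how these conditions fit together, here is a small sketch for s = "abb". The transition table was worked out by hand for this one string, and the names `TRANSITIONS`, `TERMINAL` and `accepts` are only illustrative. State 0 plays the role of t_0, and it is marked terminal only under the assumption that the empty string counts as a suffix.

```python
# Hand-written suffix automaton for s = "abb"; state 0 is the initial state t_0.
TRANSITIONS = {
    0: {"a": 1, "b": 4},
    1: {"b": 2},
    2: {"b": 3},
    3: {},            # no outgoing transitions at all
    4: {"b": 3},
}
# Terminal states; 0 is included on the assumption that "" counts as a suffix.
TERMINAL = {0, 3, 4}


def accepts(word):
    """Return True if following word from t_0 ends in a terminal state."""
    state = 0
    for ch in word:
        if ch not in TRANSITIONS[state]:
            return False  # missing transition: word is not even a substring
        state = TRANSITIONS[state][ch]
    return state in TERMINAL


for w in ["abb", "bb", "b", "ab", "a", "ba"]:
    print(repr(w), accepts(w))
```

Here "abb", "bb" and "b" are accepted, "ab" and "a" end in non-terminal states, and "ba" fails because state 4 has no transition labelled "a".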

Elementary properties of the suffix automaton:

The simplest and yet most important property of the suffix automaton is that it contains information about all the substrings of the string s. Namely, the labels of the arcs along any path starting from the initial state t_0, written out in order, necessarily form a substring of s. Conversely, any substring of s corresponds to some path starting from the initial state t_0.

To simplify the explanation, we will say that a substring corresponds to the path from the initial state whose labels spell out that substring. Conversely, we will say that any path corresponds to the string formed by the labels of its arcs.

Each state of the suffix automaton is reached by one or more paths from the initial state. We will say that a state corresponds to the set of strings that correspond to all of these paths.
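The correspondence between paths, substrings and states can be checked on a small case. The sketch below uses a transition table written out by hand for s = "abb" (an assumption for illustration, as are the names `TRANSITIONS` and `strings_by_state`), enumerates every path from state 0, and groups the spelled-out strings by the state where the path ends.

```python
from collections import defaultdict

# Hand-written suffix automaton for s = "abb"; state 0 is the initial state t_0.
TRANSITIONS = {
    0: {"a": 1, "b": 4},
    1: {"b": 2},
    2: {"b": 3},
    3: {},
    4: {"b": 3},
}


def strings_by_state(transitions, start=0):
    """Map each state to the set of strings spelled by paths from start to it."""
    groups = defaultdict(set)
    stack = [(start, "")]
    # The graph is acyclic, so this depth-first enumeration terminates.
    while stack:
        state, text = stack.pop()
        groups[state].add(text)
        for ch, target in transitions[state].items():
            stack.append((target, text + ch))
    return dict(groups)


groups = strings_by_state(TRANSITIONS)
for state in sorted(groups):
    print(state, sorted(groups[state]))
```

State 3 is reached by two paths and so corresponds to the set {"abb", "bb"}, while every other state here corresponds to a single string. Taken together, the sets contain exactly the substrings of "abb" (plus the empty string for the initial state).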

EXAMPLES:

[Image: examples of suffix automata]

Mehran