
Given a string s of length n, is it possible to count the number of distinct substrings in s in O(n)?

Example

Input: abb

Output: 5 ('abb', 'ab', 'bb', 'a', 'b')

I have done some research, but I can't seem to find an algorithm that solves this problem that efficiently. I know an O(n^2) approach is possible, but is there a more efficient algorithm?

I don't need to obtain each of the substrings, just the total number of distinct ones (in case it makes a difference).

Andriy Ivaneyko
donrondon

2 Answers


You can use Ukkonen's algorithm to build a suffix tree in linear time:

https://en.wikipedia.org/wiki/Ukkonen%27s_algorithm

The number of distinct substrings of s is then the number of non-empty prefixes of the strings in the trie, which you can compute in linear time: it's just the total number of characters in all nodes.

For instance, your example produces a suffix tree like:

            /\                
           b  a
           |  b
           b  b

5 characters in the tree, so 5 substrings. Each unique string is a path from the root ending after a different letter: abb, ab, a, bb, b. So the number of strings is the number of letters in the tree.

More precisely:

  • Every substring is the prefix of some suffix of the string;
  • All the suffixes are in the trie;
  • So there is a 1-1 correspondence between substrings and paths through the trie (by the definition of trie); and
  • There is a 1-1 correspondence between letters in the tree and non-empty paths, because:
    • each distinct non-empty path ends at a distinct position after its last letter; and
    • the path to the position following each letter is unique
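
The correspondence above can be checked on small inputs with a plain, uncompressed suffix trie. This is only a sketch for illustration: it takes O(n^2) time and space, so it is not the linear-time construction, and the function name `count_distinct_substrings_trie` is made up for this example.

```python
def count_distinct_substrings_trie(s: str) -> int:
    """Count distinct non-empty substrings with an uncompressed suffix trie.

    O(n^2) time and space; meant to illustrate the counting argument only.
    """
    root = {}      # each trie node is a dict: letter -> child node
    letters = 0    # number of letters (non-root nodes) in the trie
    for start in range(len(s)):
        node = root
        # Insert the suffix s[start:] one letter at a time.
        for ch in s[start:]:
            if ch not in node:
                node[ch] = {}
                letters += 1  # a new letter is a new distinct substring
            node = node[ch]
    return letters


print(count_distinct_substrings_trie("abb"))  # 5
```

Each letter added to the trie is the end of exactly one path from the root, so the final letter count equals the number of distinct substrings.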

NOTE for people who are wondering how it could be possible to build a tree that contains O(N^2) characters in O(N) time:

There's a trick to the representation of a suffix tree. Instead of storing the actual strings in the nodes of the tree, you just store pointers into the original string, so the node that contains "abb" doesn't have "abb", it has (0,3) -- 2 integers per node, regardless of how long the string in each node is, and the suffix tree has O(N) nodes.
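
As a rough sketch of that representation, the counting step only needs the index pairs, never the label strings. The edge list below is written out by hand for "abb" (an assumption for the example, matching the two branches in the diagram above); a real suffix tree builder such as Ukkonen's algorithm would produce it. The helper name `total_label_length` is hypothetical.

```python
s = "abb"

# Hand-built labels of the suffix tree for "abb", stored as half-open
# (start, end) index pairs into s. Written by hand for this example;
# a suffix tree builder would normally produce these pairs.
edges = [
    (0, 3),  # label s[0:3] == "abb"
    (1, 3),  # label s[1:3] == "bb"
]


def total_label_length(edges):
    """Sum the label lengths without ever materializing the labels."""
    return sum(end - start for start, end in edges)


print(total_label_length(edges))  # 5
```

Because each label costs two integers and the sum touches each pair once, the count takes time proportional to the number of nodes rather than to the number of characters they represent.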

Matt Timmermans
  • Thanks for your answer. The Wikipedia article you referenced says that Ukkonen's algorithm achieves O(n) time, but only for constant-sized alphabets. What does this mean? Also, I don't understand why the number of substrings of `s` is the "total number of characters in all nodes" (of the tree Ukkonen's algorithm produces). – donrondon Jan 19 '16 at 20:18
  • "constant-sized alphabets" means there are a limited number of characters to choose from in the string, like 26 letters, or 256 bytes, or 65536 characters, etc. The alternative is suffix trees for sequences over infinite alphabets like arbitrary unbounded integers. – Matt Timmermans Jan 20 '16 at 01:27
  • I added some explanation to answer your other question – Matt Timmermans Jan 20 '16 at 01:45
  • I appreciate your effort, it's much clearer now. Checked as best answer. – donrondon Jan 20 '16 at 09:20
  • @MattTimmermans Say for example, my original string `s="abbabbab"`. Then, can you please explain what you will store in the node (for O(n) time complexity) and how will you make sure that you do not count the same substring twice? – Nannan AV Aug 31 '18 at 12:20

Construct the suffix array and its LCP array, then subtract the sum of the LCP values from the total number of substrings, n(n+1)/2.
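
The reasoning: in sorted order, the suffix starting at position p contributes n - p prefixes, and exactly LCP-many of them already appeared as prefixes of the previous suffix. A minimal sketch of the LCP approach, under some assumptions: the suffix array here is built with a simple comparison sort (not linear time; a linear-time suffix array construction would be needed for an overall O(n) bound), the LCP array uses Kasai's algorithm, and all function names are made up for this example.

```python
def suffix_array(s: str) -> list:
    # Simple comparison sort of all suffixes. Easy to read, but NOT linear
    # time; swap in a linear-time construction for an O(n) bound.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s: str, sa: list) -> list:
    """Kasai's algorithm: lcp[i] is the longest common prefix length of the
    suffixes at sa[i - 1] and sa[i]; lcp[0] is 0. Runs in O(n)."""
    n = len(s)
    rank = [0] * n
    for i, start in enumerate(sa):
        rank[start] = i
    lcp = [0] * n
    h = 0  # current match length, carried over between iterations
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix just before s[i:] in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1  # the next suffix shares at least h - 1 characters
        else:
            h = 0
    return lcp


def count_distinct_substrings_lcp(s: str) -> int:
    n = len(s)
    sa = suffix_array(s)
    # All substrings, minus those already seen as prefixes of the
    # preceding suffix in sorted order.
    return n * (n + 1) // 2 - sum(lcp_array(s, sa))


print(count_distinct_substrings_lcp("abb"))  # 5
```

For "abb" the sorted suffixes are "abb", "b", "bb" with LCP values 0, 0, 1, so the count is 6 - 1 = 5.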

David Eisenstat