1

Given a string and a fixed length l, how can I count the number of distinct substrings of length l? The size of the character set is also known (denote it s). For example, given the string "PccjcjcZ", s = 4 and l = 3, there are 5 distinct substrings: "Pcc", "ccj", "cjc", "jcj", "jcZ".

I tried using a hash table, but it is still slow. In fact, I don't know how to make use of the character set size. Here is what I have done:

// node, hashF, compList and deleteList are helpers defined elsewhere:
// node holds the start index of a stored substring plus a pointer to
// the next node in the bucket's chain; hashF is assumed to return a
// value in [0, 1 << 15); compList scans a chain for a substring equal
// to src[i, i + len). setSize (the alphabet size) is currently unused.
int diffPatterns(const string& src, int len, int setSize) {
  int cnt = 0;
  const int tableSize = 1 << 15;
  node* table[1 << 15];
  for (int i = 0; i < tableSize; ++i) {
    table[i] = NULL;
  }

  unsigned int hashValue = 0;

  int end = (int)src.size() - len;

  for (int i = 0; i <= end; ++i) {
    hashValue = hashF(src, i, len);
    if (table[hashValue] == NULL) {
      table[hashValue] = new node(i);
      ++cnt;
    } else if (!compList(src, i, table[hashValue], len)) {
      // New substring that merely collided with an occupied bucket:
      // count it and chain it in (assuming node has an (index, next)
      // constructor), otherwise later duplicates get counted again.
      table[hashValue] = new node(i, table[hashValue]);
      ++cnt;
    }
  }

  for (int i = 0; i < tableSize; ++i) {
    deleteList(table[i]);
  }

  return cnt;
}
Shangtong Zhang

6 Answers

2

Hash tables are fine and practical, but keep in mind that if the substring length is L and the whole string length is N, then the algorithm is Theta((N + 1 - L) * L), which is Theta(NL) for most L: just computing one hash takes Theta(L) time. Plus there might be collisions.

Suffix trees can be used and provide a guaranteed O(N)-time algorithm (count the number of paths at depth L or greater), but the implementation is complicated. The saving grace is that you can probably find off-the-shelf implementations in the language of your choice.

1

The idea of using a hash table is good. It should work well.

The idea of implementing your own hash table as an array of length 2^15 is bad. See Hashtable in C++? instead.

Douglas Zare
0

You can use an std::unordered_set: insert the substrings into the set and then get the size of the set. Since the values in a set are unique, it takes care of not counting substrings that are the same as ones previously found. This gives you roughly O((N - L + 1) * L) expected complexity, since each of the N - L + 1 substrings of length L has to be built and hashed in O(L) time.

#include <iostream>
#include <string>
#include <unordered_set>


int main()
{
    std::string test = "PccjcjcZ";
    std::unordered_set<std::string> counter;
    size_t substringSize = 3;
    // Insert every window of length substringSize; the set silently
    // drops duplicates. The i + substringSize <= test.size() form
    // avoids size_t underflow when the string is shorter than the
    // window.
    for (size_t i = 0; i + substringSize <= test.size(); ++i)
    {
        counter.insert(test.substr(i, substringSize));
    }

    // The set's size is the number of distinct substrings.
    std::cout << counter.size();

    std::cin.get();
    return 0;
}
NathanOliver
0

Veronica Kham answered the question well, but we can improve this method to expected O(n) and still use a simple hash table rather than a suffix tree or any other advanced data structure.

Hash function

Let X and Y be two adjacent substrings of length L of a string A, more precisely:

X = A[i, i + L - 1]

Y = A[i + 1, i + L]

Let's assign each letter of our alphabet a distinct non-negative integer, for example a := 1, b := 2 and so on.

Let's define a hash function h now:

h(A[i, j]) := (P^(L-1) * A[i] + P^(L-2) * A[i + 1] + ... + A[j]) % M

where P is a prime number, ideally greater than the alphabet size, and M is a very big number denoting the number of different possible hashes; for example, you can set M to the maximum value of an unsigned long long int on your system.
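
For concreteness, here is a direct O(L) evaluation of this h via Horner's rule (the constants P = 257 and M = 10^9 + 7, the use of raw character codes as the letter-to-integer mapping, and the name hashWindow are my illustrative choices, not prescribed by the answer):

#include <string>

// Computes h(A[i, i + L - 1]) = (P^(L-1)*A[i] + ... + A[i+L-1]) % M.
unsigned long long hashWindow(const std::string& A, size_t i, int L) {
    const unsigned long long P = 257;            // prime > alphabet size
    const unsigned long long M = 1000000007ULL;  // large prime modulus
    unsigned long long h = 0;
    for (int k = 0; k < L; ++k)
        h = (h * P + (unsigned char)A[i + k]) % M;  // Horner's rule
    return h;
}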

Algorithm

The crucial observation is the following:

If you have a hash computed for X, you can compute a hash for Y in O(1) time.

Let's assume that we have computed h(X), which obviously can be done in O(L) time. We want to compute h(Y). Notice that X and Y differ in only two characters, X's first and Y's last, so we can update the hash easily using addition and multiplication:

h(Y) = ((h(X) - P^(L-1) * A[i]) * P + A[j + 1]) % M

Basically, we subtract the letter A[i] multiplied by its coefficient P^(L-1) in h(X), multiply the result by P to restore the proper coefficients for the remaining letters, and at the end add the new last letter A[j + 1] (here j = i + L - 1, so A[j + 1] is Y's final character).

Notice that we can precompute the powers of P (modulo M) at the beginning.

Since our hash function returns integers, we can use any hash table to store them. Remember to do all computations modulo M and to avoid integer overflow.
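
Putting the pieces together, here is a minimal self-contained sketch of the whole procedure (same illustrative constants and letter mapping as the window-hash snippet above; the name countDistinct is mine). Note it counts distinct window hashes, so an unlucky collision would undercount:

#include <iostream>
#include <string>
#include <unordered_set>

int countDistinct(const std::string& s, int L) {
    if (L <= 0 || (int)s.size() < L) return 0;
    const unsigned long long P = 257;            // prime > alphabet size
    const unsigned long long M = 1000000007ULL;  // large prime modulus

    // Precompute P^(L-1) % M, the coefficient of the outgoing letter.
    unsigned long long pw = 1;
    for (int k = 0; k < L - 1; ++k) pw = pw * P % M;

    // Hash of the first window in O(L).
    unsigned long long h = 0;
    for (int k = 0; k < L; ++k) h = (h * P + (unsigned char)s[k]) % M;

    std::unordered_set<unsigned long long> seen;
    seen.insert(h);

    // Roll the window in O(1) per step: drop s[i], append s[i + L].
    for (size_t i = 0; i + L < s.size(); ++i) {
        h = (h + M - pw * (unsigned char)s[i] % M) % M; // subtract outgoing letter
        h = (h * P + (unsigned char)s[i + L]) % M;      // shift in incoming letter
        seen.insert(h);
    }
    return (int)seen.size();
}

int main() {
    std::cout << countDistinct("PccjcjcZ", 3) << "\n"; // prints 5
}

With M around 10^9 every intermediate product fits comfortably in 64 bits; taking M as large as the answer suggests would require 128-bit intermediates, or relying on the implicit wrap-around of unsigned arithmetic, which computes modulo 2^64.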

Collisions

Of course, a collision might occur, but since P is prime and M is really huge, it is a rare situation.

If you want to lower the probability of a collision, you can use two different hash functions, for example by using a different modulus in each. If the probability of a collision is p for one such function, then for two independent functions it is p^2, and we can make it arbitrarily small with this trick.
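
One way the two-function trick can be wired up (the packing scheme here is my own choice, not from the answer): if both hashes are computed with moduli below 2^32, pack them into one 64-bit key, so two windows collide only if they collide under both functions.

#include <cstdint>

// h1 is the window's hash modulo M1, h2 its hash modulo M2, with both
// moduli below 2^32; the packed key collides only if h1 AND h2 collide.
uint64_t combinedKey(uint64_t h1, uint64_t h2) {
    return (h1 << 32) | h2;
}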

pkacprzak
0

Use Rolling hashes.

This will make the runtime expected O(n).

This might repeat pkacprzak's answer, but it gives the technique a name, which makes it easier to remember and search for.

0

A Suffix Automaton can also do it in O(N).

It's easy to code, but hard to understand.

Here are papers about it: http://dl.acm.org/citation.cfm?doid=375360.375365

http://www.sciencedirect.com/science/article/pii/S0304397509002370
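
Since the answer doesn't show code, here is a minimal sketch of the standard online suffix automaton construction with a counting step for this problem (the function name is mine). Every non-initial state v represents all substrings with lengths in (len(link(v)), len(v)], and each distinct substring is represented exactly once across states, so the answer is the number of states whose length range contains L:

#include <iostream>
#include <map>
#include <string>
#include <vector>

struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

// Counts distinct substrings of s having length exactly L.
int countDistinctOfLength(const std::string& s, int L) {
    std::vector<State> st(1);        // state 0 is the initial state
    st.reserve(2 * s.size() + 2);    // a SAM has at most 2N - 1 states
    int last = 0;

    for (char c : s) {               // standard online extension by c
        int cur = (int)st.size();
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {                 // split: clone q at length len(p)+1
                int clone = (int)st.size();
                st.push_back(st[q]);
                st[clone].len = st[p].len + 1;
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = clone;
                st[cur].link = clone;
            }
        }
        last = cur;
    }

    // State v covers substring lengths (len(link(v)), len(v)]; count
    // the states whose range contains L.
    int cnt = 0;
    for (size_t v = 1; v < st.size(); ++v)
        if (st[st[v].link].len < L && L <= st[v].len) ++cnt;
    return cnt;
}

int main() {
    std::cout << countDistinctOfLength("PccjcjcZ", 3) << "\n"; // prints 5
}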

Tio Plato