Build a data-stucture from a given string S of length n which supports fast queries for checking whether an input string J of length m is a subsequence of S.
S is a static string and pre-processing time of the data-structure can be ignored.
Requirements:
- The space consumption should be linear
O(n)
- The runtime of
subsequence(J)
should depend on m - not necessarilyO(m)
but, the faster the better.
What is subsequence?
A is a subsequence of B if A can be constructed by removing zero or more characters from B. I.e ABA
is a subsequence of ADBDBAC
What I tried
A data-structure which supports the Subsequence(J)
query stores pointers from each letter in S to the next occurrence in S of every letter in the alphabet.
Let A be an array of length n + 1. A contains hash-tables hashed over alphabet, σ. Each key-value pair (k,v) in the hash-table contains some letter k as key and it's next occurrence as value v.
The hash-table A_0 contains the first occurrence of every letter in the alphabet.
The hash-table A_1 contains the index of second occurrence for the letter at S_0 along with the first occurrence of the other letters.
The hash-table A_2 contains the index of second occurrence for the letters S_1 and S_2 assuming they are different letters - otherwise A_2 will contain the third index of the letter at S_1 - along with the first occurrence of the other letters and so on...
Example: If T is B C A D F B
, ¥ represents the hashtable A_0 and represents a Ø null pointer, the data-structure would look like:
|0 1 2 3 4 5
|¥ B C A D B
A|3 3 3 Ø Ø Ø
B|1 5 5 5 5 Ø
C|2 2 Ø Ø Ø Ø
D|4 4 4 4 Ø Ø
The alphabet \sigma is built from the letters in T and is static. Therefore, perfect hashing (FKS) can be used.
Running the queryTo perform the Subsequence(J)
query with the string J, we lookup the A-index of the first occurrence J_0 in S using A_0.
In the example we could query Subsequence("BAB")
to test if BAB
is a subsequence:
* look-up B
in column 0 which returns index 1
* look-up A
in column 1 which returns index 3
* look-up B
in column 3 which returns index 5
As long as we don't pass a null-pointer, the string is subsequence. The hash-lookups take constant time and we have to perform at most |J| of them the runtime is O(|J|).
The space consumption is O(|J|·|S|)