Linear space data-structure supporting subsequence query on a static string

Question

Build a data-structure from a given string S of length n which supports fast queries for checking whether an input string J of length m is a subsequence of S.

S is a static string and pre-processing time of the data-structure can be ignored.

Requirements:

The space consumption should be linear, O(n).
The runtime of subsequence(J) should depend on m. It need not be O(m), but the faster the better.

What is subsequence?

A is a subsequence of B if A can be constructed by removing zero or more characters from B. For example, ABA is a subsequence of ADBDBAC.

What I tried

A data-structure which supports the Subsequence(J) query stores pointers from each letter in S to the next occurrence in S of every letter in the alphabet.

Let A be an array of length n + 1. Each entry of A is a hash-table keyed by the letters of the alphabet σ. Each key-value pair (k,v) in a hash-table has some letter k as key and its next occurrence in S as value v.

The hash-table A_0 contains the first occurrence of every letter in the alphabet.
The hash-table A_1 contains the index of the second occurrence of the letter at S_0, along with the first occurrence of the other letters.
The hash-table A_2 contains the index of the second occurrence of the letters at S_0 and S_1, assuming they are different letters (otherwise A_2 contains the third occurrence of the letter at S_0), along with the first occurrence of the other letters, and so on. In general, A_i maps each letter to its next occurrence after the first i letters of S.

Example: If T is B C A D B, ¥ marks the column of the hash-table A_0, and Ø represents a null pointer, the data-structure would look like:

      | 0 1 2 3 4 5
      | ¥ B C A D B
    --+------------
    A | 3 3 3 Ø Ø Ø
    B | 1 5 5 5 5 Ø
    C | 2 2 Ø Ø Ø Ø
    D | 4 4 4 4 Ø Ø

The alphabet σ is built from the letters in T and is static. Therefore, perfect hashing (FKS) can be used.

Running the query

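To perform the Subsequence(J) query with the string J, we first look up the index of the first occurrence of J_0 in S using A_0. That index selects the hash-table in which we look up J_1, and so on for the remaining letters of J.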

In the example we could query Subsequence("BAB") to test if BAB is a subsequence:

* look up B in column 0, which returns index 1
* look up A in column 1, which returns index 3
* look up B in column 3, which returns index 5

As long as no look-up returns a null pointer, J is a subsequence of S. Each hash look-up takes constant time and we perform at most |J| of them, so the runtime is O(|J|).

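The space consumption is O(|σ|·|S|), since each of the |S| + 1 hash-tables stores up to |σ| entries. This is linear only when the alphabet size is constant.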
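For concreteness, here is a minimal Python sketch of the structure described above. It is only an illustration under a few assumptions: positions are 1-based as in the example table, a plain Python dict stands in for the FKS perfect hash, a missing key plays the role of the Ø null pointer, and the names `build_next_table` and `is_subsequence_table` are made up for this sketch.

```python
def build_next_table(s):
    """Return a list `table` of len(s) + 1 dicts.

    table[i][c] is the smallest 1-based position p > i with s[p - 1] == c.
    A missing key corresponds to the null pointer in the example table.
    """
    n = len(s)
    table = [None] * (n + 1)
    nxt = {}                  # nothing occurs after the end of s
    table[n] = nxt
    for i in range(n, 0, -1):
        nxt = dict(nxt)       # copy: every column holds up to |alphabet| entries
        nxt[s[i - 1]] = i     # the letter at position i is the nearest one after i - 1
        table[i - 1] = nxt
    return table


def is_subsequence_table(table, j):
    """Follow one pointer per letter of j; fail on the first missing entry."""
    pos = 0                   # start in column 0 (the A_0 table)
    for c in j:
        pos = table[pos].get(c)
        if pos is None:
            return False
    return True
```

With the example string, `is_subsequence_table(build_next_table("BCADB"), "BAB")` follows the indices 1, 3, 5 and returns True. The copy made for every column is what makes the space grow with the alphabet size times the length of the string.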

@wp78de My bad, first time poster - I've taken a look at the docs and added a what did I try section. — DannyDannyDanny, Jul 25 '18 at 13:44
@pkpnd yes, the string **S** is given before we build the datastructure. — DannyDannyDanny, Jul 25 '18 at 13:45
@DannyDannyDanny That's not what I asked. How many different letters are there? Could S contain all different characters, no matter how long S is? — k_ssb, Jul 27 '18 at 05:59
Ah, well, there is no restriction on what makes up **S**. However, considering we only start building the datastructure once **S** is given, one could say that the alphabet size is the number of unique letters in **S**. — DannyDannyDanny, Jul 27 '18 at 11:50

score 1 · Accepted Answer · answered Jul 20 '18 at 00:27

The simple and slow way to check whether or not J is a subsequence of S is:

1. Start at the beginning of S.
2. For each character c in J, in order, move forward in S to the next occurrence of c.
3. Iff you make it to the end and find a match for every character, then J is a subsequence of S.
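As a rough Python sketch of these steps (the function name `is_subsequence_naive` is just an illustrative choice, and `str.find` does the forward scan):

```python
def is_subsequence_naive(j, s):
    """Scan s left to right, matching the characters of j in order."""
    pos = 0                     # first index of s that is still unused
    for c in j:
        pos = s.find(c, pos)    # next occurrence of c at or after pos
        if pos == -1:
            return False        # ran off the end of s without a match
        pos += 1                # continue after the matched character
    return True
```

In the worst case this touches every character of S once per query, so it takes O(n) time regardless of how short J is.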

You can accelerate these searches by building a map from each character that occurs in S to a sorted array of the positions at which that character occurs.

Then, to find the next occurrence of a character in step (2), you can look up the position array for that character and do a binary search in the array for the next occurrence at or after the current position.

Total worst-case complexity to do a subsequence check would be O(m log n).
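A minimal Python sketch of this idea, assuming 0-based positions and using the standard `bisect` module for the binary search. The names `build_positions` and `is_subsequence_bisect` are hypothetical, chosen only for this example.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positions(s):
    """Map each character of s to the sorted list of indices where it occurs."""
    positions = defaultdict(list)
    for i, c in enumerate(s):
        positions[c].append(i)  # indices are appended in increasing order
    return dict(positions)


def is_subsequence_bisect(positions, j):
    """Check j against the position map with one binary search per character."""
    pos = 0                             # first index of s that is still unused
    for c in j:
        occ = positions.get(c)
        if occ is None:
            return False                # c never occurs in s
        k = bisect_left(occ, pos)       # first occurrence at index >= pos
        if k == len(occ):
            return False                # no occurrence of c left
        pos = occ[k] + 1
    return True
```

Every index of S is stored exactly once across all the lists, so the map takes O(n) space, and each of the m characters of J costs one binary search over at most n positions.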

Thanks, your idea was a big help and pointed me in the right direction. I found an improved method that maps to Y-fast-tries instead of sorted arrays. This gives a time complexity of **O(m log log n)**. — DannyDannyDanny, Jul 26 '18 at 08:53
@pkpnd not so - you don't need to use an array for the character -> positions map, and you don't need to store positions for characters that don't appear. — Matt Timmermans, Jul 27 '18 at 12:42
