Build a data structure from a given string S of length n that supports fast queries checking whether an input string J of length m is a subsequence of S.

S is a static string, and the pre-processing time of the data structure can be ignored.


Requirements:

  • The space consumption should be linear, O(n).
  • The runtime of subsequence(J) should depend on m. It need not be O(m), but the faster the better.

What is a subsequence?

A is a subsequence of B if A can be constructed by removing zero or more characters from B. For example, ABA is a subsequence of ADBDBAC.

What I tried

A data-structure which supports the Subsequence(J) query stores pointers from each letter in S to the next occurrence in S of every letter in the alphabet.

Let A be an array of length n + 1. Each entry of A is a hash table keyed on the alphabet σ. Each key-value pair (k, v) in a hash table holds some letter k as the key and the position of its next occurrence as the value v.

  • The hash-table A_0 contains the first occurrence of every letter in the alphabet.

  • The hash-table A_1 contains the index of second occurrence for the letter at S_0 along with the first occurrence of the other letters.

  • The hash-table A_2 contains the index of the second occurrence for the letters at S_0 and S_1, assuming they are different letters (otherwise A_2 contains the index of the third occurrence of the letter at S_0), along with the first occurrence of the other letters. The tables A_3, ..., A_n continue in the same way.

Example: If S is B C A D B, ¥ marks the column for the hash table A_0, and Ø represents a null pointer, the data structure would look like this:

      | 0  1  2  3  4  5
      | ¥  B  C  A  D  B
    --+-----------------
    A | 3  3  3  Ø  Ø  Ø
    B | 1  5  5  5  5  Ø
    C | 2  2  Ø  Ø  Ø  Ø
    D | 4  4  4  4  Ø  Ø

The alphabet σ is built from the letters in S and is static. Therefore, perfect hashing (FKS) can be used.
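As a rough illustration of the construction described above, here is a minimal Python sketch that builds the n + 1 tables for the string B C A D B from the table's header row. The function name `build_next_tables` is hypothetical, ordinary Python dicts stand in for the FKS perfect hash tables, and positions are 1-based to match the example (an assumption about the intended convention).

```python
def build_next_tables(s: str) -> list[dict[str, int]]:
    """tables[i][c] = 1-based position of the first c after the first i characters."""
    n = len(s)
    tables: list[dict[str, int]] = [{} for _ in range(n + 1)]
    # Fill from the right: table i equals table i + 1, except that the
    # letter at 0-based index i now has its next occurrence at position i + 1.
    for i in range(n - 1, -1, -1):
        tables[i] = dict(tables[i + 1])
        tables[i][s[i]] = i + 1
    return tables


tables = build_next_tables("BCADB")
for letter in "ABCD":
    # A missing key (printed as None) plays the role of the null pointer Ø.
    print(letter, [t.get(letter) for t in tables])
```

Copying a dict for every position is what makes the structure large: each of the n + 1 tables can hold one entry per letter of the alphabet.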

Running the query

To perform the Subsequence(J) query with the string J, we look up the first occurrence of J_0 in S using A_0. The returned position is also the index of the hash table in A to consult for the next letter of J.

In the example we could query Subsequence("BAB") to test whether BAB is a subsequence:

  • look up B in column 0, which returns index 1
  • look up A in column 1, which returns index 3
  • look up B in column 3, which returns index 5

As long as no lookup returns a null pointer, J is a subsequence of S. Each hash lookup takes constant time and we perform at most |J| of them, so the runtime is O(|J|).
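The query walk can be sketched in a few lines of Python. This is only an illustration under the same assumptions as the example: `build_next_tables` and `subsequence` are hypothetical names, plain dicts replace the perfect hash tables, and a missing key stands for the null pointer.

```python
def build_next_tables(s: str) -> list[dict[str, int]]:
    # tables[i][c] = 1-based position of the first c after the first i characters.
    tables: list[dict[str, int]] = [{} for _ in range(len(s) + 1)]
    for i in range(len(s) - 1, -1, -1):
        tables[i] = dict(tables[i + 1])
        tables[i][s[i]] = i + 1
    return tables


def subsequence(j: str, tables: list[dict[str, int]]) -> bool:
    col = 0  # start in the table A_0
    for c in j:
        col = tables[col].get(c)  # jump to the next occurrence of c
        if col is None:           # null pointer: c does not occur again
            return False
    return True


tables = build_next_tables("BCADB")
print(subsequence("BAB", tables))  # visits columns 0 -> 1 -> 3 -> 5
print(subsequence("DC", tables))   # no C after the D, so this fails
```

The loop performs exactly one dictionary lookup per character of J, which is where the O(|J|) bound comes from.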

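The space consumption is O(|σ|·|S|), since each of the n + 1 hash tables stores up to |σ| entries. This is linear only if the alphabet has constant size.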

DannyDannyDanny

1 Answer

The simple and slow way to check whether or not J is a subsequence of S is:

  1. Start at the beginning of S
  2. For each character c in J, in order, move forward in S to the next occurrence of c.
  3. Iff you make it to the end and find a match for every character, then J is a subsequence of S.
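For reference, the three steps above might look like this in Python. It is a minimal sketch: the function name `is_subsequence_scan` is made up for illustration, and `str.find` with a start offset does the "move forward to the next occurrence" step.

```python
def is_subsequence_scan(j: str, s: str) -> bool:
    pos = 0  # index in s from which the next search starts
    for c in j:
        pos = s.find(c, pos)  # next occurrence of c at or after pos
        if pos == -1:
            return False      # ran off the end of s without a match
        pos += 1              # continue strictly after the matched character
    return True


print(is_subsequence_scan("ABA", "ADBDBAC"))
print(is_subsequence_scan("BAB", "BCADB"))
```

Every character of S is examined at most once, so a single check costs O(n + m) regardless of how short J is.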

You can accelerate these searches by building a map from each character that occurs in S to a sorted array of the positions at which that character occurs.

Then, to find the next occurrence of a character in step (2), you can look up the position array for that character and binary-search it for the first occurrence at or after the current position.

The total worst-case complexity of a subsequence check would be O(m log n), and the position arrays together take O(n) space because each position in S is stored exactly once.
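A minimal Python sketch of this approach, using the standard `bisect` module for the binary search. The names `build_positions` and `is_subsequence` are illustrative, and a dict is assumed for the character-to-positions map.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positions(s: str) -> dict[str, list[int]]:
    # Map each character that occurs in s to the sorted list of its indices.
    positions: dict[str, list[int]] = defaultdict(list)
    for i, c in enumerate(s):
        positions[c].append(i)  # indices arrive in increasing order
    return dict(positions)


def is_subsequence(j: str, positions: dict[str, list[int]]) -> bool:
    pos = 0  # smallest index of s that may still be used
    for c in j:
        occurrences = positions.get(c)
        if occurrences is None:
            return False                   # c never occurs in s
        k = bisect_left(occurrences, pos)  # first occurrence at index >= pos
        if k == len(occurrences):
            return False                   # no occurrence left
        pos = occurrences[k] + 1
    return True


positions = build_positions("BCADB")
print(is_subsequence("BAB", positions))
print(is_subsequence("DC", positions))
```

Only characters that actually occur in S get an entry, so the map never stores more than n positions in total.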

Matt Timmermans
  • Thanks, your idea was a big help and pointed me in the right direction. I found an improved method that maps to Y-fast-tries instead of sorted arrays. This gives a time complexity of O(m log log n). – DannyDannyDanny Jul 26 '18 at 08:53
  • @pkpnd not so - you don't need to use an array for the character -> positions map, and you don't need to store positions for characters that don't appear. – Matt Timmermans Jul 27 '18 at 12:42