This is motivated by a recent question that asked how to quickly determine whether a query word can be permuted to match some word in a given dictionary. The basic idea for fast queries was simple: during preprocessing, compute for each dictionary word the tuple of how many times each letter of the alphabet occurs, and store that tuple in a hash table. For a query word, compute the same kind of tuple and check whether it is present in the hash table.
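As a minimal sketch of that preprocessing and lookup, assuming a lowercase English alphabet (the names `ALPHABET`, `count_tuple`, and `build_anagram_index` are hypothetical, chosen only for illustration):

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase  # assumption: lowercase English letters only


def count_tuple(word):
    """Return the tuple of per-letter counts for `word` over ALPHABET."""
    counts = Counter(word)
    return tuple(counts[ch] for ch in ALPHABET)


def build_anagram_index(words):
    """Map each count tuple to one dictionary word that produces it."""
    index = {}
    for word in words:
        # setdefault keeps the first word seen for each tuple
        index.setdefault(count_tuple(word), word)
    return index


index = build_anagram_index(["listen", "google", "cat"])
print(index.get(count_tuple("silent")))  # listen
print(index.get(count_tuple("act")))     # cat
print(index.get(count_tuple("dog")))     # None
```

Each query costs one pass over the query word plus a single hash lookup.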

So that problem came down to checking whether a tuple of non-negative integers (the counts of each letter in the alphabet) exactly matches a tuple in the hash table. The table can be built quickly and does not take much memory compared to the size of the original dictionary.

Now I want to extend the problem. In terms of strings, the extended problem is whether a query string can be permuted to match a subsequence of one of the dictionary strings (not necessarily a contiguous subsequence, although the contiguous case is interesting too). In terms of tuples of character counts, this is equivalent to determining whether some tuple in the dictionary dominates the query tuple, i.e. every count in the dictionary word's tuple is greater than or equal to the corresponding count in the query word's tuple.
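To make the dominance condition concrete, here is a small sketch of the per-pair test (the function name `dominates` is hypothetical). It only needs to look at letters that actually occur in the query, since all other query counts are zero:

```python
from collections import Counter


def dominates(dict_word, query):
    """True if every letter count of `query` is <= the count in `dict_word`."""
    have = Counter(dict_word)
    need = Counter(query)
    # Counter returns 0 for letters that are missing from dict_word
    return all(have[ch] >= n for ch, n in need.items())


print(dominates("scratch", "cart"))  # True
print(dominates("scratch", "tact"))  # False: needs two t's, "scratch" has one
print(dominates("cat", "scratch"))   # False
```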

In the hope of getting a fast solution, assume a query only needs a yes/no answer and, if the answer is yes, just one dictionary word (count tuple) that satisfies the query.

Is there any kind of preprocessing, taking a reasonable amount of time and memory in terms of the dictionary size, that would let these subsequence permutation queries be answered faster than the following baseline? Keep multiple copies of the dictionary, each copy sorted by the number of occurrences of one particular character. For a query, pick the copy whose character leaves the fewest candidate words (those with at least as many occurrences of that character as the query string), and linearly search those candidates for a match.

I have a bad feeling that what I want is a potentially very high-dimensional range tree (the dimension is the number of characters in the alphabet), so that range queries can be performed. However, the range queries for this problem have a very special form, so I'm hoping for something better, especially since for an alphabet of size d and a dictionary of n words, the range tree approach would require O(n (log n)^(d-1)) preprocessing time and storage, and queries would take O((log n)^(d-1)) time. Depending on d, the range tree's empirical query time could easily exceed the brute-force O(nw) query time for a dictionary of n words of length at most w, and the brute-force approach requires no preprocessing at all.
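For reference, the brute-force O(nw) baseline mentioned above might look like the following sketch (the name `brute_force_query` and the sample dictionary are hypothetical). It scans every dictionary word and returns the first one that dominates the query:

```python
from collections import Counter


def brute_force_query(dictionary, query):
    """Return one dictionary word dominating `query`, or None if there is none."""
    need = Counter(query)
    for word in dictionary:
        if len(word) < len(query):
            continue  # a shorter word cannot dominate the query
        have = Counter(word)
        if all(have[ch] >= n for ch, n in need.items()):
            return word
    return None


words = ["cat", "scratch", "dog"]
print(brute_force_query(words, "chart"))  # scratch
print(brute_force_query(words, "act"))    # cat
print(brute_force_query(words, "zebra"))  # None
```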

user2566092
    Do you want to know *all matching tuples*, or just the *yes* (the match exists) or *no* (no such match available) answer? – dlask Jul 17 '15 at 20:32
    If the answer to @dlask's question is that you just need a yes or no answer, then you can immediately discard any dictionary word that is dominated by another dictionary word (e.g. you can discard `cat` if `scratch` is in the dictionary, since any query word dominated by `cat` is necessarily also dominated by `scratch`). – j_random_hacker Jul 17 '15 at 22:52
  • @dlask I should have specified, sorry. To leave room for a fast solution, let's say that all we need is yes/no and, if yes, just one word/tuple satisfying the query. I'll update the question to make that clear. – user2566092 Jul 18 '15 at 19:12
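The pruning suggested in the comments can be sketched as follows. This is a simple quadratic pass, not an optimized method, and the name `prune_dominated` is hypothetical. Words are processed longest first, because a word can only be dominated by a word of at least the same length:

```python
from collections import Counter


def prune_dominated(words):
    """Drop every word whose letter counts are dominated by a kept word."""
    # Longest first; ties broken alphabetically so the result is deterministic
    ordered = sorted(set(words), key=lambda w: (-len(w), w))
    kept = []  # list of (word, Counter) pairs
    for word in ordered:
        counts = Counter(word)
        dominated = any(
            all(other[ch] >= n for ch, n in counts.items())
            for _, other in kept
        )
        if not dominated:
            kept.append((word, counts))
    return [w for w, _ in kept]


print(prune_dominated(["cat", "scratch", "dog", "act"]))  # ['scratch', 'dog']
```

Anagrams collapse to a single representative here, which is fine when only one matching word has to be returned.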

1 Answer

If we want to shorten the search phase, the preprocessing phase can generate (and hash) all allowed sub-sequences of all input sequences. The search phase is then reduced to a single hash lookup: check whether the sorted query sequence is present in the precomputed set.

The idea is demonstrated by the following Python code.

The itertools Python library helps us to find combinations.

import itertools

Assume the input data is available as sorted sequences:

# Keep the values ordered
DATA = (
    (1, 2, 2, 4),
    (1, 3, 5, 5),
    (2, 4),
    (2, 4, 6, 6, 6, 7, 8, 9, 10, 11, 12, 13, 15),
)

We can pre-calculate the set of all allowed sub-sequences:

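# Every non-empty combination of each sorted sequence (exponential in the sequence length)
ALLOWED = {c for seq in DATA for size in range(1, len(seq) + 1) for c in itertools.combinations(seq, size)}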

And we can test it:

assert (1,) in ALLOWED
assert (1, 2) in ALLOWED
assert (1, 3) in ALLOWED
assert (1, 4) in ALLOWED
assert (1, 5) in ALLOWED
assert (1, 6) not in ALLOWED
assert (1, 2, 2, 4) in ALLOWED
assert (1, 2, 2, 5) not in ALLOWED
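Applied to words rather than integer tuples, the same idea could look like the sketch below. The names `WORDS`, `key`, `ALLOWED_WORDS`, and `query` are hypothetical, and the two-word dictionary is only for illustration. Because `itertools.combinations` preserves input order, combinations of a sorted tuple are themselves sorted, so sorting the query letters is enough to get a matching key:

```python
import itertools

WORDS = ["scratch", "dog"]  # hypothetical sample dictionary


def key(word):
    """Sorted tuple of letters, the canonical form of a multiset."""
    return tuple(sorted(word))


# Map every non-empty sorted sub-multiset to one word that contains it
ALLOWED_WORDS = {}
for word in WORDS:
    letters = key(word)
    for size in range(1, len(letters) + 1):
        for combo in itertools.combinations(letters, size):
            ALLOWED_WORDS.setdefault(combo, word)


def query(q):
    """Return one dictionary word dominating `q`, or None."""
    return ALLOWED_WORDS.get(key(q))


print(query("chart"))  # scratch
print(query("god"))    # dog
print(query("tact"))   # None
```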
dlask
  • This would probably work well with something like a real world character-based language like English, since word length is pretty low. However in the general case, this approach could blow up memory usage and preprocessing time exponentially. – user2566092 Jul 18 '15 at 20:08
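To put rough numbers on that blow-up, the sketch below counts how many combinations the preprocessing enumerates for one sequence and how many distinct entries it ends up storing (both function names are hypothetical). A sequence of length w yields 2^w - 1 non-empty combinations, and the number of distinct sub-multisets is the product of (count + 1) over its distinct values, minus one for the empty set:

```python
from collections import Counter
from math import prod


def enumerated_combinations(seq):
    """Number of non-empty combinations generated for one sequence."""
    return 2 ** len(seq) - 1


def distinct_submultisets(seq):
    """Number of distinct non-empty entries actually stored for one sequence."""
    return prod(c + 1 for c in Counter(seq).values()) - 1


short = (1, 2, 2, 4)
long_seq = (2, 4, 6, 6, 6, 7, 8, 9, 10, 11, 12, 13, 15)
print(enumerated_combinations(short), distinct_submultisets(short))        # 15 11
print(enumerated_combinations(long_seq), distinct_submultisets(long_seq))  # 8191 4095
```

Both figures grow exponentially with sequence length, which matches the concern raised in the comment.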