
I'm trying to build a data structure for a word game solver.

I need to store about 150,000 sets of the form {A, A, D, E, I, L, P, T, V, Y}. (They are normalized English words, i.e. words with their characters sorted. Note that these are multisets, which can contain the same letter more than once.)

I need to efficiently get a yes/no answer to the following kind of query: do any of the stored sets contain a given multiset as a subset? For example, do any of the known words contain the set {D, E, I, L, L, P}?

Requirements:

  • Queries must be fast
  • The data structure should fit in a reasonable amount of space (e.g. <50 MB)
  • The data structure need not be built in real time; it's pre-computed.

Is there any data structure out there that would suit this need well? This is a little different from other set matching questions on StackOverflow in that the target sets are actually multisets.

PBJ
  • Sounds like you need to be looking up anagram software for examples. – Orbling Mar 05 '11 at 01:27
  • Funny you should mention that; this is for a kind of anagrams; however, I need to find "near-anagrams" or partial anagrams. i.e. I need to find anagrams by rearranging and adding letters from a given pool. – PBJ Mar 05 '11 at 11:23

3 Answers


This reminds me of a modified prefix tree (trie) that I made once. It's slightly different, but it might work. It may not work if the letter frequencies are large or unbounded, or if you can't translate it to your language (I code in C++).

So basically, in a trie you usually store children corresponding to the next letter, but what I did was store children corresponding to the frequency of each letter.

From my point of view, the question really is: "Are there any sets that have the same number or more of each letter than the subset?" For example, if the subset is { A,D,E,E }, then you need to find out whether there is a set with at least one A, one D, and two E's.
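That "at least as many of each letter" test can be written directly as a comparison of 26-element frequency vectors. A brute-force sketch (helper names are mine) that checks one candidate set against the query:

```cpp
#include <array>
#include <string>

// Count each letter's frequency (assumes uppercase A-Z).
std::array<int, 26> counts(const std::string& s) {
    std::array<int, 26> c{};
    for (char ch : s) c[ch - 'A']++;
    return c;
}

// True if 'word' has at least as many of every letter as 'query',
// i.e. the word's multiset contains the query multiset.
bool containsSubset(const std::array<int, 26>& word,
                    const std::array<int, 26>& query) {
    for (int i = 0; i < 26; i++)
        if (word[i] < query[i])
            return false;
    return true;
}
```

A linear scan over all ~150,000 frequency vectors with this check is the baseline that the trie is trying to beat by pruning branches.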

So, for the trie you have something like this

            Root
           / | \
          / /|\ \
         / / | \ \
        1 2  ... MAX <-- This represents the frequency of "A"
       /|\ ..... /|\
      1..MAX    1..MAX <-- Frequency of "B"
      ...............
      ...............
      ...............
     1 ... ... ... MAX <-- Frequency of "Y"
    /|\ .... .... / | \
   1..MAX ...... 1 .. MAX <-- Frequency of "Z"

Basically, all of the ...'s represent lots of stuff that would take too long to show. The /, | and \ represent parent-child relationships, and MAX represents the maximum frequency of a letter.

So what you do is define a node struct (in C++) of some sort, like:

struct NODE {
    NODE *child[MAX + 1]; // Pointers to other NODE's that represents
                          // the frequency of the next letter
};

When you create a node you need to initialize all its children to NULL. You can do this either through a constructor (in C++) or with a makeNode() function like

NODE* makeNode() {
    NODE* n = new NODE;           // Create a NODE
    for(int i = 0; i <= MAX; i++) // For each child
        n->child[i] = NULL;       // Initialize to NULL
    return n;                     // Return the new node
}

At the start, the trie is just a root

NODE* root = new NODE;

When you add a set to the trie, you get the frequency of each letter and walk down the trie level by level. If, at a particular node, the child corresponding to the next letter's frequency is NULL, you just create a new NODE.

When you search the trie, at each node you search all of the children that correspond to a frequency equal to or larger than that letter's frequency in the subset. For example, if the subset has 3 A's, you search root->child[3], then root->child[4], then ... then root->child[MAX].
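Putting the insert and search descriptions above into code, a minimal sketch might look like this. The value of MAX is an assumption (in a real build you'd set it to the largest per-letter count in your word list), and the `counts` helper is mine:

```cpp
#include <array>
#include <string>

const int MAX = 8;  // assumed cap on any single letter's frequency

struct NODE {
    NODE* child[MAX + 1] = {};  // child[f]: subtree where this letter occurs f times
};

// Insert one word's 26-entry frequency vector, one trie level per letter.
void insert(NODE* root, const std::array<int, 26>& freq) {
    NODE* cur = root;
    for (int letter = 0; letter < 26; letter++) {
        int f = freq[letter];
        if (!cur->child[f])
            cur->child[f] = new NODE;
        cur = cur->child[f];
    }
}

// Search: at each level, try every branch whose frequency is >= the
// query's frequency for that letter (this is the backtracking part).
bool search(NODE* cur, const std::array<int, 26>& query, int letter = 0) {
    if (letter == 26)
        return true;  // satisfied all 26 letters: some word contains the query
    for (int f = query[letter]; f <= MAX; f++)
        if (cur->child[f] && search(cur->child[f], query, letter + 1))
            return true;
    return false;
}

// Letter-frequency helper (assumes uppercase A-Z).
std::array<int, 26> counts(const std::string& s) {
    std::array<int, 26> c{};
    for (char ch : s) c[ch - 'A']++;
    return c;
}
```

For example, after inserting "AADEILPTVY", searching for "ADEIL" succeeds, while "DEILLP" fails because the stored word has only one L.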

It's probably overly complicated and confusing, so: 1) if you think I'm not mad, then please comment on what's confusing, and 2) you may well want to just find a simpler method.

flight
  • I just implemented this and it's very fast to build, and decently space efficient (~6MB for 180k words). It also works well for many queries. However, unfortunately there are degenerate queries that just have to traverse many, many branches. Perhaps an optimization would be to re-order the levels of the tree in order of their max count, minimizing the amount of backtracking needed. – PBJ Mar 05 '11 at 11:21
  • Very interesting! I wonder how this would work when you search for something like "give me all supersets of `[ A,Y]` " ? – CodeNoob Dec 11 '20 at 20:13

Looks like you could try using KD-Trees or a variant.

A related topic to explore would be multi-dimensional range searching/querying.

Caveat: I haven't used these myself, but I hope you might be able to find something useful by reading some literature on the topic above.

Hope that helps.

  • If I understand the suggestion correctly, the idea is to treat each multiset of letters as a 26-element vector. The subset queries then correspond to orthogonal range queries. – mhum Mar 05 '11 at 02:11
  • Did some searching and it sounds like a 26-D range tree is exactly what I need, but it's so complex to implement! – PBJ Mar 05 '11 at 11:19
  • @David: I am guessing there must be off-the-shelf solutions out there. Of course, I haven't tried looking for them myself. –  Mar 05 '11 at 14:19
  • Don't bother. The worst-case query time for a 26-dimensional kd-tree is O(n^(1-1/26)), which is basically linear. The Wikipedia article suggests that in practice, N (150,000) should be much larger than 2^k (2^26 ≈ 64,000,000). – user635541 Mar 05 '11 at 22:43
  • @user: _or a variant_... The point of this answer was to point to the concept of multidimensional range query/search which has a vast literature. Anyway... –  Mar 05 '11 at 23:02
  • Ok, but approximately 0 proposed algorithms in that literature deal well with high-dimensional data like this. – user635541 Mar 05 '11 at 23:23
  • @user: You don't _have_ to deal with 26 dimensions... And please don't tell me you know all there is to know about the existing literature on multi-dimensional range search/query. If you do, I suggest you add an answer! –  Mar 05 '11 at 23:26
  • How could this problem be reduced to fewer than 26 dimensions? I was thinking of range trees, but it actually looks like this may not be feasible. According to http://en.wikipedia.org/wiki/Range_tree it seems a range tree would take O(log(n) ** d) time, and O(n * log(n)**(d-1)) space. With n=180,000 and d=26 that is actually going to be a big problem --- so unfortunately user635541 may be right, at least for this particular algorithm. – PBJ Mar 06 '11 at 03:13
  • @David: You could combine dimensions. Say, merge A and B, C and D, etc. Once you get to the reduced list of candidates, you do a linear search, look up a different multidimensional structure, etc. Basically, combine this approach with different approaches in order to trade off between space and time. Without your actual data and access patterns, the best one can do is suggest _general_ structures which might be useful. What you ask seems to be quite closely connected to range queries; I would actually be interested to see a more efficient solution! –  Mar 06 '11 at 04:38
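The dimension-combining idea from the last comment can be sketched as a cheap pre-filter (names are mine; pairing adjacent letters is just one arbitrary choice): merge the 26 letter counts into 13 coarser counts, and only run the exact per-letter check on words whose coarse signature dominates the query's.

```cpp
#include <array>
#include <string>

// 13-d coarse signature: merge adjacent letter pairs (A+B, C+D, ..., Y+Z).
std::array<int, 13> coarse(const std::array<int, 26>& freq) {
    std::array<int, 13> sig{};
    for (int i = 0; i < 26; i++)
        sig[i / 2] += freq[i];
    return sig;
}

// Necessary condition: if a word contains the query multiset, its coarse
// signature must dominate the query's in every merged dimension.
bool coarseDominates(const std::array<int, 13>& word,
                     const std::array<int, 13>& query) {
    for (int i = 0; i < 13; i++)
        if (word[i] < query[i])
            return false;
    return true;
}

// Letter-frequency helper (assumes uppercase A-Z).
std::array<int, 26> counts(const std::string& s) {
    std::array<int, 26> c{};
    for (char ch : s) c[ch - 'A']++;
    return c;
}
```

Note the filter admits false positives by design (e.g. a single B coarsely "covers" a queried A), so surviving candidates still need the exact 26-letter check; the payoff is that the range structure only has to handle 13 dimensions instead of 26.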

You could probably use a trie: insert each set into it, then traverse it iteratively using your target subset to find out if you have a matching subset. At least, that's how I think I would do it.

The name 'trie' actually comes from reTRIEval. It is pretty much like a normal tree, but its nodes branch on successive characters, for example:

     A
    / \
   AT AN
     / | \
    |  |  AND
   ANN ANY
    |
  ANNA

In the above example, you can see that this is probably useful for your case, as ANN and ANNA can be retrieved like a set. You might want to use some permutation code, along with this type of ADT (Abstract Data Type).

Find more here

atx
  • I considered a trie, but this direct approach doesn't really work. Consider the trie with just one "word" in it, "AANN". Then, we lookup "ANN" to see if it's in the trie, and it wouldn't be. I did try this technique earlier using something like a DAWG (directed acyclic word graph), adding multiple routes to each valid set, but the size was enormous. The main difficulty is that for a subset of length m, you have O(m!) ways to get there--adding characters in different orders. – PBJ Mar 05 '11 at 04:17