Fastest way to perform subset test operation on a large collection of sets with same domain

Question

Assume we have trillions of sets stored somewhere. The domain for each of these sets is the same. It is also finite and discrete. So each set may be stored as a bit field (eg: 0000100111...) of a relatively short length (eg: 1024). That is, bit X in the bitfield indicates whether item X (of 1024 possible items) is included in the given set or not.

Now, I want to devise a storage structure and an algorithm to efficiently answer the query: what sets in the data store have set Y as a subset. Set Y itself is not present in the data store and is specified at run time.

Now the simplest way to solve this would be to AND the bitfield for set Y with bit fields of every set in the data store one by one, picking the ones whose AND result matches Y's bitfield.

How can I speed this up? Is there a tree structure (index) or some smart algorithm that would allow me to perform this query without having to AND every stored set's bitfield?

Are there databases that already support such operations on large collections of sets?

What type of database are you using? A proprietary format? SQL Server? — Petrus Theron, Dec 28 '10 at 00:57
The choice of DB will depend on whether it efficiently supports the given set operations on humongous sets. None of the SQL DBS will scale to the required size (RDMS DBs would be a poor choice for this problem anyway). So the choice is either a specialized DB or a DB that I will implement myself. — niktech, Dec 28 '10 at 01:55
Have you found some solution? It is strange that there are no well-known databases for this task. — actual, Feb 21 '11 at 05:47

score 4 · Answer 1 · answered Dec 28 '10 at 06:29

4

If you can preprocess the sets, the subset relation is representable as a DAG (because you're describing a poset). If the transitive reduction is computed, then I think you can avoid testing all the sets by just performing a DFS starting from the biggest sets and stopping whenever Y is no longer a subset of the current set being visited.

answered Dec 28 '10 at 06:29

lijie

4,811
22
26

Can you elaborate? Are you basically talking about building a DAG like the following http://en.wikipedia.org/wiki/File:Hypercubeorder_binary.svg but only with nodes from the collection of existing sets? How would I pick the starting node when I do the DFS? – niktech Dec 28 '10 at 09:04
2

yes, essentially. there is an edge from a set A to a set B if A is a superset of B. Using the transitive reduction is better because the number of edges decreases (so the branching factor should also decrease so less useless nodes are examined). Since the graph is acyclic, there is going to be a set of nodes which have no edges that enter them, and you can start from there (these represent the sets which have no supersets in your collection). You'd have to start DFS on all of these (or just start from a virtual node connected to all these sets-with-no-supersets). – lijie Dec 28 '10 at 10:03
Interesting. I'll keep this algorithm in mind, although the collection of sets in the data store is unlikely to have many subset/superset relationships in it, so I would end up doing DFS on a lot of starting nodes. – niktech Dec 28 '10 at 10:37
1

if there are too many heads, then you might want to construct some virtual nodes (that are union of pairs of heads that are similar, say). this might help to discard parts of the collection faster, although it really depends on whether Y is expected to match lots of the sets in the collection -- just a random thought. – lijie Dec 28 '10 at 15:47

score 1 · Answer 2 · answered Dec 28 '10 at 16:39

1

Depending on the cardinality of the set from which all the sets are drawn, one option might be to build an inverted index mapping from elements to the sets that contain them. Given a set Y, you could then find all sets that have Y as a subset by finding all of the sets that contain each element individually and computing their intersection. If you store the lists in sorted order (for example, by numbering all the sets in your database with values 0, 1, etc.) then you should be able to compute this intersection fairly efficiently, assuming that no one element is contained in too many sets.

answered Dec 28 '10 at 16:39

templatetypedef

362,284
104
897
1,065

Good point. The cardinality of sets in the datastore is ~ <= 1024. Now the tricky part will be doing the intersection efficiently. The result of the intersection can be as large as the whole collection of sets or as small as a couple dozen sets. What intersection algorithms would you recommend? – niktech Dec 28 '10 at 19:26
I know that in the case where you have two sorted sequences and want to compute the intersection, you can do so by repeating the following: while the two lists aren't empty, look at the first value of each sequence. If they aren't the same, remove the smaller of the two. If they are the same, then you have detected a value in the intersection. This runs in time O(n + m), where n and m are the lengths of the two sequences. If you run this procedure on pairs of sequences, then on the results, etc. this runs in O(n lg k), where k is the # of sequences and n the max length of a sequence. – templatetypedef Dec 29 '10 at 06:44

score 0 · Answer 3 · answered Dec 28 '10 at 00:57

0

I tend to say that the answer is no, because of the bit field very low cardinality.

answered Dec 28 '10 at 00:57

Paulo Scardine

73,447
11
124
153

score 0 · Answer 4 · answered Dec 28 '10 at 01:03

0

This would be a stretch on a conventional RDBMS based on your volume, have you looked at Neo4j which is based on a graph storage model?

answered Dec 28 '10 at 01:03

Ta01

31,040
13
70
99

1

Does it efficiently support working with large sets? From my understanding it's more useful for storing graphs, not sets. – niktech Dec 28 '10 at 02:02

score 0 · Answer 5 · answered Dec 28 '10 at 16:43

0

A quick glance make me think of BDDs - which is somewhat along the idea of the DAG solution. Alternatively a ZDD.

answered Dec 28 '10 at 16:43

I GIVE CRAP ANSWERS

18,739
3
42
47

score 0 · Answer 6 · answered Feb 09 '11 at 21:40

If an RDBMS was your only option, I would recommend looking at this interesting article on modelling a DAG in SQL:

http://www.codeproject.com/KB/database/Modeling_DAGs_on_SQL_DBs.aspx?msg=3051183

If you can't afford Oracle or MSSQL, have a look at PostgresQL 9, which supports recursive queries. It's also supported cross joins for quite some time.

Fastest way to perform subset test operation on a large collection of sets with same domain

6 Answers6

Linked