
I need a way of storing sets of arbitrary size for fast querying later on. I'll need to query the resulting data structure for the stored sets that contain a given query set (including exact matches).

=== Later edit: To clarify, an accepted answer to this question would be a link to a study that proposes a solution to this problem. I'm not expecting people to develop the algorithm themselves. I've been looking over the tuple clustering algorithm found here, but it's not exactly what I want, since from what I understand it 'clusters' the tuples into simpler, discrete/approximate forms and loses the original tuples.

Now, an even simpler example:

[alpha, beta, gamma, delta], [alpha, epsilon, delta], [gamma, niu, omega], [omega, beta]

Query:

[alpha, delta]

Result:

[alpha, beta, gamma, delta], [alpha, epsilon, delta]

So the set elements are just that: unique, unrelated elements. Forget about types and values. The elements can be tested against each other for equality and that's it. I'm looking for an established algorithm (which probably has a name and a scientific paper on it) rather than creating one now, on the spot.

== Original examples:

For example, say the database contains these sets

[A1, B1, C1, D1], [A2, B2, C1], [A3, D3], [A1, D3, C1] 

If I use [A1, C1] as a query, these two sets should be returned as a result:

[A1, B1, C1, D1], [A1, D3, C1]

Example 2:

Database:

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]
[number of car seats: 2, Gasoline amount: 2L]

Query:

[Distance to Berlin: 240km]

Result:

[Gasoline amount: 5L, Distance to Berlin: 240km, car paint: red]
[Distance to Berlin: 240km, car paint: blue, number of car seats: 2]

There can be an unlimited number of 'fields' such as Gasoline amount. A solution would probably involve the database grouping and linking sets that share common states (such as Distance to Berlin: 240km) in such a way that the query is as efficient as possible.

What algorithms are there for such needs?

I am hoping there is already an established solution to this problem instead of just trying to find my own on the spot, which might not be as efficient as one tested and improved upon by other people over time.

Clarifications:

  • If it helps answer the question, I'm intending on using them for storing states: Simple example: [Has milk, Doesn't have eggs, Has Sugar]
  • I'm thinking such a requirement might require graphs or multidimensional arrays, but I'm not sure

Conclusion: I've implemented the two algorithms proposed in the answers, Set-Trie and Inverted Index, and did some rudimentary profiling on them. Illustrated below is the duration of a query for a given set, for each algorithm. Both algorithms worked on the same randomly generated data set consisting of sets of integers. The algorithms seem equivalent (or almost) performance-wise:

[Benchmark chart: query duration per set for Set-Trie vs. Inverted Index]

Ed Rowlett-Barbu
  • A1 C1 are types or values? – bolov Jun 04 '14 at 10:37
  • Values, sorry if it wasn't clear – Ed Rowlett-Barbu Jun 04 '14 at 10:38
  • Not sure why the question got downvoted...would be helpful if the person provided a reason – Ed Rowlett-Barbu Jun 04 '14 at 10:38
  • Just to be clear: they have different types? – bolov Jun 04 '14 at 10:39
  • because if they have the same types, you don’t have tuples, you have vectors. Huge difference – bolov Jun 04 '14 at 10:40
  • what is the type and range of the values? how large are the vectors? – Pavel Jun 04 '14 at 10:40
  • Yes, they are different types. You can't match a `C?` with an `A?` – Ed Rowlett-Barbu Jun 04 '14 at 10:40
  • sorry, the clarification is still somewhat unclear. do you want to store strings like "has milk" etc, or are they going to be int with some bitwise encoding scheme? how many different states are there? – Pavel Jun 04 '14 at 10:44
  • added one more bullet clarifying the need. The tuple elements can be of any type and value, and there is no known number of them beforehand – Ed Rowlett-Barbu Jun 04 '14 at 10:50
  • Do you see any problems with representing each tuple as a `std::hashmap`, or something like that? That would make this problem very easy. – QuestionC Jun 04 '14 at 11:14
  • How large are the tuples? If they are small enough, you can create an index of all the subsets – Niklas B. Jun 04 '14 at 11:31
  • They're pretty large. They're used to store states of an artificial intelligence. Initially, the tuple can be quite large until the AI figures out certain states in the tuple are irrelevant to the decision making and thus removes them from the tuple. Could reach maybe 50 elements in more advanced cases, maybe – Ed Rowlett-Barbu Jun 04 '14 at 11:33
  • `"Not sure why the question got downvoted"` - possibly because you're just listing your requirements and asking for an algorithm, but showing no attempt at writing / finding one yourself. And we're a programming site more than a research community - expect a self-written algorithm, and (as a personal guideline) you should do (and show) more research than the amount of work you'd expect an answerer to do (thinking through, writing and analysing a concrete algorithm is quite a bit of work - you should show a similar amount of research effort in your question). – Bernhard Barker Jun 04 '14 at 15:37
  • @Dukeling I'm very much aware of the scope of this website, I've been active on it and contributing to it for quite a while now. The only work an answerer has to do is reply if he's aware of any formally known algorithm for my problem, I'm not asking anyone for devising one on the spot, I'm very much capable of that myself :) – Ed Rowlett-Barbu Jun 04 '14 at 16:06
  • Can you give a limit on the total number of tuples, and perhaps some probability of tuples occurring together, so that we can test some algorithms? – Vikram Bhat Jun 17 '14 at 12:59
  • There is no limit on tuple size or count. They can get arbitrarily large since they describe a collection of states, and in an environment there can be any number of states. – Ed Rowlett-Barbu Jun 17 '14 at 14:15
  • I'm curious at all of the comments and answers that seem to be thinking of C++ tuples rather than "this is a database". Why is nobody mentioning SQL? – Mooing Duck Jun 17 '14 at 20:47
  • Are these states and queries unordered? If yes, then they are sets, not tuples. The entire [relational model](https://en.wikipedia.org/wiki/Relational_model) theory, which is what most RDBMS/SQL databases are based on, is dedicated to working on sets of data: how to query them on various criteria, how to optimize the queries, and how they can be stored and implemented efficiently. Perhaps you need an in-memory sqlite database? – Lie Ryan Jun 22 '14 at 13:14
  • Indeed, maybe a database is what I need. But now I've sort of already implemented two of the suggestions in the answers, and I feel like a database wouldn't do any better performance-wise and would be a bit too much for my needs. – Ed Rowlett-Barbu Jun 22 '14 at 17:06
  • Wow, this is really cool that you took the time to run the actual benchmarks! Too bad I can only upvote your question once. I can't really tell how significant the differences are, especially since I don't know how much the random generated data differs from your actual data. – Pavel Jun 22 '14 at 21:28
  • @Pavel well, sad thing is I was expecting noticeable differences between the two, at least in some of the tests. I can't believe how similar they are... – Ed Rowlett-Barbu Jun 23 '14 at 06:52
  • @Pavel and, well, it doesn't really differ at all. Just imagine any state can be assigned a unique integer id, and use that to identify it in the database. Thus, it would be equivalent to my benchmarks. I think it's pretty accurate. Not identical, but close. But the benchmarks are too similar. I'll have to double check the implementation. To be honest, I don't even know which one to use, and whose answer to award the bounty to... – Ed Rowlett-Barbu Jun 23 '14 at 07:05
  • If the implementation is correct and the benchmarks prove accurate, I think both answers deserve the bounty – Ed Rowlett-Barbu Jun 23 '14 at 07:19
  • Thanks for awarding the bounty. I must admit that the answers of Pavel and of aa333 also look very good and, together with my answer, give a range of possible implementations that surprisingly seem to deliver almost equal performance. I wish I could upvote the question more often, because I feel this is a tougher question than the average question on SO, but also a quite interesting one. Judging by the number and quality of the answers, the bounty was well worth it. – NoDataDumpNoContribution Jun 24 '14 at 07:41
  • @Trilarion I agree. Awarding the bounty was a tough choice. The reason I awarded it to Set-Trie is that arranging the data this way more accurately reflects the relations between the sets, and I foresee it's easier to add new operations on this data structure than on the others. The downside is that it probably can't be optimized much more, at least not as much as the hash and bitmap solutions can. Maybe I'll switch to one of the other solutions later on, who knows? They're all great. – Ed Rowlett-Barbu Jun 24 '14 at 07:57

6 Answers


I'm confident that I can now contribute to the solution. One possible, quite efficient way is a:

Trie, invented by Franklin Mark Liang

Such a special tree is used, for example, in spell checking or autocompletion, and it actually comes close to your desired behavior, especially in that it allows searching for subsets quite conveniently.

The difference in your case is that you're not interested in the order of your attributes/features. For your case a Set-Trie was invented by Iztok Savnik.

What is a Set-Trie? A tree where each node except the root contains a single attribute value (number) and a marker (bool) indicating whether there is a data entry at this node. Each subtree contains only attributes whose values are larger than the attribute value of the parent node. The root of the Set-Trie is empty. The search key is the path from the root to a certain node of the tree. The search result is the set of paths from the root to all marked nodes that you reach when you walk down the tree and along the search key simultaneously (see below).

But first a drawing by me:

[Drawing: a simple Set-Trie]

The attributes are {1,2,3,4,5} which can be anything really but we just enumerate them and therefore naturally obtain an order. The data is {{1,2,4}, {1,3}, {1,4}, {2,3,5}, {2,4}} which in the picture is the set of paths from the root to any circle. The circles are the markers for the data in the picture.

Please note that the right subtree from the root does not contain attribute 1 at all. That's the trick.

Searching including subsets: Say you want to search for attributes 4 and 1. First you order them; the search key is {1,4}. Now, starting from the root, you walk simultaneously along the search key and down the tree. This means you take the first attribute in the key (1) and go through all child nodes whose attribute is smaller than or equal to 1. There is only one, namely 1. Inside it you take the next attribute in the key (4) and visit all child nodes whose attribute value is smaller than or equal to 4, which here is all of them. You continue until there is nothing left to do, and collect all circles (data entries) at or below nodes whose attribute value is exactly 4 (the last attribute in the key). These are {1,2,4} and {1,4}, but not {1,3} (no 4) or {2,4} (no 1).

Insertion: Very easy. Go down the tree and store a data entry at the appropriate position. For example, the data entry {2,5} would be stored as a child of {2}.

Adding attributes dynamically: This is naturally supported; you could immediately insert {1,4,6}. It would come below {1,4}, of course.

I hope you understand what I want to say about Set-Tries. In the paper by Iztok Savnik it's explained in much more detail. They are probably very efficient.
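To make this concrete, here is a minimal C++ sketch of the structure described above, assuming the attributes have already been mapped to integers. All the names (Node, SetTrie, supersets) are mine, not from the paper:

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <memory>
#include <set>
#include <vector>

// A stored set is the sorted path from the root to a node with marker == true.
struct Node {
    bool marker = false;
    std::map<int, std::unique_ptr<Node>> children;  // kept sorted by attribute
};

struct SetTrie {
    Node root;

    // Walk down the sorted attributes, creating nodes as needed; mark the end.
    void insert(const std::set<int>& s) {
        Node* n = &root;
        for (int a : s) {
            auto& child = n->children[a];
            if (!child) child = std::make_unique<Node>();
            n = child.get();
        }
        n->marker = true;
    }

    // All stored sets that are supersets of `key`.
    std::vector<std::vector<int>> supersets(const std::set<int>& key) const {
        std::vector<int> k(key.begin(), key.end()), path;
        std::vector<std::vector<int>> out;
        search(&root, k, 0, path, out);
        return out;
    }

private:
    static void search(const Node* n, const std::vector<int>& k, std::size_t i,
                       std::vector<int>& path, std::vector<std::vector<int>>& out) {
        if (i == k.size()) { collect(n, path, out); return; }
        for (const auto& [attr, child] : n->children) {
            if (attr > k[i]) break;  // subtree values only grow, so k[i] can't occur below
            path.push_back(attr);
            search(child.get(), k, i + (attr == k[i] ? 1 : 0), path, out);
            path.pop_back();
        }
    }

    // Key exhausted: every marked node in this subtree is a superset of the key.
    static void collect(const Node* n, std::vector<int>& path,
                        std::vector<std::vector<int>>& out) {
        if (n->marker) out.push_back(path);
        for (const auto& [attr, child] : n->children) {
            path.push_back(attr);
            collect(child.get(), path, out);
            path.pop_back();
        }
    }
};

int main() {
    SetTrie t;
    std::vector<std::set<int>> data = {{1, 2, 4}, {1, 3}, {1, 4}, {2, 3, 5}, {2, 4}};
    for (const auto& s : data) t.insert(s);
    for (const auto& s : t.supersets({1, 4})) {  // prints 1 2 4 and 1 4
        for (int a : s) std::cout << a << ' ';
        std::cout << '\n';
    }
}
```

Querying {1,4} against the example data from the drawing returns {1,2,4} and {1,4}, matching the walkthrough above.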

I don't know if you still want to store the data in a database. I think this would complicate things further, and I don't know what would be best to do then.

NoDataDumpNoContribution

How about having an inverse index built of hashes?

Suppose your values int A, char B, bool C are of different types. With std::hash (or any other hash function) you can create numeric hash values size_t Ah, Bh, Ch.

Then you define a map that maps a hash value to a vector of pointers to the tuples:

std::map<size_t,std::vector<TupleStruct*> > mymap;

or, if you can use global indices, just

std::map<size_t,std::vector<size_t> > mymap;

For retrieval by queries X and Y, you need to

  1. get hash value of the queries Xh and Yh
  2. get the corresponding "sets" out of mymap
  3. intersect the sets mymap[Xh] and mymap[Yh]
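As a rough, hypothetical sketch of those three steps (assuming string values, global indices, and unique elements per set; all names are placeholders):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

struct InvertedIndex {
    std::vector<std::vector<std::string>> sets;  // the stored sets
    // hash of a value -> ids of all sets containing that value (a posting list)
    std::unordered_map<std::size_t, std::vector<std::size_t>> mymap;

    void insert(std::vector<std::string> s) {
        std::size_t id = sets.size();
        for (const auto& v : s)
            mymap[std::hash<std::string>{}(v)].push_back(id);
        sets.push_back(std::move(s));
    }

    // Ids of all stored sets containing every value of the query.
    std::vector<std::size_t> query(const std::vector<std::string>& q) const {
        std::vector<std::size_t> result;
        for (std::size_t i = 0; i < q.size(); ++i) {
            auto it = mymap.find(std::hash<std::string>{}(q[i]));
            if (it == mymap.end()) return {};  // this value occurs nowhere
            if (i == 0) { result = it->second; continue; }
            // Posting lists are sorted because ids are assigned in increasing order.
            std::vector<std::size_t> tmp;
            std::set_intersection(result.begin(), result.end(),
                                  it->second.begin(), it->second.end(),
                                  std::back_inserter(tmp));
            result.swap(tmp);
        }
        return result;
    }
};

int main() {
    InvertedIndex idx;
    idx.insert({"A1", "B1", "C1", "D1"});
    idx.insert({"A2", "B2", "C1"});
    idx.insert({"A3", "D3"});
    idx.insert({"A1", "D3", "C1"});
    for (std::size_t id : idx.query({"A1", "C1"}))  // prints 0 and 3
        std::cout << id << '\n';
}
```

Since two different values could in principle collide to the same hash, real code would probably verify each returned set against the actual query values.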
Pavel
  • I was thinking of something similar, except with graphs. Meaning, state [X, Y] would be an immediate neighbour of and directly linked with states [X] and [Y]. Looking for a common neighbour of [X] and [Y] would find [X, Y]. Similar to your intersection step. I was hoping there is already an established solution to this problem instead of just trying to find my own, which might not be as efficient – Ed Rowlett-Barbu Jun 04 '14 at 11:08
  • in order to choose a data structure and an algorithm, you should find out what are the requirements: how often is the database updated? how large is the DB going to get? how often do users retrieve tuples? how large are the queries? do you need to serialize the data structure? do you need multi-threaded access for r/w? what is going to be the bottleneck? – Pavel Jun 04 '14 at 11:11
  • I upvoted your answer, but I'm really hoping for a more established solution to this problem. – Ed Rowlett-Barbu Jun 04 '14 at 11:42
  • @Zadirion: Such a graph would require an _immense_ amount of space. If we assume that each element takes a single byte, storing a single tuple with 50 elements and each of its neighbors would take 1024TB of space. How big is your hard drive? – Mooing Duck Jun 17 '14 at 20:49
  • @Zadirion: This *is* an established solution to the problem that you've described. It's an [inverted index](http://en.wikipedia.org/wiki/Inverted_index), which is quite well known and the basis of most search engine retrieval algorithms. – Jim Mischel Jun 18 '14 at 12:04
  • Thanks Pavel for the answer. I did some rudimentary profiling vs Set-Trie, it can be found in my question post. – Ed Rowlett-Barbu Jun 22 '14 at 17:44

If I understand your needs correctly, you need a multi-state storing data structure, with retrievals on combinations of these states.

If the states are binary (as in your examples: Has milk/Doesn't have milk, Has sugar/Doesn't have sugar) or can be converted to binary (possibly by adding more states), then you have a lightning-fast algorithm for your purpose: Bitmap Indices

Bitmap indices can do such comparisons in memory, and there is literally nothing comparable to them in speed (ANDing bits is what computers can really do the fastest).

http://en.wikipedia.org/wiki/Bitmap_index

Here's the link to the original work on this simple but amazing data structure: http://www.sciencedirect.com/science/article/pii/0306457385901086

Almost all SQL databases support Bitmap Indexing, and there are several possible optimizations for it as well (by compression etc.):

MS SQL: http://technet.microsoft.com/en-us/library/bb522541(v=sql.105).aspx

Oracle: http://www.orafaq.com/wiki/Bitmap_index
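As a hedged sketch of how this could look in plain C++ (my own minimal version, assuming states are already encoded as small integers; it is not how the databases above implement it): one bitmap per state over all stored rows, and a query ANDs the bitmaps of its states, 64 rows per instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct BitmapIndex {
    std::size_t rows = 0;  // number of stored sets
    // state -> bitmap over all rows (64 rows per word); bit r set means row r has the state
    std::unordered_map<int, std::vector<std::uint64_t>> bitmaps;

    void insert(const std::vector<int>& states) {
        std::size_t word = rows / 64, bit = rows % 64;
        for (int s : states) {
            auto& bm = bitmaps[s];
            if (bm.size() <= word) bm.resize(word + 1, 0);
            bm[word] |= std::uint64_t{1} << bit;
        }
        ++rows;
    }

    // Rows containing every queried state: AND the per-state bitmaps together.
    std::vector<std::size_t> query(const std::vector<int>& states) const {
        std::size_t words = (rows + 63) / 64;
        std::vector<std::uint64_t> acc(words, ~std::uint64_t{0});
        for (int s : states) {
            auto it = bitmaps.find(s);
            if (it == bitmaps.end()) return {};  // state occurs nowhere
            for (std::size_t w = 0; w < words; ++w)
                acc[w] &= w < it->second.size() ? it->second[w] : 0;
        }
        std::vector<std::size_t> hits;  // decode surviving bits into row ids
        for (std::size_t w = 0; w < words; ++w)
            for (std::size_t b = 0; b < 64 && w * 64 + b < rows; ++b)
                if ((acc[w] >> b) & 1) hits.push_back(w * 64 + b);
        return hits;
    }
};
```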

Edit: Apparently the original research work on bitmap indices is no longer available for free public access.
Links to recent literature on this subject:

aa333
  • Hmm, all states can be turned into binary form I think. It's certainly worth investigating. – Ed Rowlett-Barbu Jun 22 '14 at 17:43
  • Turning a state that can take n values into binary will need log(n) binary states, and if n is small, say n < 20, then you can have a direct lookup table that converts the log(n) binary state values to an actual value in O(1) time. – aa333 Jun 22 '14 at 20:28
  • It's a simple data structure, but a problem might be the large number of zeros in the bitmap index when you have many attributes but only search for a few. Otherwise I wouldn't be surprised if the performance is very similar to the other solutions. – NoDataDumpNoContribution Jun 22 '14 at 20:52
  • @Trilarion I'm not sure why a lot of zeroes is a problem. Could you elaborate? – aa333 Jun 22 '14 at 21:02
  • If you mean sets with very few elements, then I don't think it's too big a problem, because you'd be using the minimum possible storage anyways, which will still be equal to or smaller on amortization than a different data structure that needs to store other information as well e.g. tree pointers in Tries. – aa333 Jun 22 '14 at 21:13
  • @aa333 he means there are many states. And a typical "row" in the db will have many zeros, because most sets won't have that many states set compared to the number of columns/possible states in the db. I was thinking about this too; it might not be such a good solution for a problem that can have thousands of possible states. – Ed Rowlett-Barbu Jun 22 '14 at 21:22
  • Yes, I understand that. But 1k possible states mean 1k bits, and a retrieval is 1k ANDs in the worst case. Amortize that over different sizes of sets and consider the constant speedup due to the inexpensiveness of ANDs; it's still efficient. – aa333 Jun 22 '14 at 21:46
  • @aa333 Yes I meant that storage of the binary index is number of entries times number of attributes while for the other methods it's number of entries times a constant which I guess would be less in most cases. It doesn't have to be a disadvantage really because as long as the bitmap fits into memory everything is fine. What I meant was specifically the case of a sparse bitmap, e.g. a binary bitmap which has a large fraction of zeros. – NoDataDumpNoContribution Jun 23 '14 at 08:38
  • @aa333 I'll implement it tonight and see how it fares in the benchmarks. For a max value in a set of 100, the state sets will be encoded in a vector of two 64-bit values, one bit per value: value 0 will be bit 0, value 1 bit 1, value 2 bit 2, etc. At first glance it looks like an AND would check 64 states with one instruction, which is pretty neat. The downside is that a query would have to iterate over the entire database of sets to extract all matches; am I understanding this right? – Ed Rowlett-Barbu Jun 23 '14 at 11:29
  • @aa333 your link to sciencedirect requires a purchase of the article from what I can see. Am I missing something? – Ed Rowlett-Barbu Jun 23 '14 at 13:26
  • Didn't get to finish a full benchmark but at least in the "set size" benchmark it fared about the same as the other two solutions. – Ed Rowlett-Barbu Jun 23 '14 at 22:10

This problem is known in the literature as subset query. It is equivalent to the "partial match" problem (e.g.: find all words in a dictionary matching A??PL? where ? is a "don't care" character).

One of the earliest results in this area is a paper by Ron Rivest from 1976 [1]. A more recent paper, from 2002, is [2]. Hopefully, this will be enough of a starting point to do a more in-depth literature search.

  1. Rivest, Ronald L. "Partial-match retrieval algorithms." SIAM Journal on Computing 5.1 (1976): 19-50.

  2. Charikar, Moses, Piotr Indyk, and Rina Panigrahy. "New algorithms for subset query, partial match, orthogonal range searching, and related problems." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 451-462.

mhum

This seems like a custom-made problem for a graph database. You make a node for each set or subset, and a node for each element of a set, and then you link the nodes with a Contains relationship. E.g.:

[Diagram: set nodes linked by Contains relationships to element nodes A, B, C, D, E]

Now you put all the elements A, B, C, D, E in an index/hash table, so you can find a node in constant time in the graph. Typical performance for a query [A,B,C] will be on the order of the smallest node, multiplied by the size of a typical set. E.g., to find [A,B,C]: the order of A is one, so I look at all the sets A is in (just S1), and then I check that it has both B and C; since the order of S1 is 4, I have to do a total of 4 comparisons.

A prebuilt graph database like Neo4j comes with a query language and will give good performance. I would imagine that, provided the typical orders in your database are not large, its performance is far superior to the algorithms based on set representations.
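For illustration, a hypothetical sketch of that lookup with plain hash tables standing in for the graph (all names are mine): pick the query element with the smallest order, then verify each candidate set against the remaining elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

using Set = std::unordered_set<std::string>;
// element -> ids of the sets containing it (the element node's Contains edges)
using Index = std::unordered_map<std::string, std::vector<std::size_t>>;

std::vector<std::size_t> find(const std::vector<Set>& sets, const Index& index,
                              const std::vector<std::string>& query) {
    // Pick the query element with the smallest order (fewest containing sets).
    const std::vector<std::size_t>* smallest = nullptr;
    for (const auto& e : query) {
        auto it = index.find(e);
        if (it == index.end()) return {};  // an element that occurs nowhere
        if (!smallest || it->second.size() < smallest->size())
            smallest = &it->second;
    }
    if (!smallest) return {};  // empty query
    std::vector<std::size_t> hits;
    for (std::size_t id : *smallest)  // candidates: sets containing the rare element
        if (std::all_of(query.begin(), query.end(),
                        [&](const std::string& e) { return sets[id].count(e) > 0; }))
            hits.push_back(id);
    return hits;
}
```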

phil_20686

Hashing is usually an efficient technique for storage and retrieval of multidimensional data. The problem here is that the number of attributes is variable and potentially very large, right? I googled it a bit and found Feature Hashing on Wikipedia. The idea is basically the following:

  • Construct a hash of fixed length from each data entry (aka feature vector)
  • The length of the hash must be much smaller than the number of available features. The length is important for the performance.

On the Wikipedia page there is an implementation in pseudocode (create a hash for each feature contained in the entry, then increase the feature-vector-hash at this index position (modulo length) by one) and links to other implementations.
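For reference, here is a minimal sketch of that pseudocode in C++; the vector length N is an assumed, tunable parameter (smaller N means more collisions):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Fold an arbitrary number of features into a fixed-length vector.
std::vector<int> feature_hash(const std::vector<std::string>& features, std::size_t N) {
    std::vector<int> x(N, 0);  // N buckets, regardless of how many features exist
    for (const auto& f : features)
        x[std::hash<std::string>{}(f) % N] += 1;  // one increment per hashed feature
    return x;
}
```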

Also here on SO there is a question about feature hashing and, amongst others, a reference to a scientific paper about Feature Hashing for Large Scale Multitask Learning.

I cannot give a complete solution, but you didn't want one anyway. I'm quite convinced this is a good approach. You'll have to play around with the length of the hash as well as with different hashing functions (Bloom filters being another keyword) to optimize the speed for your special case. Also, there might be even more efficient approaches if, for example, retrieval speed is more important than storage (balanced trees maybe?).

NoDataDumpNoContribution
  • Let me know if I understood incorrectly, but storing each feature vector as a hash would prevent finding the vector when searching for a sub-vector of those features (see my question). I don't want just full matches, I also need to find partial matches. – Ed Rowlett-Barbu Jun 17 '14 at 14:39
  • Also, there can be any number of features, they might get added at a later time. So, changing the number of possible features effectively changes the feature-vector-hash size, which invalidates the database up to that point. Maybe I misunderstood... – Ed Rowlett-Barbu Jun 17 '14 at 14:44
  • Adding a feature later is no problem with the proposed hash, since for each feature you add one at a specific position. You can easily undo this or add another one somewhere in case a new feature arrives. Searching for subsets is a bit more difficult though, and it might be impossible with this approach. I will think about it. – NoDataDumpNoContribution Jun 17 '14 at 17:06