6

If I have a set of tags (<100), and a set of objects (~25000), where each object has some sub-set of the tags, do you know of an existing data-structure that would allow for fast retrieval of those objects that satisfy some boolean function of the tags?

Addition/deletion of tags and objects need not be particularly fast, but selection of those objects with tags that satisfy the boolean function should be.

Now that I have written my question down, it looks as if I'm describing an in-memory database. Originally, though, I was thinking of some binary-tree-like structure for the objects where, at each level, taking the left or right branch would be equivalent to deciding whether the object has or does not have some tag. But that would not allow "don't-care" tags, would it? I am asking because I wondered whether this has been done before, and I find it hard to google for data structures.

  • Thanks in advance - Paddy.
Paddy3118
  • I note that the answer here: http://stackoverflow.com/questions/3538322/many-to-many-data-structure-in-python is to use an in-memory DB. – Paddy3118 Aug 22 '10 at 10:22
  • Can the boolean function be different, say, based on user input, or is it just one function (or a known set of functions)? If not, a database looks like the best option and a query language will probably be your best bet. You could otherwise simulate a database and incrementally build a decision tree depending on the inputs and cache this tree (it acts as an index). – dirkgently Aug 22 '10 at 10:26
  • Hi dirkgently, The function would be based on user input, and fast enough would be difficult to assess so soon in the project, but because it is early - I would like to explore options. Thanks. – Paddy3118 Aug 22 '10 at 10:37
  • So, you really have two options: 1) Go for an existing DB engine, or 2) create a complex B-Tree based structure to remember queries. Also, you may be well served to optimize the queries before jumping into a search/retrieval (which potentially augments your cache). – dirkgently Aug 22 '10 at 10:43

2 Answers

6

Here's a suggestion: Use a bit-array for each tag, with as many elements as there are objects, where each index represents one object. The value at an index is 1 if that object has that tag, and 0 otherwise.

Boolean functions on tags are then fast bitwise (set) operations on these bit-arrays, and the resulting bit-array gives you the objects that satisfy the criteria.

This is not very efficient if the tags or objects keep changing frequently, but it is perhaps applicable for you.
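A minimal sketch of the idea in Python, using arbitrary-precision integers as the bit-arrays (bit `i` of a tag's integer is set when object `i` has that tag). The class name `TagIndex`, its methods, and the sample tags are all made up for illustration; this is one possible way to do it, not a tuned implementation.

```python
from typing import Dict, Iterable, List


class TagIndex:
    """One integer bitmask per tag; bit i is 1 if object i has the tag."""

    def __init__(self, objects_tags: List[Iterable[str]]) -> None:
        self.n = len(objects_tags)
        # Mask with one bit set per object, used to implement NOT.
        self.all_mask = (1 << self.n) - 1
        self.bits: Dict[str, int] = {}
        for i, tags in enumerate(objects_tags):
            for tag in tags:
                self.bits[tag] = self.bits.get(tag, 0) | (1 << i)

    def has(self, tag: str) -> int:
        """Bitmask of objects that carry the tag (0 for an unknown tag)."""
        return self.bits.get(tag, 0)

    def without(self, tag: str) -> int:
        """Bitmask of objects that do not carry the tag."""
        return self.all_mask & ~self.has(tag)

    def ids(self, mask: int) -> List[int]:
        """Turn a result bitmask back into a list of object indices."""
        return [i for i in range(self.n) if (mask >> i) & 1]


# Hypothetical sample data: object index -> set of tags.
objects = [
    {"red", "round"},    # object 0
    {"red", "square"},   # object 1
    {"blue", "round"},   # object 2
    {"blue"},            # object 3
]
idx = TagIndex(objects)

# "round AND NOT red"; tags that are not mentioned are "don't care".
round_not_red = idx.ids(idx.has("round") & idx.without("red"))

# "red OR (blue AND round)"
red_or_blue_round = idx.ids(idx.has("red") | (idx.has("blue") & idx.has("round")))
```

With fewer than 100 tags and about 25000 objects, each mask is only a few kilobytes, so a query is a handful of bitwise operations on large integers. Adding an object means updating the mask of every tag it carries, which is where the cost of frequent changes shows up.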

Miserable Variable
0

How fast do you need it to be? How complex are your boolean functions, i.e. how many tags are used in a single typical function?

How about using an in-memory SQL database? You could then express the boolean function with a simple SELECT query.
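As a rough illustration, here is how that could look with Python's built-in `sqlite3` module and an in-memory database. The table name `object_tags`, its columns, and the sample rows are assumptions made for this sketch; a real schema would probably also have an objects table so that objects with no tags are still visible to queries.

```python
import sqlite3

# ":memory:" asks SQLite for a database that lives only in RAM.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per (object, tag) pair.
conn.execute(
    "CREATE TABLE object_tags ("
    " object_id INTEGER NOT NULL,"
    " tag TEXT NOT NULL,"
    " PRIMARY KEY (object_id, tag))"
)
# Index to speed up lookups by tag.
conn.execute("CREATE INDEX idx_tag ON object_tags (tag)")

rows = [
    (0, "red"), (0, "round"),
    (1, "red"), (1, "square"),
    (2, "blue"), (2, "round"),
    (3, "blue"),
]
conn.executemany("INSERT INTO object_tags VALUES (?, ?)", rows)

# Boolean function "round AND NOT red" expressed as a SELECT.
query = """
SELECT DISTINCT object_id
FROM object_tags
WHERE object_id IN (SELECT object_id FROM object_tags WHERE tag = ?)
  AND object_id NOT IN (SELECT object_id FROM object_tags WHERE tag = ?)
ORDER BY object_id
"""
result = [row[0] for row in conn.execute(query, ("round", "red"))]
```

Each tag in the boolean function becomes an `IN` or `NOT IN` subquery, combined with `AND`/`OR`; whether this is fast enough for user-supplied functions would need to be measured.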

Juha Syrjälä