Hi all and thanks in advance. I am new to the NoSQL game but my current place of employment has tasked me with set comparisons of some big data.

Our system has customer tag sets and targeted tag sets. A tag is an 8-digit number.
A customer tag set may have up to 300 tags but averages 100 tags.
A targeted tag set may have up to 300 tags but averages 40 tags.

Pre-calculating is not an option, as we are shooting for a potential customer base of a billion users.

(These tags are hierarchical so having one tag implies that you also have its parent and ancestor tags. Put that info aside for the moment.)

When a customer hits our site, we need to intersect their tag set against one million targeted tag sets as fast as possible. The customer set must contain all elements of the targeted set to match.
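The matching rule above is a plain subset test. A minimal sketch using in-process Python sets (the tag values are made up; the real system would hold these in Redis rather than in application memory):

```python
# A targeted tag set matches a customer only if EVERY one of its
# tags is present in the customer's tag set.
def matches(customer_tags: set, targeted_tags: set) -> bool:
    return targeted_tags.issubset(customer_tags)

customer = {11111111, 22222222, 33333333, 44444444}

assert matches(customer, {11111111, 22222222})      # all targeted tags present
assert not matches(customer, {11111111, 99999999})  # 99999999 is missing
```

The hard part is not this predicate but running it against one million targeted sets per request, which is what the answers below address.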

I have been exploring my options, and the set intersection in Redis seems like it would be ideal. However, my trawling through the internet has not revealed how much RAM would be required to hold one million tag sets. I realize the intersection would be lightning fast, but is this a feasible solution with Redis?

I realize this is brute force and inefficient. I also wanted to use this question as a means to get suggestions for ways this type of problem has been handled in the past. As stated before, the tags are stored in a tree. I have begun looking at MongoDB as a possible solution as well.

Thanks again

MFD3000
  • This is a typical storage/memory usage vs. processing time dilemma, isn't it? You can calculate the resulting tag set on tag updates, store it, and serve it faster, or make a dynamic calculation when the data is really needed. You may consider choosing the first option if tag updates are not that common, or think about a clustered database option (Clustrix, for example) – Ege Özcan Jun 19 '12 at 06:57
  • Thank you. I should have specified. We currently precalculate, but if we succeed as a company, we could be looking at a billion potential customers. I will review Clustrix – MFD3000 Jun 19 '12 at 07:04
  • Mongodb offers nothing for set intersection. And if you get some RAM (like 100+ GB), you can store quite a number of keys in redis :) – Sergio Tulentsev Jun 19 '12 at 10:28
  • As others have mentioned, MongoDB hasn't got anything special going for it for fast intersection. Redis has good set support, but afaik nothing special for fast intersections, such as bitset intersection, etc. Have a look at Lucene/Solr for fast implementations for example (which you may be able to use as a reference). Memory-wise: 1 mil tags is 1 mil bits, + a hashmap containing 1 mil tags once. So that should be doable :) – Geert-Jan Jun 19 '12 at 15:08
  • Redis has an efficient intset data structure, a smart intersection algorithm for multiple sets, and can manipulate bitsets with the BITOP command if needed (http://redis.io/commands/bitop) – Didier Spezia Jun 19 '12 at 20:12
  • @Didier thanks, learning every day – Geert-Jan Jun 20 '12 at 09:15
  • Thanks guys. This was all really interesting. Gave me some good thoughts on Redis. I've also come up with some architectural structuring for MongoDB that might get me where I need to be. I'll post that Stack Overflow discussion when I get there – MFD3000 Jun 21 '12 at 15:56

3 Answers

This is an interesting problem, and I think Redis can help here.

Redis can store sets of integers using an optimized "intset" format. See http://redis.io/topics/memory-optimization for more information.

I believe the correct data structure here is a collection of targeted tag sets, plus a reverse index to map tags to their targeted tag sets.

To store two targeted tag sets:

 0 -> [ 1 2 3 4 5 6 7 8 ]
 1 -> [ 2 6 7 8 9 10 ]

I would use:

 # Targeted tag sets
 sadd tgt:0 1 2 3 4 5 6 7 8
 sadd tgt:1 2 6 7 8 9 10
 # Reverse index
 sadd tag:0 0
 sadd tag:1 0
 sadd tag:2 0 1
 sadd tag:3 0
 sadd tag:4 0
 sadd tag:5 0
 sadd tag:6 0 1
 sadd tag:7 0 1
 sadd tag:8 0 1
 sadd tag:9 1
 sadd tag:10 1

This reverse index is quite easy to maintain when targeted tag sets are added/removed from the system.
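To see why maintenance is easy, here is a minimal in-memory model of the two structures (the `tgt:<id>` sets plus the tag-to-set-ids reverse index), using Python dicts and sets where the real system would issue SADD/SREM commands:

```python
from collections import defaultdict

targeted = {}              # set_id -> set of tags (the tgt:<id> keys)
reverse = defaultdict(set) # tag -> ids of targeted sets containing it

def add_targeted_set(set_id, tags):
    targeted[set_id] = set(tags)
    for t in tags:
        reverse[t].add(set_id)       # SADD tag:<t> <set_id>

def remove_targeted_set(set_id):
    for t in targeted.pop(set_id, ()):
        reverse[t].discard(set_id)   # SREM tag:<t> <set_id>

add_targeted_set(0, [1, 2, 3, 4, 5, 6, 7, 8])
add_targeted_set(1, [2, 6, 7, 8, 9, 10])
assert reverse[2] == {0, 1}

remove_targeted_set(1)
assert reverse[2] == {0}
```

Each add or remove touches only the tags of the one set being changed, so the index stays consistent without any global rebuild.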

The global memory consumption depends on the number of tags which are common to multiple targeted tag sets. It is quite easy to store pseudo-data in Redis and simulate the memory consumption. I have done it using a simple node.js script.

For 1 million targeted tag sets (tags being 8-digit numbers, 40 tags per set), the memory consumption is close to 4 GB when there are very few tags shared by the targeted tag sets (more than 32M entries in the reverse index), and about 500 MB when the tags are shared a lot (only 100K entries in the reverse index).

With this data structure, finding the targeted tag sets containing all the tags of a given customer is extremely efficient.

1- Get customer tag set (suppose it is 1 2 3 4)
2- SINTER tag:1 tag:2 tag:3 tag:4
   => result is a list of targeted tag sets having all the tags of the customer

The intersection operation is efficient because Redis is smart enough to order the sets per cardinality and starts with the set having the lowest cardinality.
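A rough sketch of that cardinality-ordering trick, using plain Python sets (this illustrates the idea; it is not Redis's actual C implementation):

```python
# Intersect starting from the smallest set: the running result can
# only shrink, so later sets are checked against the fewest elements,
# and an empty result lets us stop early.
def sinter(*sets):
    ordered = sorted(sets, key=len)   # smallest cardinality first
    result = set(ordered[0])
    for s in ordered[1:]:
        result &= s
        if not result:                # early exit: nothing left to keep
            break
    return result

assert sinter({0, 1}, {0, 1, 2, 3}, {0, 2}) == {0}
```

Starting from the smallest set bounds the work by the smallest cardinality rather than the largest, which matters when one tag is very common and another is rare.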

Now I understand you need to implement the converse operation (i.e. finding the targeted tag sets having all their tags in the customer tag set). The reverse index can still help.

Here is an example in ugly pseudo-code:

1- Get customer tag set (suppose it is 1 2 3 4)
2- SUNIONSTORE tmp tag:1 tag:2 tag:3 tag:4
   => result is a list of targeted tag sets having at least one tag in common with the customer
3- For t in tmp (iterating on the selected targeted tag sets)
      n = SCARD tgt:t (cardinality of the targeted tag set)
      intersect = SINTER customer tgt:t
      if n == len(intersect), this targeted tag set matches

So you never have to test the customer tag set against 1M targeted tag sets. You can rely on the reverse index to restrict the scope of the search to an acceptable level.
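The three steps above can be sketched end to end in Python, with dicts and sets standing in for the Redis keys (SUNIONSTORE for step 2, SCARD/SINTER for step 3); the sample data matches the two targeted sets used earlier:

```python
from collections import defaultdict

def find_matching_sets(customer, targeted, reverse):
    # Step 2: union the reverse-index entries for the customer's tags
    # -> candidate sets sharing at least one tag with the customer.
    candidates = set()
    for tag in customer:
        candidates |= reverse.get(tag, set())
    # Step 3: a candidate matches if its full cardinality survives
    # intersection with the customer set.
    return sorted(t for t in candidates
                  if len(targeted[t] & customer) == len(targeted[t]))

targeted = {0: {1, 2, 3, 4, 5, 6, 7, 8}, 1: {2, 6, 7, 8, 9, 10}}
reverse = defaultdict(set)
for sid, tags in targeted.items():
    for t in tags:
        reverse[t].add(sid)

# Customer has every tag of set 1 but is missing tags of set 0.
assert find_matching_sets({2, 6, 7, 8, 9, 10, 42}, targeted, reverse) == [1]
assert find_matching_sets({1, 2, 3}, targeted, reverse) == []
```

Only the candidate sets from step 2 are ever tested in step 3, which is the point: the reverse index prunes the other targeted sets without touching them.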

Didier Spezia

This might be helpful:

Case Study: Using Redis intersect on very large sets (120M+ with 120M+)

http://redis4you.com/articles.php?id=016&name=Case+Study%3A+Using+Redis+intersect+on+very+large+sets

Nick
  • The link is down. Here is the archived version of this article: https://web.archive.org/web/20170226145031/http://redis4you.com/articles.php?id=016&name=Case+Study%3A+Using+Redis+intersect+on+very+large+sets – guilemon Oct 13 '20 at 04:03

The answers provided helped me initially. However, as our customer base grew, I stumbled across a great technique involving Redis string bits and bit operators to perform analytics on hundreds of millions of users very quickly.

Check this article out. antirez, the creator of Redis, also references this a lot.

http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/
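The core of the bitmap idea is one bit per user per event, combined with bitwise AND/OR. A hedged sketch with Python ints standing in for Redis string bitmaps (SETBIT, BITCOUNT, BITOP AND in the real thing); the user ids and event names are made up:

```python
# One bit per user id: bit N set means user N performed the event.
def set_bit(bitmap: int, user_id: int) -> int:
    return bitmap | (1 << user_id)           # analogous to SETBIT key N 1

def count_bits(bitmap: int) -> int:
    return bin(bitmap).count("1")            # analogous to BITCOUNT

visited = 0
for uid in (1, 3, 7):
    visited = set_bit(visited, uid)

clicked = 0
for uid in (3, 7, 9):
    clicked = set_bit(clicked, uid)

# Users who both visited and clicked: bitwise AND (analogous to BITOP AND).
both = visited & clicked
assert count_bits(both) == 2                 # users 3 and 7
```

At a bit per user, even hundreds of millions of users fit in tens of megabytes per event, which is why this scales where per-user sets do not.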

MFD3000