I am looking for a distributed solution to screen/filter a large volume of keys in real time. My application generates over 100 billion records per day, and I need a way to filter duplicates out of the stream. I am looking for a system to store a rolling 10 days' worth of keys, at approximately 100 bytes per key. I was wondering how this type of large-scale problem has been solved before using Hadoop. Would HBase be the correct solution to use? Has anyone ever tried a partially in-memory solution like ZooKeeper?
2 Answers
I can see a number of solutions to your problem, but the real-time requirement really narrows it down. By real-time, do you mean you want to see if a key is a duplicate as it's being created?
Let's talk about queries per second. You say 100B/day (that's a lot, congratulations!). That's about 1.16 million queries per second (100,000,000,000 / 86,400 seconds). I'm not sure if HBase can handle that. You may want to think about something like Redis (sharded, perhaps) or Membase/memcached or something of that sort.
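Just to make the arithmetic explicit, here is a quick, untested back-of-the-envelope sketch; the only inputs are the numbers from the question:

```java
public class BackOfEnvelope {
    public static void main(String[] args) {
        long keysPerDay = 100_000_000_000L; // 100 billion records/day, from the question
        long secondsPerDay = 24L * 60 * 60; // 86,400

        double qps = (double) keysPerDay / secondsPerDay;
        long workingSetKeys = keysPerDay * 10; // rolling 10-day window

        System.out.printf("Lookups per second: ~%.2f million%n", qps / 1_000_000); // ~1.16 million
        System.out.println("Keys to hold: " + workingSetKeys + " (~1 trillion)");
    }
}
```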
If you were to do it in HBase, I'd simply push upwards of a trillion keys (10 days x 100B keys) into the table as row keys, with some dummy value stored alongside each one (because you have to store something). Then you can just do a get to figure out whether the key is in there. This is kind of hokey and doesn't really use HBase for what it's good at, since you're only using the keyspace; effectively, HBase is just a b-tree service in this case. I don't think this is a good idea.
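If you did want to try it anyway, the check is just a point get on the row key. A rough, untested sketch with the standard HBase client API follows; the table name seen_keys and column family f are made up for illustration, and the check-then-put here isn't atomic (you'd batch and/or use checkAndPut in practice):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDedupCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("seen_keys"))) { // hypothetical table

            byte[] rowKey = Bytes.toBytes("some-record-key");

            // Point get on the row key: has this key been seen before?
            boolean duplicate = table.exists(new Get(rowKey));

            if (!duplicate) {
                // Store a dummy value; the row key itself is all we actually care about.
                Put put = new Put(rowKey);
                put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("seen"), new byte[] {1});
                table.put(put);
            }
        }
    }
}
```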
If you relax the constraint of doing it in real time, you could use MapReduce in batch to dedup. That's pretty easy: it's just Word Count without the counting. You group by the key you have, and then you'll see the dups in the reducer if multiple values come back. With enough nodes and enough latency, you can solve this problem efficiently. Here is some example code for this from the MapReduce Design Patterns book: https://github.com/adamjshook/mapreducepatterns/blob/master/MRDP/src/main/java/mrdp/ch3/DistinctUserDriver.java
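In the spirit of that linked example, a minimal, untested mapper/reducer pair might look like this (driver setup omitted; it assumes the dedup key is the whole input line, which you'd swap for your own key extraction):

```java
import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// "Word Count without the counting": map each record to (key, null),
// group on the key, and emit each key exactly once in the reducer.
public class DistinctKeys {

    public static class DistinctMapper extends Mapper<Object, Text, Text, NullWritable> {
        @Override
        protected void map(Object offset, Text record, Context context)
                throws IOException, InterruptedException {
            // Here the whole line is treated as the dedup key; plug in your own extraction.
            context.write(record, NullWritable.get());
        }
    }

    public static class DistinctReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // All duplicates of a key arrive in the same reduce call; write the key once.
            context.write(key, NullWritable.get());
        }
    }
}
```

The reducer can also double as the combiner, which cuts the shuffle volume considerably at this scale.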
ZooKeeper is for distributed process communication and synchronization. You don't want to be storing trillions of records in ZooKeeper.
So, in my opinion, you're better served by an in-memory key/value store such as Redis, but you'll be hard pressed to store that much data in memory.
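To give an idea of what that check could look like, here is a minimal untested sketch against a single Redis node using the Jedis 3.x client; SET with NX and EX does the duplicate check, the insert, and a 10-day expiry in one round trip. The key and value names are made up, and in reality you'd shard across many instances:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RedisDedup {
    private static final int TEN_DAYS_SECONDS = 10 * 24 * 60 * 60;

    /**
     * Returns true the first time a key is seen within the last 10 days, false otherwise.
     * SET key value NX EX <ttl> does the existence check, the insert and the expiry
     * in a single round trip, so there is no read-then-write race.
     */
    public static boolean firstTimeSeen(Jedis jedis, String key) {
        String reply = jedis.set(key, "1", SetParams.setParams().nx().ex(TEN_DAYS_SECONDS));
        return "OK".equals(reply); // a null reply means the key already existed
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) { // single node; you'd shard in practice
            System.out.println(firstTimeSeen(jedis, "record-123")); // true on first call
            System.out.println(firstTimeSeen(jedis, "record-123")); // false within 10 days
        }
    }
}
```

Note that the expiry is per key (10 days after the key is first seen), which approximates the rolling window rather than implementing it exactly.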

- I think you have not considered this well. A trillion 100-byte rows in HBase? Do you think that is possible? If we assume yes, can you really get 1.1M queries per second from HBase? Do you know of any available hardware that can store 1 trillion items in memory? – Saeed Shahrivari Nov 21 '13 at 22:42
- No, I don't think it's possible with HBase. What I was suggesting was the only way I could imagine it being possible, which is a stretch. I guess I was illustrating how you would do it, to show that it is not reasonable. You should be able to get 1.1M queries per second out of HBase with a really large cluster. – Donald Miner Nov 21 '13 at 23:15
- Also, I'm just answering the question. You can definitely store 1 trillion items sharded across enough Redis instances, even if it takes 4 racks of them. – Donald Miner Nov 21 '13 at 23:18
- Dear Donald, don't misunderstand me! I don't mean that you are wrong. I'm just saying that it is not practical with commodity hardware and software like HBase and Redis. However, I agree that if you want to solve this you should go with in-memory stores. – Saeed Shahrivari Nov 22 '13 at 14:34
- Thank you all for your comments. I realize this is a challenging problem to solve without spending millions on hardware/software. It sounds like an in-memory solution is best, but I think it needs to be some kind of hybrid memory/disk solution, because it would be too costly to store all 10 days' worth of keys in RAM. I will look into Redis to see if it meets our needs. – scottw Nov 22 '13 at 16:06
I am afraid this is impossible with traditional systems :|
Here is what you have mentioned:
- 100 billion per day means approximately 1 million per second!
- the size of each key is 100 bytes.
- you want to check for duplicates against a 10-day working set, which means about 1 trillion items.
These assumptions mean looking up against a set of 1 trillion objects totalling roughly 90 terabytes!
Any real-time solution must be able to look up 1 million items per second against this volume of data.
I have some experience with HBase, Cassandra, Redis, and Memcached. I am sure that you cannot achieve this performance on any disk-based store like HBase, Cassandra, or HyperTable (and add any RDBMS like MySQL, PostgreSQL, and so on to that list). The best performance of Redis and memcached that I have heard of in practice is around 100k operations per second on a single machine. This means you would need 90 machines, each with 1 terabyte of RAM!
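To make that sizing explicit, here is the rough arithmetic in code form (the per-node numbers are the ballpark figures quoted above, not benchmarks):

```java
public class ClusterSizing {
    public static void main(String[] args) {
        double workingSetTB = 90;          // ~1 trillion keys x 100 bytes
        double ramPerNodeTB = 1;           // a very high-end node
        double opsPerNodeSec = 100_000;    // rough single-node Redis/memcached figure quoted above
        double requiredOpsSec = 1_160_000; // ~1 million+ lookups/inserts per second

        long nodesForMemory = (long) Math.ceil(workingSetTB / ramPerNodeTB);        // 90
        long nodesForThroughput = (long) Math.ceil(requiredOpsSec / opsPerNodeSec); // 12

        // Memory, not throughput, is what drives the machine count here.
        System.out.println("Nodes needed to hold the data in RAM: " + nodesForMemory);
        System.out.println("Nodes needed for the lookup rate:     " + nodesForThroughput);
    }
}
```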
Even a batch-processing system like Hadoop cannot do this job in less than an hour, and I guess it would take hours or days even on a big cluster of 100 machines.
You are talking about very, very big numbers (90 TB, 1M per second). Are you sure about this?

- https://www.google.com/#q=1+trillion+x+100+bytes is only 90 TB, not 100 PB. Still a lot, but possible with around 10 million dollars! – Donald Miner Nov 21 '13 at 23:18
- Dear Donald, you are right. I made a mistake in my calculation and I have edited my answer. But I still think it is not possible with commodity hardware. As you mentioned, it needs millions of dollars. – Saeed Shahrivari Nov 22 '13 at 14:37