
I have a collection in MongoDB with a sample doc as follows -

{
    "_id" : ObjectId("58114e5e43d6420b7db4e15c"),
    "browser" : "Chrome",
    "name": "hyades",
    "country" : "in",
    "day" : "16-10-21",
    "ip" : "0.0.0.0",
    "class" : "A123"
}

Problem Statement

I need to be able to group on any of the fields while fetching the number of distinct IPs.

The aggregation query -

[
    {$group: {_id: '$class', ip_arr: {$addToSet: '$ip'}}},
    {$project: {_id: 0, class: '$_id', ip: {$size: '$ip_arr'}}}
]

gives the desired results, but is slow. Counting the IPs with a second $group instead of $size is similarly slow (a sketch of that variant follows the output below). The output is -

[{class: "A123", ip: 42}, {class: "B123", ip: 56}, ...]
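The second-$group variant I mention is along these lines (a sketch over the same fields; it first dedupes the (class, ip) pairs, then counts the pairs per class):

[
    {$group: {_id: {class: '$class', ip: '$ip'}}},
    {$group: {_id: '$_id.class', ip: {$sum: 1}}},
    {$project: {_id: 0, class: '$_id', ip: 1}}
]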

What I tried

I considered using HyperLogLog for this, and tried the Redis implementation. I stream the entire collection, projecting just the field I group on (plus the IP), and PFADD each IP into a corresponding HyperLogLog structure in Redis.

The logic looks like -

// Stream only the fields needed (Mongoose query stream), and fold each
// IP into a per-class HyperLogLog key in Redis.
var stream = Model.find({}, {ip: 1, class: 1}).stream();
stream.on('data', function (doc) {
    var hash = "HLL/" + doc.class;  // one HLL per distinct class value
    client.pfadd(hash, doc.ip);
});

I ran this over a million-plus data points. The data to be streamed was around 1 GB, over a 1 Gbps connection between the Mongo and Node servers. I had expected this code to run fast, but it was pretty slow (slower than counting in MongoDB).
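For completeness, the read-back side (not shown above) would be along these lines. This is a sketch assuming the same node_redis client and the HLL/<class> key pattern from the code above; a full version would loop the SCAN cursor until it returns '0':

stream.on('end', function () {
    // Find the per-class HLL keys and read one approximate count each.
    client.scan('0', 'MATCH', 'HLL/*', 'COUNT', 100, function (err, res) {
        // res[0] is the next cursor, res[1] the batch of keys
        res[1].forEach(function (key) {
            client.pfcount(key, function (err, count) {
                console.log(key.slice(4), count);  // strip the "HLL/" prefix
            });
        });
    });
});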

Another thing I considered, but didn't implement, was pre-creating buckets for each class and incrementing them in real time as data flows in. But the memory required to support arbitrary groupings was huge, so I had to drop the idea.

Please suggest what I might be doing wrong, or what I could improve here, so that I can take full advantage of HyperLogLog (I am not constrained to Redis, and am open to any implementation).

hyades
  • What are your indexes on this collection? – dyouberg Nov 17 '16 at 15:33
  • Your question is hard to understand and the title is misleading. – styvane Nov 17 '16 at 15:37
  • @dyouberg indexes are on day, name, class if it matters. – hyades Nov 17 '16 at 15:45
  • If you want to use Hyperloglog for counting, you should call `PFADD` each time you insert an item into MongoDB. The current solution won't be faster, since you still have to read data from MongoDB. – for_stack Nov 18 '16 at 14:25
  • @for_stack If I am getting it correctly, I would be pre-computing the count for every possible value, across all possible fields to group on. The memory cost turned out to be very high here. – hyades Nov 18 '16 at 18:59

1 Answer


Disclaimer: I've only used redis from C++ & Python, so this may not help, but...

PFADD supports multiple arguments. In a system where I was using Redis HLLs for counting unique entries, I found that batching items and sending a single PFADD with many of them (on the order of 100) in one go resulted in a significant speedup - presumably due to avoiding per-item round-trips to the Redis server.
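Against the stream in your question, the batching might look roughly like this (a sketch; I haven't used node_redis myself, and BATCH_SIZE is just an illustrative value):

var BATCH_SIZE = 100;
var buffers = {};  // per-class buffers of IPs awaiting a flush

stream.on('data', function (doc) {
    var hash = "HLL/" + doc.class;
    (buffers[hash] = buffers[hash] || []).push(doc.ip);
    if (buffers[hash].length >= BATCH_SIZE) {
        client.pfadd(hash, buffers[hash]);  // one round-trip for the whole batch
        buffers[hash] = [];
    }
});
stream.on('end', function () {
    // Flush whatever is left in each buffer.
    Object.keys(buffers).forEach(function (hash) {
        if (buffers[hash].length) client.pfadd(hash, buffers[hash]);
    });
});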

codedstructure
  • Yes, that is surely one optimization. However, I am running Redis on the same box, so round-trip times won't be wasted. I want to know whether streaming all the data to Redis would be a good idea? – hyades Nov 21 '16 at 14:10
  • I was also using redis on the same box (TCP localhost, didn't try unix sockets) and the speedup from PFADD I mention above was on that setup. – codedstructure Nov 21 '16 at 14:52