I have a MongoDB collection with a sample document as follows -
{
"_id" : ObjectId("58114e5e43d6420b7db4e15c"),
"browser" : "Chrome",
"name": "hyades",
"country" : "in",
"day" : "16-10-21",
"ip" : "0.0.0.0",
"class" : "A123"
}
Problem Statement
I should be able to group on any of the fields while fetching the number of distinct IPs per group.
The aggregation query -
[
{$group: {_id: '$class', ip_arr: {$addToSet: '$ip'}}},
{$project: {class: '$_id', ip: {$size: '$ip_arr'}}}
]
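To make the grouping semantics explicit, the pipeline is equivalent to this plain JavaScript over an in-memory array (the sample data below is made up for illustration):

```javascript
// Plain-JS equivalent of the $group/$addToSet + $project/$size pipeline:
// group documents by class, collect a Set of IPs, report the Set's size.
function distinctIpsByClass(docs) {
  var sets = {};
  docs.forEach(function (doc) {
    (sets[doc.class] = sets[doc.class] || new Set()).add(doc.ip);
  });
  return Object.keys(sets).map(function (cls) {
    return { class: cls, ip: sets[cls].size };
  });
}

// Hypothetical sample data:
var out = distinctIpsByClass([
  { class: 'A123', ip: '1.1.1.1' },
  { class: 'A123', ip: '1.1.1.1' }, // duplicate IP, counted once
  { class: 'A123', ip: '2.2.2.2' },
  { class: 'B123', ip: '3.3.3.3' }
]);
// out => [{ class: 'A123', ip: 2 }, { class: 'B123', ip: 1 }]
```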
gives the desired results, but is slow. Similarly, counting IPs with another $group stage is slow. The output is -
[{class: "A123", ip: 42}, {class: "B123", ip: 56}, ..]
What I tried
I considered using HyperLogLog for this, and tried the Redis implementation. I stream the entire collection, projecting only the field I group on and the IP, and PFADD each IP into the corresponding HyperLogLog structure in Redis.
The logic looks like -
var stream = Model.find({}, {ip: 1, class: 1}).stream();
stream.on('data', function (doc) {
    // One HyperLogLog key per class, e.g. "HLL/A123"
    var hash = "HLL/" + doc.class;
    // Fire-and-forget; each PFADD is a separate round trip to Redis
    client.pfadd(hash, doc.ip);
});
I ran this for over a million data points. The data to be streamed was around 1 GB, over a 1 Gbps connection between the Mongo and Node servers. I had expected this code to run fast enough, but it was quite slow (slower than counting in MongoDB).
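One suspicion (an assumption on my part, not something I have profiled) is that issuing one PFADD per document costs a round trip each, and pipelining them in fixed-size batches would amortize that. A sketch of the chunking side, with the node_redis batch API left commented out since I have not verified it here:

```javascript
// Split an array into fixed-size chunks so each chunk can be sent to
// Redis as one pipelined batch rather than one round trip per PFADD.
function chunk(arr, size) {
  var out = [];
  for (var i = 0; i < arr.length; i += size) {
    out.push(arr.slice(i, i + size));
  }
  return out;
}

// Hypothetical use with node_redis (client/batch API assumed):
// chunk(docs, 1000).forEach(function (batchDocs) {
//   var batch = client.batch();
//   batchDocs.forEach(function (doc) {
//     batch.pfadd('HLL/' + doc.class, doc.ip);
//   });
//   batch.exec();
// });
```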
Another idea I considered but did not implement was to pre-create buckets for each class and increment them in real time as the data flows in. But the memory required to support arbitrary grouping was huge, so I had to drop the idea.
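For a rough sense of the memory cost: a dense Redis HyperLogLog key tops out at 12 KB (that figure is Redis's documented dense-encoding size; the bucket count below is made up), so one key per (field, value) bucket adds up quickly:

```javascript
// Rough upper-bound memory estimate: one dense HLL key (~12 KB) per bucket.
// Redis's dense encoding uses 16384 registers * 6 bits = 12288 bytes.
var DENSE_HLL_BYTES = 12 * 1024;

function hllMemoryBytes(numBuckets) {
  return numBuckets * DENSE_HLL_BYTES;
}

// e.g. 100,000 distinct (field, value) buckets:
var bytes = hllMemoryBytes(100000);
// => 1228800000 bytes, roughly 1.2 GB
```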
Please suggest what I might be doing wrong, or what I could improve, so that I can take full advantage of HyperLogLog. (I am not constrained to Redis and am open to any implementation.)