I am using DataFu's HyperLogLogPlusPlus UDF in Pig to estimate the number of unique ids in my dataset. The dataset contains roughly 320 million unique ids, each of which may appear multiple times.
Dataset schema: (country, id).
Here is my Pig script:
REGISTER datafu-1.2.0.jar;
DEFINE HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();

-- id is a UUID, for example: de305d54-75b4-431b-adb2-eb6b9e546014
all_ids = LOAD '$data'
          USING PigStorage(';') AS (country:chararray, id:chararray);

-- one estimate per country; 'group' holds the country key for each group
estimate_unique_ids = FOREACH (GROUP all_ids BY country)
                      GENERATE group AS country,
                               HyperLogLogPlusPlus(all_ids.id) AS reach;

STORE estimate_unique_ids INTO '$output' USING PigStorage();
Running with 120 reducers, I noticed that the majority of them completed within minutes. However, a handful of reducers received far more data than the rest and never finished; I killed them after 24 hours.
I thought HyperLogLog was more efficient than exact counting. What is going wrong here?
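For reference, here is my mental model of why HyperLogLog should be cheap, as a minimal toy sketch in Python (my own illustration of the raw HLL estimator, not DataFu's implementation; the precision `P` and the SHA-1-based hash are arbitrary choices for the example). Each element costs one hash and one register update, and two sketches combine by element-wise max, so I expected the aggregation to parallelize well:

```python
import hashlib

P = 14        # precision: 2^14 = 16384 registers
M = 1 << P

def _hash64(value: str) -> int:
    # 64-bit hash derived from SHA-1 (illustrative choice only)
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

def add(registers, value):
    # O(1) per element: hash, pick a register, keep the max rank
    h = _hash64(value)
    idx = h >> (64 - P)                  # top P bits choose the register
    rest = h & ((1 << (64 - P)) - 1)     # remaining 50 bits
    rank = (64 - P) - rest.bit_length() + 1  # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def merge(a, b):
    # sketches merge by element-wise max -- the mergeable part of HLL
    return [max(x, y) for x, y in zip(a, b)]

def estimate(registers):
    # raw HyperLogLog estimator (no small/large-range corrections)
    alpha = 0.7213 / (1 + 1.079 / M)
    return alpha * M * M / sum(2.0 ** -r for r in registers)
```

Building one sketch over all ids, or two sketches over halves of the ids and merging them, gives the same registers and therefore the same estimate, which is why I assumed the heavy lifting could be spread across reducers.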