
I am using DataFu's HyperLogLog UDF to estimate the number of unique ids in my dataset. In this case I have 320 million unique ids, each of which may appear multiple times in the dataset.

Dataset: Country, ID.

Here is my code:

REGISTER datafu-1.2.0.jar;

DEFINE  HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();

-- id is a UUID, for example : de305d54-75b4-431b-adb2-eb6b9e546014 
all_ids =
LOAD '$data'
USING PigStorage(';') AS (country:chararray, id:chararray);

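-- One HyperLogLog estimate per country group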
estimate_unique_ids =
FOREACH (GROUP all_ids BY country)
GENERATE
    'Total Ids' as label,
    HyperLogLogPlusPlus(all_ids) as reach;

STORE estimate_unique_ids INTO '$output' USING PigStorage();

Using 120 reducers, I noticed that the majority of them completed within minutes. However, a handful of reducers were overloaded with data and never finished; I killed them after 24 hours.

I thought HyperLogLog was more efficient than counting exactly. What is going wrong here?

mnadig
  • You most likely have a few countries with most of the ids (i.e. your data is skewed). So most of your data is being sent to 1 reducer. It is addressed here (but doesn't seem to be resolved). http://stackoverflow.com/questions/12867846/how-do-you-improve-performance-on-a-pig-job-that-has-very-skewed-data – o-90 Jul 18 '15 at 00:44
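A quick way to check for the skew described in the comment above (a hypothetical diagnostic, not part of the original job) is to count rows per country. COUNT is algebraic, so the combiner pre-aggregates and this job finishes quickly even when one key dominates:

records_per_country =
FOREACH (GROUP all_ids BY country)
GENERATE group AS country, COUNT(all_ids) AS records;

DUMP records_per_country;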

1 Answer


In DataFu 1.3.0, an Algebraic implementation of HyperLogLog was added. This allows the UDF to use the combiner and will probably improve performance in skewed situations.
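A minimal sketch of the same job on 1.3.0 is below. The jar name and the constructor/class path are assumptions (check your distribution); the only functional change I've made is emitting the group key so each estimate is attributable to its country:

REGISTER datafu-1.3.0.jar; -- jar name is an assumption; use your 1.3.0 artifact

-- Same UDF definition; in 1.3.0 it is Algebraic, so partial HyperLogLog
-- sketches are built in the combiner and merged on the reducer.
DEFINE HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();

estimate_unique_ids =
FOREACH (GROUP all_ids BY country)
GENERATE group AS country, HyperLogLogPlusPlus(all_ids) AS reach;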

However, in the comments on the JIRA issue there is a discussion of other performance problems that can arise when using HyperLogLog. The relevant quote is below:

The thing to keep in mind is that each instance of HyperLogLogPlus allocates a pretty large byte array. I can't remember the exact numbers, but I think for the default precision of 20 it is hundreds of KB. So in your example if the cardinality of "a" is large you are going to allocate a lot of large byte arrays that will need to be transmitted from combiner to reducer. So I would avoid using it in "group by" situations unless you know the key cardinality is quite small. This UDF is better suited for "group all" scenarios where you have a lot of input data. Also if the input data is much smaller than the byte array then you could be worse off using this UDF. If you can accept worse precision then the byte array could be made smaller.
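Two adjustments follow from that quote, sketched below. Both are hedged: the optional precision argument to the constructor is an assumption (verify it exists in your DataFu version), and the alias names are mine. The sketch estimates the overall cardinality with a single "group all", so only one HyperLogLog byte array is built per task instead of one per country:

-- Hypothetical smaller precision to shrink each sketch's byte array at the
-- cost of some accuracy; check that your DataFu version supports this constructor.
DEFINE HLL_P12 datafu.pig.stats.HyperLogLogPlusPlus('12');

ids_only = FOREACH all_ids GENERATE id;

-- "group all" scenario recommended in the quote: one sketch for the whole
-- dataset rather than one per group key.
total_reach =
FOREACH (GROUP ids_only ALL)
GENERATE HLL_P12(ids_only) AS reach;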

Eyal