I have a series of map-reduce jobs that process user data (implemented with the Cascading framework), and I would like to track many fine-grained statistics. I can have between 100 and 1000 users, with about 20 statistics per user, so potentially anywhere from a few thousand up to 20,000 statistics in total. I wanted to use map-reduce counters to build those stats because they are very convenient to use in the code, but there is a limit on the number of map-reduce counters (120 by default), and according to this post: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/ I should not use them for more than roughly 20 to 50 custom counters.
Question: is there a proper way to track my statistics in this map-reduce context using a counter-like pattern? By "counter-like" I mean having access to counters everywhere in my code and being able to increment them wherever needed.
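
To make concrete what I mean, here is a minimal sketch (class and field names are just examples) of how I increment counters today from inside a Cascading operation, via FlowProcess.increment(), which ends up as a Hadoop counter. This is exactly the per-user pattern that blows past the counter limit:

```java
import cascading.flow.FlowProcess;
import cascading.operation.BaseOperation;
import cascading.operation.Function;
import cascading.operation.FunctionCall;
import cascading.tuple.Fields;

// Hypothetical example: a pass-through Function that bumps one counter
// per (user, statistic) pair -- thousands of counters in total.
public class UserStatsFunction extends BaseOperation<Void> implements Function<Void> {

    public UserStatsFunction() {
        super(Fields.ARGS); // emit the incoming tuple unchanged
    }

    @Override
    public void operate(FlowProcess flowProcess, FunctionCall<Void> functionCall) {
        String userId = functionCall.getArguments().getString("user_id");

        // Convenient "counter-like" call, available anywhere I have a FlowProcess,
        // but each distinct name becomes a separate Hadoop counter.
        flowProcess.increment("user-stats", userId + ":records_seen", 1);

        functionCall.getOutputCollector().add(functionCall.getArguments());
    }
}
```

I would like to keep this kind of call-it-anywhere ergonomics, just without hitting the Hadoop counter limit.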
Thanks in advance.