I wrote a Cascading 1.2 program that processes data from a sensor network as follows:

  1. Read CSV files with 3 columns: millisecond timestamp, event type (one of sensor data, battery level, or sensor power state), and event body
  2. Round the millisecond timestamp up to the nearest second and GroupBy on this value
  3. GroupBy on the event type
  4. Write the output to a TemplateTap with the following path template: "{rounded timestamp}/{event type}/"
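To make steps 2 and 4 concrete, here is a minimal plain-Java sketch of how one CSV line maps to an output path. It is not the actual Cascading job: the class and method names and the event type label `battery_level` are hypothetical, and it assumes that "round up" means ceiling to the next whole second, that the rounded value is written as epoch seconds, and that the first two fields contain no quoted commas.

```java
final class TemplatePathSketch {

    // Hypothetical helper: round an epoch-millisecond timestamp UP to whole seconds.
    // Assumes a non-negative timestamp; an exact multiple of 1000 is left unchanged.
    static long roundUpToSecond(long timestampMillis) {
        return (timestampMillis + 999L) / 1000L;
    }

    // Hypothetical helper: build "{rounded timestamp}/{event type}/" for one CSV line.
    // The limit of 3 keeps any commas inside the event body intact.
    static String outputPath(String csvLine) {
        String[] cols = csvLine.split(",", 3);
        long timestampMillis = Long.parseLong(cols[0].trim());
        String eventType = cols[1].trim();
        return roundUpToSecond(timestampMillis) + "/" + eventType + "/";
    }

    public static void main(String[] args) {
        // Made-up sample line, for illustration only.
        System.out.println(outputPath("1358294400001,battery_level,87"));
    }
}
```

Every distinct (second, event type) pair therefore becomes its own output directory.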

My program runs fine when the amount of log data is small (~300MB). When I run it on an EMR cluster with the actual volume of log data produced by the sensor network (~200GB/day), the reducers keep failing with the following message: 'Task attempt_201301160001_0003_r_00000X_0 failed to report status for 602 seconds. Killing!'

If I make the template in the template tap static (like "output" instead of "{rounded timestamp}/{event type}/"), the job completes in 3 hours without issues.

Hence, the problem seems to be in the TemplateTap.

Perhaps it cannot handle so many dynamic paths? (My understanding, though, was that with default parameters it keeps only ~300 of them open/active at any time.)
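For a sense of scale, here is a rough back-of-the-envelope sketch of how many distinct paths one day of data could produce with this template. It is an upper-bound estimate only: it assumes every second of the day contains all 3 event types, and the ~300 figure is just my understanding of the default stated above. The class and method names are hypothetical.

```java
final class DistinctPathEstimate {

    // Upper bound on distinct "{rounded timestamp}/{event type}/" paths per day,
    // assuming every second of the day sees every event type at least once.
    static long maxPathsPerDay(int eventTypes) {
        long secondsPerDay = 24L * 60L * 60L; // 86,400
        return secondsPerDay * eventTypes;
    }

    public static void main(String[] args) {
        long paths = maxPathsPerDay(3);  // 3 event types, per the CSV description
        long assumedOpenLimit = 300L;    // assumed default number of open paths
        System.out.println("distinct paths per day (upper bound): " + paths);
        System.out.println("multiple of assumed open limit: " + (paths / assumedOpenLimit));
    }
}
```

Under these assumptions that is up to 259,200 output directories per day, several hundred times the number I believe are kept open at once.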

I did not pass any parameters to the TemplateTap other than the path template itself, so all other parameters are at their defaults.

What can I do to make the job work with the "{rounded timestamp}/{event type}/" template?
