I wrote a Cascading 1.2 program that processes data from a sensor network as follows:
- Read CSV files with 3 columns: millisecond timestamp, event type (one of sensor data, battery level, or sensor power state), and event body
- Round the millisecond timestamp to the nearest second and GroupBy on this value
- GroupBy on the event type
- Write the output to a TemplateTap with the following template: "{rounded timestamp}/{event type}/"
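
To make the path derivation concrete, here is a minimal plain-Java sketch (no Cascading dependency) of how one record maps to one output directory. The class and method names, the sample record, and the event type name "battery_level" are made up for illustration. It also assumes the rounded value is expressed in whole seconds and that timestamps are non-negative; neither is confirmed by my actual job code.

```java
import java.util.Locale;

// Hypothetical helper, not part of Cascading: shows how a single CSV record
// is turned into a "{rounded timestamp}/{event type}/" directory name.
final class TemplatePathSketch {

    // Round a millisecond timestamp to the nearest whole second (halves round up).
    // Assumes a non-negative epoch timestamp.
    static long roundToNearestSecond(long timestampMillis) {
        return (timestampMillis + 500L) / 1000L;
    }

    // Build the relative output directory for one record.
    static String outputPath(long timestampMillis, String eventType) {
        return String.format(Locale.ROOT, "%d/%s/",
                roundToNearestSecond(timestampMillis), eventType);
    }

    public static void main(String[] args) {
        // Made-up sample record: timestamp, event type, event body.
        String line = "1358294400499,battery_level,87";
        String[] columns = line.split(",", 3);
        System.out.println(outputPath(Long.parseLong(columns[0]), columns[1]));
    }
}
```

Every distinct (second, event type) pair therefore produces its own directory.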
My program runs fine when the amount of log data is small (~300MB), but when I run it on an EMR cluster with the actual volume of log data produced by the sensor network (~200GB/day), the reducers keep failing with the following message: 'Task attempt_201301160001_0003_r_00000X_0 failed to report status for 602 seconds. Killing!'
If I make the template in the template tap static (like "output" instead of "{rounded timestamp}/{event type}/"), the job completes in 3 hours without issues.
Hence, the problem seems to be in the TemplateTap.
Perhaps it is not able to handle so many dynamic paths? (But my understanding was that it keeps ~300 of them open/active at any time with the default parameters.)
I did not pass any parameters to the template tap except the path template itself - so all other parameters are at default.
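
To put a rough number on "so many dynamic paths", here is a back-of-the-envelope sketch. It assumes that events occur in every second of the day and that all three event types show up in each second, which is an upper bound rather than a measurement from my data. The class and method names are hypothetical.

```java
// Hypothetical estimate, not measured from the real job: upper bound on the
// number of distinct "{rounded timestamp}/{event type}/" directories per day.
final class PathCountEstimate {

    // One directory per (second, event type) pair, assuming every pair occurs.
    static long distinctPathsPerDay(int eventTypes) {
        long secondsPerDay = 24L * 60L * 60L;
        return secondsPerDay * eventTypes;
    }

    public static void main(String[] args) {
        long paths = distinctPathsPerDay(3);  // three event types in my data
        int openAtOnce = 300;                 // the ~300 figure mentioned above
        System.out.println(paths + " possible paths per day vs. ~"
                + openAtOnce + " kept open at once");
    }
}
```

Under those assumptions a single day could map to as many as 259,200 directories, far more than the ~300 I believe are kept open at a time.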
What can I do to make the job work with the "{rounded timestamp}/{event type}/" template?