I have a hadoop reduce task attempt that will never fail or get completed unless I fail/kill it manually.
The problem surfaces when the task tracker node (due to network issues that I am still investigating) looses connectivity with other task trackers/data nodes, but not with the job tracker.
Basically, the reduce task is not able to fetch the necessary data from other data nodes due to time out issues and it blacklists them. So far, so good, the blacklisting is expected and needed, the problem is that it will keep retry the same blacklisted hosts for hours (honoring what it seems to be an exponential back-off algorithm) until I manually kill it. Latest long running task had been >9 hours retrying.
I see hundreds of messages like these in the log:
2013-09-09 22:34:47,251 WARN org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201309091958_0004_r_000044_0.1): attempt_201309091958_0004_r_000044_0 copy failed: attempt_201309091958_0004_m_001100_0 from X.X.X.X
2013-09-09 22:34:47,252 WARN org.apache.hadoop.mapred.ReduceTask (MapOutputCopier attempt_201309091958_0004_r_000044_0.1): java.net.SocketTimeoutException: connect timed out
Is there any way or setting to specify that after n retries or seconds the task should fail on and get restarted on its own in another task tracker host?
These are some of the relevant reduce/timeout Hadoop cluster parameters I have set in my cluster:
<property><name>mapreduce.reduce.shuffle.connect.timeout</name><value>180000</value></property>
<property><name>mapreduce.reduce.shuffle.read.timeout</name><value>180000</value></property>
<property><name>mapreduce.reduce.shuffle.maxfetchfailures</name><value>10</value></property>
<property><name>mapred.task.timeout</name><value>600000</value></property>
<property><name>mapred.jobtracker.blacklist.fault-timeout-window</name><value>180</value></property>
<property><name>mapred.healthChecker.script.timeout</name><value>600000</value></property>
BTW, this job runs on an AWS EMR cluster (Hadoop version: 0.20.205).
Thanks in advance.