24

I wrote a MapReduce job to extract some information from a dataset of users' movie ratings. There are about 250K users and about 300K movies. The map output is <user, <movie, rating>*> and <movie, <user, rating>*>, and I process these pairs in the reducer.

But when I run the job, the mappers complete as expected, while the reducers always fail with:

Task attempt_* failed to report status for 600 seconds.

I know this happens when a task fails to update its status, so I added a call to context.progress() in my code, like this:

int count = 0;
while (values.hasNext()) {
  if (count++ % 100 == 0) {
    context.progress();  // report liveness so the framework does not kill the task
  }
  /* other code here */
}
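For context, here is what a complete reducer with periodic progress reporting might look like using the new (org.apache.hadoop.mapreduce) API. This is only a sketch: the key/value types (Text here) and class name are assumptions, since the question does not show them, and it requires the Hadoop jars on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer sketch: reports progress every 100 values so the
// task does not hit the 600-second status timeout.
public class RatingReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    // In the new API, values is an Iterable, so a for-each loop is idiomatic.
    for (Text value : values) {
      if (count++ % 100 == 0) {
        context.progress();  // reset the task's status timer
      }
      // other processing here
    }
  }
}
```

Note that in the old (org.apache.hadoop.mapred) API, values is an Iterator and progress is reported via the Reporter argument instead of a Context.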

Unfortunately, this does not help; many reduce tasks still fail.

Here is the log:

Task attempt_201104251139_0295_r_000014_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000012_1, Status : FAILED
Task attempt_201104251139_0295_r_000012_1 failed to report status for 600 seconds. Killing!
11/05/03 10:09:09 INFO mapred.JobClient: Task Id : attempt_201104251139_0295_r_000006_1, Status : FAILED
Task attempt_201104251139_0295_r_000006_1 failed to report status for 600 seconds. Killing!

By the way, the error happens in the reduce copy phase; the log says:

reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385

Thanks for the help.

Leif Wickland
  • 3,693
  • 26
  • 43
user572138
  • 463
  • 4
  • 6
  • 13
  • You can consider issuing the context.progress() call more frequently. Your code should work as long as the time between context.progress() calls does not exceed the limit (600 seconds in your configuration). – cabad Jul 18 '13 at 22:09

5 Answers

26

The easiest way is to set this configuration parameter:

<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value> <!-- 30 minutes -->
</property>

in mapred-site.xml

wlk
  • 5,695
  • 6
  • 54
  • 72
  • Thanks for your answer. Still, I am not sure about one thing. The log says "reduce > copy (28 of 31 at 26.69 MB/s) > :Lost task tracker: tracker_hadoop-56:localhost/127.0.0.1:34385". What does this mean? – user572138 May 03 '11 at 06:15
  • Haha, yeah, that only works around the problem. It says that your TaskTracker has vanished/crashed, which can have various causes. Have a look into the logs. I assume you ran out of file descriptors. – Thomas Jungblut May 03 '11 at 09:13
  • 2
    This isn't actually a fix. This is a work-around that will run into the same problem if the task is scaled up. – Robert Rapplean Feb 18 '16 at 17:44
15

Another easy way is to set it in your job configuration inside the program:

 Configuration conf = new Configuration();
 long milliSeconds = 1000 * 60 * 60; // 1 hour; the default is 600000 (10 minutes)
 conf.setLong("mapred.task.timeout", milliSeconds);

Before setting it, check the job file (job.xml) in the JobTracker GUI for the correct property name, whether it is mapred.task.timeout or mapreduce.task.timeout. While the job is running, check the job file again to verify that the property has been changed to the value you set.
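Since the property name depends on the Hadoop version, a defensive sketch is to set both names before submitting the job. This assumes the new-API Job class and Hadoop jars on the classpath; the job name is hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: set the task timeout under both the old and new property names,
// then attach the configuration to the job before submission.
Configuration conf = new Configuration();
long milliSeconds = 1000 * 60 * 60;                    // 1 hour; default is 600000
conf.setLong("mapred.task.timeout", milliSeconds);     // pre-0.21 property name
conf.setLong("mapreduce.task.timeout", milliSeconds);  // newer property name
Job job = Job.getInstance(conf, "movie-ratings");      // hypothetical job name
```

Setting an obsolete property name is harmless (it is simply ignored), so setting both is a safe way to cover either version.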

Moncy Augustin
  • 151
  • 1
  • 3
  • This approach is probably better, since you might want your regular jobs to timeout at 10 minutes. Configure special needs when needed, and not in the general case. – Alex A. Oct 14 '13 at 20:25
11

In newer versions, the name of the parameter has been changed to mapreduce.task.timeout as described in this link (search for task.timeout). In addition, you can also disable this timeout as described in the above link:

The number of milliseconds before a task will be terminated if it neither reads an input, writes an output, nor updates its status string. A value of 0 disables the timeout.

Below is an example setting in the mapred-site.xml:

<property>
  <name>mapreduce.task.timeout</name>
  <value>0</value> <!-- A value of 0 disables the timeout -->
</property>
keelar
  • 5,814
  • 7
  • 40
  • 79
3

If you have a Hive query that is timing out, you can set the above configurations as follows:

set mapred.tasktracker.expiry.interval=1800000;

set mapred.task.timeout= 1800000;

Animesh Raj Jha
  • 2,704
  • 1
  • 21
  • 25
1

From https://issues.apache.org/jira/browse/HADOOP-1763

the causes might be:

1. TaskTrackers run the maps successfully.
2. Map outputs are served by Jetty servers on the TTs.
3. All the reduce tasks connect to all the TTs where maps ran.
4. Since there are lots of reducers wanting to connect to the map output servers, the Jetty servers run out of threads (default 40).
5. The TaskTrackers continue to make periodic heartbeats to the JT, so they are not marked dead, but their Jetty servers are (temporarily) down.
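If Jetty thread exhaustion is indeed the cause, one mitigation is to raise the TaskTracker's HTTP serving threads above the default of 40 via tasktracker.http.threads in mapred-site.xml. The value below is an illustrative choice, not a recommendation:

```xml
<property>
  <name>tasktracker.http.threads</name>
  <value>100</value> <!-- default is 40; raise when many reducers fetch map output concurrently -->
</property>
```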
Nishu Tayal
  • 20,106
  • 8
  • 49
  • 101
Bohdan
  • 16,531
  • 16
  • 74
  • 68