FAULT TOLERANCE IN HADOOP
If the JobTracker does not receive any heartbeat from a TaskTracker for a specified period of
time (10 minutes by default), it concludes that the worker associated with that TaskTracker has
failed. When this happens, the JobTracker must reschedule all pending and in-progress tasks to
another TaskTracker, because the intermediate data belonging to the failed TaskTracker may no
longer be available.
Completed map tasks must also be rescheduled if they belong to incomplete jobs, because the
intermediate results residing on the failed TaskTracker's file system may no longer be accessible
to the reduce tasks.
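The detection itself amounts to a periodic expiry check over the last heartbeat timestamps. The following is a minimal sketch of that check; the class and field names are illustrative rather than Hadoop's own, and it assumes the 10-minute default that Hadoop 1.x exposes through the mapred.tasktracker.expiry.interval property.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the tracker-expiry check; names are ours, not Hadoop's.
class TrackerExpiryMonitor {
    // 10 minutes by default (mapred.tasktracker.expiry.interval in Hadoop 1.x).
    static final long TRACKER_EXPIRY_MS = 10 * 60 * 1000L;

    // TaskTracker name -> timestamp of its last heartbeat.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    void recordHeartbeat(String trackerName) {
        lastHeartbeat.put(trackerName, System.currentTimeMillis());
    }

    // Called periodically: any tracker silent for longer than the expiry
    // interval is declared lost and its tasks must be rescheduled.
    void expireLostTrackers() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TRACKER_EXPIRY_MS) {
                lastHeartbeat.remove(e.getKey());
                rescheduleTasksOf(e.getKey()); // pending, in-progress, and
                                               // completed maps of live jobs
            }
        }
    }

    private void rescheduleTasksOf(String trackerName) {
        /* hand the tracker's tasks back to the scheduler (omitted) */
    }
}
```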
A TaskTracker can also be blacklisted. In this case, the blacklisted TaskTracker remains in
communication with the JobTracker, but no tasks are assigned to the corresponding worker. When a
given number of tasks (four by default) belonging to a specific job fail on a TaskTracker, the
system considers that a fault has occurred.
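A minimal sketch of this per-job fault counting follows; the class and method names are illustrative, with only the threshold of four failures (the Hadoop 1.x mapred.max.tracker.failures default) taken from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of per-job fault counting on a single TaskTracker; once a job
// accumulates the threshold of failed tasks here, the tracker is
// blacklisted for new work but keeps sending heartbeats.
class TrackerFaultRecord {
    static final int MAX_TASK_FAILURES_PER_JOB = 4; // Hadoop 1.x default

    private final Map<String, Integer> failuresPerJob = new HashMap<>();
    private final Set<String> blacklistedForJobs = new HashSet<>();

    void taskFailed(String jobId) {
        int failures = failuresPerJob.merge(jobId, 1, Integer::sum);
        if (failures >= MAX_TASK_FAILURES_PER_JOB) {
            blacklistedForJobs.add(jobId);
        }
    }

    // A blacklisted tracker still heartbeats, but the scheduler skips it.
    boolean mayReceiveTasksFor(String jobId) {
        return !blacklistedForJobs.contains(jobId);
    }
}
```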
Some of the relevant information in the heartbeats the TaskTracker sends to the JobTracker is:
● The TaskTrackerStatus
● Whether the TaskTracker has been restarted
● Whether this is its first heartbeat
● Whether the node requires more tasks to execute
The TaskTrackerStatus contains information about the worker managed by the TaskTracker, such as
available virtual and physical memory and information about the CPU.
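These parameters map directly onto the RPC between the two daemons. The sketch below mirrors the shape of the heartbeat call in Hadoop 1.x's InterTrackerProtocol, with the payload types reduced to stubs.

```java
import java.io.IOException;

// Stubs hinting at the real payloads exchanged in a heartbeat.
class TaskTrackerStatus { /* host, virtual/physical memory, CPU info, slots */ }
class HeartbeatResponse { /* response id, next heartbeat interval, actions */ }

// Sketch of the TaskTracker -> JobTracker heartbeat RPC, mirroring the
// shape of Hadoop 1.x's InterTrackerProtocol.
interface InterTrackerProtocolSketch {
    HeartbeatResponse heartbeat(
        TaskTrackerStatus status,  // state of the worker the tracker manages
        boolean restarted,         // the tracker was restarted since last contact
        boolean initialContact,    // this is the tracker's first heartbeat
        boolean acceptNewTasks,    // the node requires more tasks to execute
        short responseId           // sequence number echoed in the response
    ) throws IOException;
}
```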
The JobTracker keeps the faulty TaskTracker on the blacklist, together with the last heartbeat
received from it. So, when a new restarted/first heartbeat is received, the JobTracker can use
this information to decide whether to restart the TaskTracker or to remove it from the
blacklist.
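A rough sketch of that decision follows; all method names here are illustrative stand-ins for the JobTracker's actual bookkeeping.

```java
// Illustrative sketch of the restarted/first-heartbeat decision.
class BlacklistDecision {
    void onHeartbeat(String tracker, boolean restarted, boolean initialContact) {
        if (restarted) {
            // A restarted tracker starts from a clean slate: drop its
            // fault record and take it off the blacklist.
            removeFromBlacklist(tracker);
        } else if (initialContact && lastHeartbeatKnownFor(tracker)) {
            // First contact from a tracker the JobTracker still remembers:
            // tell it to reinitialize before it is given new work.
            requestTrackerRestart(tracker);
        }
    }

    private void removeFromBlacklist(String tracker) { /* omitted */ }
    private boolean lastHeartbeatKnownFor(String tracker) { return false; }
    private void requestTrackerRestart(String tracker) { /* omitted */ }
}
```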
After that, the status of the TaskTracker is updated in the JobTracker and a HeartbeatResponse is
created. This HeartbeatResponse contains the next actions to be taken by the TaskTracker.
If there are tasks to perform, the TaskTracker has requested new tasks (this is a parameter of the
heartbeat), and it is not in the blacklist, then cleanup tasks and setup tasks are created (the
cleanup/setup mechanisms have not been investigated further yet). If there are no cleanup or
setup tasks to perform, the JobTracker obtains new tasks. Each available task is encapsulated in
a LaunchTaskAction, and then the JobTracker also looks for:
- Tasks to be killed
- Jobs to kill/clean up
- Tasks whose output has not yet been saved
All these actions, if they apply, are added to the list of actions to be sent in the HeartbeatResponse.
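Putting these steps together, the response assembly looks roughly like the sketch below. The action class names match Hadoop 1.x's TaskTrackerAction subclasses (LaunchTaskAction, KillTaskAction, KillJobAction, CommitTaskAction), but the lookup helpers are illustrative placeholders for the JobTracker's real bookkeeping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Action names mirror Hadoop 1.x's TaskTrackerAction subclasses.
abstract class TaskTrackerAction {}
class LaunchTaskAction extends TaskTrackerAction { final String taskId; LaunchTaskAction(String t) { taskId = t; } }
class KillTaskAction   extends TaskTrackerAction { final String taskId; KillTaskAction(String t) { taskId = t; } }
class KillJobAction    extends TaskTrackerAction { final String jobId;  KillJobAction(String j)  { jobId = j; } }
class CommitTaskAction extends TaskTrackerAction { final String taskId; CommitTaskAction(String t) { taskId = t; } }

// Sketch of assembling the actions carried back in a HeartbeatResponse.
class HeartbeatResponseBuilder {
    List<TaskTrackerAction> buildActions(String tracker,
                                         boolean acceptNewTasks,
                                         boolean blacklisted) {
        List<TaskTrackerAction> actions = new ArrayList<>();
        if (acceptNewTasks && !blacklisted) {
            // Setup/cleanup tasks take priority; only if none exist does
            // the JobTracker ask the scheduler for regular tasks.
            List<String> tasks = setupAndCleanupTasksFor(tracker);
            if (tasks.isEmpty()) {
                tasks = newTasksFor(tracker);
            }
            for (String t : tasks) {
                actions.add(new LaunchTaskAction(t));
            }
        }
        // Independently of new work, collect the other pending actions.
        for (String t : tasksToKillOn(tracker))   actions.add(new KillTaskAction(t));
        for (String j : jobsToKillOn(tracker))    actions.add(new KillJobAction(j));
        for (String t : tasksToCommitOn(tracker)) actions.add(new CommitTaskAction(t));
        return actions;
    }

    // Placeholder lookups standing in for the JobTracker's real state.
    private List<String> setupAndCleanupTasksFor(String tracker) { return Collections.emptyList(); }
    private List<String> newTasksFor(String tracker) { return Collections.emptyList(); }
    private List<String> tasksToKillOn(String tracker) { return Collections.emptyList(); }
    private List<String> jobsToKillOn(String tracker) { return Collections.emptyList(); }
    private List<String> tasksToCommitOn(String tracker) { return Collections.emptyList(); }
}
```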
The fault tolerance mechanisms implemented in Hadoop are limited to reassigning tasks when a given
execution fails. In this situation, two scenarios are supported:
1. If a task assigned to a given TaskTracker fails, the heartbeat is used to notify the
JobTracker, which will reassign the task to another node if possible.
2. If a TaskTracker fails, the JobTracker will notice the faulty situation because it stops
receiving heartbeats from that TaskTracker. The JobTracker will then reassign the tasks of the
failed TaskTracker to another TaskTracker.
The JobTracker is also a single point of failure: if it fails, the whole execution fails.
The main benefits of the standard fault tolerance approach implemented in Hadoop are its simplicity and the fact that it seems to work well in local clusters. However, the standard approach is not enough for large distributed infrastructures: the distance between nodes may be too great, and the time lost in reassigning a task may slow down the system.