FAULT TOLERANCE IN HADOOP
If the JobTracker does not receive any heartbeat from a TaskTracker for a specified period of
time (10 minutes by default), it concludes that the worker associated with that TaskTracker has
failed. When this happens, the JobTracker must reschedule all pending and in-progress tasks to
another TaskTracker, because the intermediate data belonging to the failed TaskTracker may no
longer be available.
Completed map tasks must also be rescheduled if they belong to incomplete jobs, because the
intermediate results residing on the failed TaskTracker's file system may no longer be accessible
to the reduce tasks.
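The detection itself amounts to a periodic expiry check over the last heartbeat timestamps. The following is a minimal sketch of that check; the class and field names are illustrative rather than Hadoop's own, and it assumes the 10-minute default that Hadoop 1.x exposes through the mapred.tasktracker.expiry.interval property.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the tracker-expiry check; names are ours, not Hadoop's.
class TrackerExpiryMonitor {
    // 10 minutes by default (mapred.tasktracker.expiry.interval in Hadoop 1.x).
    static final long TRACKER_EXPIRY_MS = 10 * 60 * 1000L;

    // TaskTracker name -> timestamp of its last heartbeat.
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    void recordHeartbeat(String trackerName) {
        lastHeartbeat.put(trackerName, System.currentTimeMillis());
    }

    // Called periodically: any tracker silent for longer than the expiry
    // interval is declared lost and its tasks must be rescheduled.
    void expireLostTrackers() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
            if (now - e.getValue() > TRACKER_EXPIRY_MS) {
                lastHeartbeat.remove(e.getKey());
                rescheduleTasksOf(e.getKey()); // pending, in-progress, and
                                               // completed maps of live jobs
            }
        }
    }

    private void rescheduleTasksOf(String trackerName) {
        /* hand the tracker's tasks back to the scheduler (omitted) */
    }
}
```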
A TaskTracker can also be blacklisted. In this case, the blacklisted TaskTracker remains in
communication with the JobTracker, but no tasks are assigned to the corresponding worker. When a
given number of tasks (four by default) belonging to a specific job fail on a TaskTracker, the
system considers that a fault has occurred.
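A minimal sketch of this per-job fault counting follows; the class and method names are illustrative, with only the threshold of four failures (the Hadoop 1.x mapred.max.tracker.failures default) taken from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of per-job fault counting on a single TaskTracker; once a job
// accumulates the threshold of failed tasks here, the tracker is
// blacklisted for new work but keeps sending heartbeats.
class TrackerFaultRecord {
    static final int MAX_TASK_FAILURES_PER_JOB = 4; // Hadoop 1.x default

    private final Map<String, Integer> failuresPerJob = new HashMap<>();
    private final Set<String> blacklistedForJobs = new HashSet<>();

    void taskFailed(String jobId) {
        int failures = failuresPerJob.merge(jobId, 1, Integer::sum);
        if (failures >= MAX_TASK_FAILURES_PER_JOB) {
            blacklistedForJobs.add(jobId);
        }
    }

    // A blacklisted tracker still heartbeats, but the scheduler skips it.
    boolean mayReceiveTasksFor(String jobId) {
        return !blacklistedForJobs.contains(jobId);
    }
}
```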
Some of the relevant information in the heartbeats the TaskTracker sends to the JobTracker is:
● The TaskTrackerStatus
● Whether the TaskTracker has been restarted
● Whether this is its first heartbeat
● Whether the node requires more tasks to execute
The TaskTrackerStatus contains information about the worker managed by the TaskTracker, such as
available virtual and physical memory and information about the CPU.
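These parameters map directly onto the RPC between the two daemons. The sketch below mirrors the shape of the heartbeat call in Hadoop 1.x's InterTrackerProtocol, with the payload types reduced to stubs.

```java
import java.io.IOException;

// Stubs hinting at the real payloads exchanged in a heartbeat.
class TaskTrackerStatus { /* host, virtual/physical memory, CPU info, slots */ }
class HeartbeatResponse { /* response id, next heartbeat interval, actions */ }

// Sketch of the TaskTracker -> JobTracker heartbeat RPC, mirroring the
// shape of Hadoop 1.x's InterTrackerProtocol.
interface InterTrackerProtocolSketch {
    HeartbeatResponse heartbeat(
        TaskTrackerStatus status,  // state of the worker the tracker manages
        boolean restarted,         // the tracker was restarted since last contact
        boolean initialContact,    // this is the tracker's first heartbeat
        boolean acceptNewTasks,    // the node requires more tasks to execute
        short responseId           // sequence number echoed in the response
    ) throws IOException;
}
```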
The JobTracker keeps the faulty TaskTracker on the blacklist, together with the last heartbeat
received from it. So, when a new restarted/first heartbeat is received, the JobTracker can use
this information to decide whether to restart the TaskTracker or to remove it from the
blacklist.
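A rough sketch of that decision follows; all method names here are illustrative stand-ins for the JobTracker's actual bookkeeping.

```java
// Illustrative sketch of the restarted/first-heartbeat decision.
class BlacklistDecision {
    void onHeartbeat(String tracker, boolean restarted, boolean initialContact) {
        if (restarted) {
            // A restarted tracker starts from a clean slate: drop its
            // fault record and take it off the blacklist.
            removeFromBlacklist(tracker);
        } else if (initialContact && lastHeartbeatKnownFor(tracker)) {
            // First contact from a tracker the JobTracker still remembers:
            // tell it to reinitialize before it is given new work.
            requestTrackerRestart(tracker);
        }
    }

    private void removeFromBlacklist(String tracker) { /* omitted */ }
    private boolean lastHeartbeatKnownFor(String tracker) { return false; }
    private void requestTrackerRestart(String tracker) { /* omitted */ }
}
```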
After that, the status of the TaskTracker is updated in the JobTracker and a HeartbeatResponse is
created. This HeartbeatResponse contains the next actions to be taken by the TaskTracker.
If there are tasks to perform, the TaskTracker has requested new tasks (this is a parameter of the
heartbeat), and it is not in the blacklist, then cleanup tasks and setup tasks are created (the
cleanup/setup mechanisms have not been investigated further yet). If there are no cleanup or
setup tasks to perform, the JobTracker obtains new tasks. Each available task is encapsulated in
a LaunchTaskAction, and then the JobTracker also looks for:
- Tasks to be killed
- Jobs to kill/clean up
- Tasks whose output has not yet been saved
All these actions, if they apply, are added to the list of actions to be sent in the HeartbeatResponse.
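Putting these steps together, the response assembly looks roughly like the sketch below. The action class names match Hadoop 1.x's TaskTrackerAction subclasses (LaunchTaskAction, KillTaskAction, KillJobAction, CommitTaskAction), but the lookup helpers are illustrative placeholders for the JobTracker's real bookkeeping.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Action names mirror Hadoop 1.x's TaskTrackerAction subclasses.
abstract class TaskTrackerAction {}
class LaunchTaskAction extends TaskTrackerAction { final String taskId; LaunchTaskAction(String t) { taskId = t; } }
class KillTaskAction   extends TaskTrackerAction { final String taskId; KillTaskAction(String t) { taskId = t; } }
class KillJobAction    extends TaskTrackerAction { final String jobId;  KillJobAction(String j)  { jobId = j; } }
class CommitTaskAction extends TaskTrackerAction { final String taskId; CommitTaskAction(String t) { taskId = t; } }

// Sketch of assembling the actions carried back in a HeartbeatResponse.
class HeartbeatResponseBuilder {
    List<TaskTrackerAction> buildActions(String tracker,
                                         boolean acceptNewTasks,
                                         boolean blacklisted) {
        List<TaskTrackerAction> actions = new ArrayList<>();
        if (acceptNewTasks && !blacklisted) {
            // Setup/cleanup tasks take priority; only if none exist does
            // the JobTracker ask the scheduler for regular tasks.
            List<String> tasks = setupAndCleanupTasksFor(tracker);
            if (tasks.isEmpty()) {
                tasks = newTasksFor(tracker);
            }
            for (String t : tasks) {
                actions.add(new LaunchTaskAction(t));
            }
        }
        // Independently of new work, collect the other pending actions.
        for (String t : tasksToKillOn(tracker))   actions.add(new KillTaskAction(t));
        for (String j : jobsToKillOn(tracker))    actions.add(new KillJobAction(j));
        for (String t : tasksToCommitOn(tracker)) actions.add(new CommitTaskAction(t));
        return actions;
    }

    // Placeholder lookups standing in for the JobTracker's real state.
    private List<String> setupAndCleanupTasksFor(String tracker) { return Collections.emptyList(); }
    private List<String> newTasksFor(String tracker) { return Collections.emptyList(); }
    private List<String> tasksToKillOn(String tracker) { return Collections.emptyList(); }
    private List<String> jobsToKillOn(String tracker) { return Collections.emptyList(); }
    private List<String> tasksToCommitOn(String tracker) { return Collections.emptyList(); }
}
```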
The fault tolerance mechanisms implemented in Hadoop are limited to reassigning tasks when a given
execution fails. In this situation, two scenarios are supported:
1. If a task assigned to a given TaskTracker fails, the heartbeat is used to notify the
JobTracker, which will reassign the task to another node if possible.
2. If a TaskTracker fails, the JobTracker will notice the faulty situation because it stops
receiving heartbeats from that TaskTracker. The JobTracker will then reassign the tasks of the
failed TaskTracker to another TaskTracker.
The JobTracker is also a single point of failure: if it fails, the whole execution fails.
The main benefits of the standard fault tolerance approach implemented in Hadoop are its simplicity and the fact that it seems to work well in local clusters. However, the standard approach is not enough for large distributed infrastructures: the distance between nodes may be too great, and the time lost in reassigning a task may slow down the system.