
In our Hadoop setup, when a DataNode crashes (or Hadoop stops responding on that node), the reduce task fails because it cannot read from the failed node (exception below). I thought Hadoop handled DataNode failures transparently; that is one of the main reasons Hadoop exists. Is anybody facing a similar problem with their clusters? If you have a solution, please let me know.

    java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(Unknown Source)
    at java.io.BufferedInputStream.fill(Unknown Source)
    at java.io.BufferedInputStream.read1(Unknown Source)
    at java.io.BufferedInputStream.read(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
    at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1547)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1483)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1391)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1302)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1234)
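
For reference, here is a quick sketch (assuming the 0.20-era HDFS client API is on the classpath; the class name is just a placeholder) that prints the NameNode's view of each DataNode, roughly what `hadoop dfsadmin -report` shows, so you can check whether the crashed node has actually been marked dead:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    // Prints the NameNode's current view of every DataNode so you can confirm
    // whether the crashed node has actually been marked dead.
    public class DataNodeReport {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            if (!(fs instanceof DistributedFileSystem)) {
                System.err.println("Default file system is not HDFS");
                return;
            }
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                // getDatanodeReport() includes capacity, last contact and admin state.
                System.out.println(node.getDatanodeReport());
                System.out.println();
            }
        }
    }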

Boolean
  • "The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead DataNode is not available to HDFS any more." [HDFS Architecture](http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html) – David Schwartz Nov 28 '11 at 20:43
  • Where are those logs from? The DataNode or the NameNode? – wlk Nov 28 '11 at 21:07
  • It's from the reduce task, look at the stacktrace. – Thomas Jungblut Nov 28 '11 at 21:08
  • What is your replication factor? How many reducers did you have? How many Datanodes? – wlk Nov 28 '11 at 21:20

2 Answers


When a task of a MapReduce job fails, Hadoop will retry it on another node. You can take a look at the JobTracker web UI (:50030/jobtracker.jsp) to see the blacklisted nodes (nodes that have problems with their keep-alive), or drill into a running/completed job to see the number of killed tasks/retries as well as dead nodes, decommissioned nodes, etc.
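
If a single lost node keeps killing long jobs, one knob worth checking is the per-task retry limit. A minimal sketch using the classic org.apache.hadoop.mapred.JobConf API (the class name and the value 8 are just placeholders; the default is 4 attempts per task):

    import org.apache.hadoop.mapred.JobConf;

    // MRv1 driver fragment: give each task a few extra attempts so that a lost
    // TaskTracker does not immediately fail the whole job while the JobTracker
    // is still figuring out that the node should be blacklisted.
    public class RetryTuningExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setMaxMapAttempts(8);       // mapred.map.max.attempts, default 4
            conf.setMaxReduceAttempts(8);    // mapred.reduce.max.attempts, default 4
            System.out.println("map attempts    = " + conf.getMaxMapAttempts());
            System.out.println("reduce attempts = " + conf.getMaxReduceAttempts());
        }
    }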

Arnon Rotem-Gal-Oz

I've had a similar problem on a cluster where tasks failed on some nodes due to out-of-memory problems. They were definitely restarted on other nodes. The computation eventually failed because it was badly designed, causing every node to run out of memory, until the threshold for cancelling the job was finally reached.
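
As a hedged sketch only (none of this is from the cluster above), the two MRv1 knobs that usually matter in this situation are the child JVM heap and the per-job tolerance for failed tasks; the concrete values are purely illustrative:

    import org.apache.hadoop.mapred.JobConf;

    // Illustrative values only: a bigger child JVM heap, plus some tolerance for
    // failed tasks so that a few sick nodes do not fail the whole job. Neither
    // helps if the job design makes every node run out of memory.
    public class OomTuningExample {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.set("mapred.child.java.opts", "-Xmx1024m"); // heap for each map/reduce child JVM
            conf.setMaxMapTaskFailuresPercent(10);    // default 0: one exhausted task fails the job
            conf.setMaxReduceTaskFailuresPercent(10);
            System.out.println(conf.get("mapred.child.java.opts"));
        }
    }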

Tudor