
I have a Spark application in Java running on AWS EMR. I have implemented an auto-scaling policy based on the available YARN memory. For jobs that require more memory, EMR scales the cluster up to 1 + 8 nodes.
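To give an idea of what I mean by an auto-scaling policy on available YARN memory: a scale-out rule of roughly this shape can be attached with the AWS SDK for Java v1. This is only a sketch; the cluster ID, instance-group ID, threshold, and cooldown below are placeholders, not my exact values.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class AttachMemoryScalingPolicy {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // Add one core/task node whenever available YARN memory drops below a threshold.
        ScalingRule scaleOutOnLowMemory = new ScalingRule()
                .withName("scale-out-on-low-yarn-memory")
                .withAction(new ScalingAction()
                        .withSimpleScalingPolicyConfiguration(new SimpleScalingPolicyConfiguration()
                                .withAdjustmentType(AdjustmentType.CHANGE_IN_CAPACITY)
                                .withScalingAdjustment(1)
                                .withCoolDown(300)))
                .withTrigger(new ScalingTrigger()
                        .withCloudWatchAlarmDefinition(new CloudWatchAlarmDefinition()
                                .withMetricName("YARNMemoryAvailablePercentage")
                                .withComparisonOperator(ComparisonOperator.LESS_THAN)
                                .withThreshold(15.0)          // placeholder threshold
                                .withEvaluationPeriods(1)
                                .withPeriod(300)
                                .withStatistic(Statistic.AVERAGE)
                                .withUnit(Unit.PERCENT)));

        // Cap the instance group at 8 nodes (plus the master), as described above.
        AutoScalingPolicy policy = new AutoScalingPolicy()
                .withConstraints(new ScalingConstraints().withMinCapacity(1).withMaxCapacity(8))
                .withRules(scaleOutOnLowMemory);

        emr.putAutoScalingPolicy(new PutAutoScalingPolicyRequest()
                .withClusterId("j-XXXXXXXXXXXX")        // placeholder cluster id
                .withInstanceGroupId("ig-XXXXXXXXXXXX") // placeholder core instance group id
                .withAutoScalingPolicy(policy));
    }
}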

After a certain point in my job, I keep getting the error below; it repeats for hours until I terminate the cluster manually.

java.io.IOException: All datanodes [DatanodeInfoWithStorage[<i.p>:50010,DS-4e7690c7-5946-49c5-b203-b5166c2ff58d,DISK]] are bad. Aborting...
at org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1531)
at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1465)
at org.apache.hadoop.hdfs.DataStreamer.processDatanodeError(DataStreamer.java:1237)
at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:657)

This error occurs on the very first worker node that was spawned. After some digging, I found that it might be caused by the ulimit. Increasing the ulimit is easy to do manually on any Linux or EC2 machine, but I cannot figure out how to do it automatically for every EMR cluster that is spawned.
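One approach I am considering is to attach a bootstrap action that raises the limit on every node when the cluster is created, since bootstrap actions also run on nodes that are added later by scaling. Below is a rough sketch using the AWS SDK for Java v1; the S3 script path, instance types, and roles are placeholders, and I have not verified that this actually fixes the error.

import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.*;

public class LaunchClusterWithUlimitBootstrap {
    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        // Bootstrap action that runs on every node before applications start.
        // The script (placeholder path) would raise the open-file limit,
        // e.g. by writing to /etc/security/limits.conf.
        BootstrapActionConfig raiseUlimit = new BootstrapActionConfig()
                .withName("raise-ulimit")
                .withScriptBootstrapAction(new ScriptBootstrapActionConfig()
                        .withPath("s3://my-bucket/bootstrap/raise-ulimit.sh"));

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("spark-cluster")
                .withReleaseLabel("emr-5.30.0")        // placeholder release label
                .withApplications(new Application().withName("Spark"))
                .withBootstrapActions(raiseUlimit)
                .withInstances(new JobFlowInstancesConfig()
                        .withMasterInstanceType("m5.xlarge")   // placeholder instance types
                        .withSlaveInstanceType("m5.xlarge")
                        .withInstanceCount(3)
                        .withKeepJobFlowAliveWhenNoSteps(true))
                .withServiceRole("EMR_DefaultRole")
                .withJobFlowRole("EMR_EC2_DefaultRole");

        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Started cluster: " + result.getJobFlowId());
    }
}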

Furthermore, I am not even 100% sure that the ulimit is causing this particular issue; it might be something else entirely. I can only confirm once I change the ulimit and check.

Mehaboob Khan

0 Answers