
I'm having repeated crashes of the HDFS DataNodes in my Cloudera cluster due to an OutOfMemoryError:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to /tmp/hdfs_hdfs-DATANODE-e26e098f77ad7085a5dbf0d369107220_pid18551.hprof ...
Heap dump file created [2487730300 bytes in 16.574 secs]
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/usr/lib64/cmf/service/common/killparent.sh"
#   Executing /bin/sh -c "/usr/lib64/cmf/service/common/killparent.sh"...
18551 TS   19 ?        00:25:37 java
Wed Aug  7 11:44:54 UTC 2019
JAVA_HOME=/usr/lib/jvm/java-openjdk
using /usr/lib/jvm/java-openjdk as JAVA_HOME
using 5 as CDH_VERSION
using /run/cloudera-scm-agent/process/3087-hdfs-DATANODE as CONF_DIR
using  as SECURE_USER
using  as SECURE_GROUP
CONF_DIR=/run/cloudera-scm-agent/process/3087-hdfs-DATANODE
CMF_CONF_DIR=/etc/cloudera-scm-agent
4194304

When analyzing the heap dump, the biggest apparent suspects are millions of instances of ScanInfo, apparently queued in the ExecutorService of the class org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.

[Screenshot: Eclipse MAT showing the dominator tree]

When I inspect the content of each ScanInfo runnable object, I don't see anything weird:

[Screenshot: contents of a ScanInfo instance]

Apart from this and a somewhat high block count in HDFS, I don't have any other information beyond the different DataNodes crashing randomly in the cluster.

Any idea why these objects keep queueing up in the DirectoryScanner thread pool?
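
For reference, the DirectoryScanner's work can be tuned in hdfs-site.xml. The snippet below is only an illustrative sketch: the property names come from the stock HDFS configuration (the throttle property only exists in newer Hadoop/CDH 5 releases), and the values are examples rather than what this cluster currently uses.

<!-- Illustrative hdfs-site.xml sketch; values are examples, not this cluster's settings. -->
<property>
  <!-- How often (in seconds) the DirectoryScanner reconciles on-disk blocks with in-memory metadata. -->
  <name>dfs.datanode.directoryscan.interval</name>
  <value>21600</value>
</property>
<property>
  <!-- Number of threads used to compile the per-volume scan reports. -->
  <name>dfs.datanode.directoryscan.threads</name>
  <value>1</value>
</property>
<property>
  <!-- Milliseconds of scanner run time allowed per second, to throttle disk I/O (if the version supports it). -->
  <name>dfs.datanode.directoryscan.throttle.limit.ms.per.sec</name>
  <value>1000</value>
</property>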

Victor
    Does this happen regularly? I had similar problems with the [DataNode Block Scanner](https://www.hadoopinrealworld.com/datanode-block-scanner/) running and having some problems with high amounts of very small files. For solving that we had two options: increase the memory (being careful about the 31 / 32 GB border where java uses bigger pointers and therefore needs *a lot* more RAM suddenly) or reduce the amount of small files (for example by removing trash / merging files). – Secespitus Aug 09 '19 at 07:21
  • Hi @Secespitus, yes, it happens quite often. We have a specific job that merges Parquet files, but we cannot get any lower than 2x the warning threshold. Regarding the memory, I am going to try to increase it to a figure closer to the one you mention (see the heap sketch after these comments). Thanks! – Victor Aug 09 '19 at 11:41
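
A rough sketch of the heap change discussed above, assuming a hand-managed hadoop-env.sh; on a Cloudera-managed cluster the equivalent is the DataNode's Java heap size in Cloudera Manager, and the -Xmx value here is purely illustrative.

# Keep the DataNode heap below ~31 GB so the JVM can keep using compressed oops
# (above ~32 GB, object pointers grow and the effective capacity gain shrinks).
export HADOOP_DATANODE_OPTS="-Xmx30g ${HADOOP_DATANODE_OPTS}"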

1 Answer


You can try running the command below once.

$ hadoop dfsadmin -finalizeUpgrade

The -finalizeUpgrade command removes the previous version of the NameNode's and DataNodes' storage directories.

MadProgrammer
  • And how does this help? I am not upgrading any node – Victor Aug 09 '19 at 19:13
  • After using this command, some of the memory should be freed, because finalize performs clean-up operations. If that does not work, then you probably have to increase the heap size. – MadProgrammer Aug 10 '19 at 06:54
  • You can assign more memory by editing the conf/mapred-site.xml file and adding the property mapred.child.java.opts with a value such as -Xmx1024m (see the sketch below). This will start the Hadoop JVMs with more heap space. – MadProgrammer Aug 10 '19 at 07:00
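
A sketch of the mapred-site.xml entry described in the comment above; the value is the example from the comment, not a tuned recommendation.

<!-- Illustrative mapred-site.xml sketch (property and value as mentioned in the comment). -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>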