On one of our platforms, HDFS namenode is shutting down with following error message every 1 or 3 days
FATAL namenode.FSEditLog (JournalSet.java:mapJournalsAndReportErrors(390)) - Error: flush failed for required journal (JournalAndStream(mgr=QJM to [<ip1>:<port>,<ip2>:<port>, etc], stream=QuorumOutputStream starting at txid 29873171))
java.io.IOException: Timed out waiting 20000ms for a quorum of nodes to respond.
at org.apache.hadoop.hdfs.qjournal.client.AsyncLoggerSet.waitForWriteQuorum(AsyncLoggerSet.java:137)
at org.apache.hadoop.hdfs.qjournal.client.QuorumOutputStream.flushAndSync(QuorumOutputStream.java:109)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:113)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:107)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:525)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:385)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:55)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:521)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:710)
at org.apache.hadoop.hdfs.server.namenode.FSEditLogAsync.run(FSEditLogAsync.java:188)
at java.lang.Thread.run(Thread.java:748)
Before this FATAL log we can see following kind of logs, on which we can detect a degradation of response time
WARN client.QuorumJournalManager (QuorumCall.java:waitFor(185)) - Waited 18014 ms (timeout=20000 ms) for a response for sendEdits. Succeeded so far: [<ip1>:<port>,<ip2>:<port>]
Have you already encountered this problem, and do you have any advices to fix it ?
We have already:
- checked that our VMs are time synchronized
- detected that when the problem occurs a burst of data on the network is in progress, without detecting the root cause yet
- checked our network devices. Except a problem on a port which goes from UP state to DOWN state quickly that we are going to fix, the network seems correct
Thanks in advance