I want to describe our case.
We have a Hadoop cluster with two name nodes: one active name node and one standby name node.
After some time we noticed that the active name node and the secondary name node had been down for 3 days.
After reviewing the name node log files, we saw that the secondary name node had been down for 1 month, and the active name node had been down for a couple of hours.
The other interesting thing we saw in the active name node log is a name node heap size problem. As some of you may know, the secondary name node supports the active name node, but it does not replace the active name node.
Therefore we guess that the active name node failed because it did not get data acknowledgements from the secondary name node, and that this may also be the reason for the high JVM heap usage on the active name node.
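
In case it helps with the diagnosis, here is a minimal sketch of how the HA state and heap usage of a name node could be read from its JMX HTTP endpoint. It assumes the standard `/jmx` servlet with the `Hadoop:service=NameNode,name=NameNodeStatus` and `java.lang:type=Memory` beans, and the Hadoop 2.x default HTTP port 50070 (9870 on Hadoop 3.x). The function names and the host name in the usage note are hypothetical and are not taken from our cluster.

```python
import json
from urllib.request import urlopen

# Bean names as exposed by the NameNode /jmx servlet (assumed to be enabled).
STATUS_BEAN = "Hadoop:service=NameNode,name=NameNodeStatus"
MEMORY_BEAN = "java.lang:type=Memory"


def summarize_jmx(payload):
    """Extract the HA state and heap usage percentage from a parsed /jmx response."""
    summary = {"state": None, "heap_used_pct": None}
    for bean in payload.get("beans", []):
        name = bean.get("name", "")
        if name == STATUS_BEAN:
            # "active" or "standby" on an HA cluster
            summary["state"] = bean.get("State")
        elif name == MEMORY_BEAN:
            heap = bean.get("HeapMemoryUsage", {})
            used, maximum = heap.get("used"), heap.get("max")
            # max can be -1 when the JVM reports no limit, so guard against it
            if used is not None and maximum and maximum > 0:
                summary["heap_used_pct"] = round(100.0 * used / maximum, 1)
    return summary


def fetch_summary(host, port=50070, timeout=10):
    """Query one name node over HTTP; the port is an assumption (Hadoop 2.x default)."""
    with urlopen(f"http://{host}:{port}/jmx", timeout=timeout) as resp:
        return summarize_jmx(json.load(resp))
```

For example, `fetch_summary("namenode1.example.com")` (a placeholder host) would return a dictionary with the reported state and the percentage of the maximum heap currently in use, which could be compared between the two name nodes.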
I would appreciate help from Stack Overflow users, and your opinions about our case.