Environment: The production cluster has 2 name-nodes (active and standby namely) and the nodes are SAS drives in Raid-1 configuration. These nodes have nothing but the master services (NN and Standby NN) running on each. They have a Ram of 256GB while the data nodes (where most of the processing happens) are set with only 128GB.
My Question: Why does Hadoop’s master nodes have such high Ram and why not the Datanodes when most of the processing is done where the data is available.?
P.S. As per hadoop thumb rule, we only require 1GB for every 1million files.