I have a Hadoop cluster with ~7 machines, and some of the machines keep going down. Sometimes only the Hadoop datanode / jobtracker processes die (the machine itself is still running), and other times the entire machine goes down.
I haven't really debugged a situation like this before, so I'm wondering where I should start, e.g. which logs I should look into. The log files under the /logs/ directory (files like hadoop-dev-datanode-X.log) don't seem to have anything useful. Also, if the Linux machine itself goes down, where should I look for the error messages?