I am running Hadoop 0.20.2 (yes, it's a legacy app). I have a simple master-slave setup with 2 nodes, and the cluster starts up fine. jps on the master shows:
4513 TaskTracker
4225 DataNode
4116 NameNode
4565 Jps
4329 SecondaryNameNode
4410 JobTracker
And jps on the slave shows:
2409 Jps
2363 TaskTracker
2287 DataNode
However, if I run a command that interacts with HDFS, like:
hadoop dfs -ls /
it hangs for a couple of minutes and then one of the datanodes dies. Looking in the datanode log I can see the known "directory is already locked" bug:
2017-07-05 16:12:59.986 INFO main org.apache.hadoop.hdfs.server.common.Storage - Cannot lock storage /srv/shared/hadoop/dfs/data. The directory is already locked.
Cannot lock storage /srv/shared/hadoop/dfs/data. The directory is already locked.
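Since the path in the log lives under /srv/shared, before formatting anything I now check whether a stale lock file is sitting in that directory and what dfs.data.dir is set to on each node. A minimal check, assuming the 0.20.2 layout where the config lives in conf/hdfs-site.xml and the storage lock file is named in_use.lock:

ls -l /srv/shared/hadoop/dfs/data/in_use.lock   # stale lock left behind, or held by another datanode?
grep -A1 "dfs.data.dir" conf/hdfs-site.xml      # is every node pointing at this same shared path?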
I have tried stopping all daemons, deleting dfs/data, and reformatting the namenode. After that I can successfully start the cluster again with everything up, but as soon as I interact with HDFS or run a MapReduce job, a datanode dies.
The exact steps I am taking, according to other posts, are (exact commands sketched below):
1. stop all daemons
2. delete the dfs/data dir
3. run hadoop namenode -format
4. start all daemons
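Concretely, the commands I run from the Hadoop install dir are roughly these (the data dir path is taken from the log above; step 2 has to be done on both nodes, and the format in step 3 wipes all HDFS data):

bin/stop-all.sh                        # 1. stop all HDFS and MapReduce daemons
rm -rf /srv/shared/hadoop/dfs/data     # 2. delete the dfs/data dir (on every node)
bin/hadoop namenode -format            # 3. reformat the namenode
bin/start-all.sh                       # 4. start all daemons again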
Not sure what else I can try.