We have an Ambari cluster with 3 master machines and 3 worker machines.
The NameNode is configured on master01 and master03, both Red Hat Linux 7.3 machines.
We noticed that after a cluster restart (machine reboot), we can't start the NameNode service on either machine (master01 or master03).
So we started to investigate the issue.
Surprisingly, we found that there are no fsimage files at all on master01 or master03,
although they should be under the folder /data/var/hadoop/hdfs/namenode/current/.
So at this stage we are stuck without a working cluster.
My questions are:
How can we recover/restore the files if they no longer exist on our machines?
What other alternatives do we have to recover the cluster?
The big question: how could these files have been deleted?
Are there any known commands, run as the HDFS user, that can delete these files or otherwise put the fsimage files at risk?
Last, and very important: how can we prevent this from happening a second time?
Background: what is the fsimage file?
fsimage – An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.
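Because the transaction ID in each file name is zero-padded to a fixed 19 digits (see the listing from the working cluster), the newest image in a metadata directory can be found with a plain lexical sort. A minimal sketch, assuming GNU find (as shipped with RHEL 7) and the fsimage_<txid> naming shown in this post; `latest_fsimage` and `fsimage_txid` are hypothetical helper names, not part of Hadoop:

```shell
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a NameNode metadata directory.
# Assumes GNU find and file names like fsimage_0000000000000008921.

# Print the name of the fsimage with the highest transaction ID in a directory
# (empty output if there is none). The .md5 sidecars are excluded, and so are
# in-progress checkpoints, which are named fsimage.ckpt_<txid>.
latest_fsimage() {
  local dir="$1"
  # Transaction IDs are zero-padded to the same width, so lexical order
  # matches numeric order and a plain sort is enough.
  find "$dir" -maxdepth 1 -type f -name 'fsimage_[0-9]*' ! -name '*.md5' -printf '%f\n' \
    | sort | tail -n 1
}

# Print the transaction ID encoded in an fsimage file name, without padding.
fsimage_txid() {
  # Drop the "fsimage_" prefix, strip leading zeros, and print 0 if nothing is left.
  printf '%s\n' "${1#fsimage_}" | sed 's/^0*//; s/^$/0/'
}
```

Run against the working cluster's directory listed below, `latest_fsimage /data/var/hadoop/hdfs/namenode/current` would be expected to print fsimage_0000000000000008921; on our broken masters it prints nothing, which matches what we see.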
Example of fsimage files from another, working cluster:
# cd /data/var/hadoop/hdfs/namenode/current/
# du -sh * | grep fsimage
4.0K fsimage_0000000000000000000
4.0K fsimage_0000000000000000000.md5
12K fsimage_0000000000000008921
4.0K fsimage_0000000000000008921.md5
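Each fsimage above has a .md5 sidecar. As far as we understand Hadoop's format, the sidecar holds a single line of the form `<md5-hex> *<fsimage file name>`, which is the same format GNU `md5sum -c` reads, so a surviving or restored copy of an image could be checked before it is put back in place. A minimal sketch under that assumption; `verify_fsimage` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: check an fsimage against its .md5 sidecar.
# Assumes the sidecar line looks like "<md5-hex> *fsimage_<txid>", which
# GNU md5sum -c accepts (the '*' marks binary mode).
# Usage: verify_fsimage <metadata-dir> <fsimage-file-name>
# The body runs in a subshell, so the cd does not affect the caller.
verify_fsimage() (
  # md5sum -c resolves the file name relative to the current directory,
  # so change into the metadata directory first.
  # --quiet suppresses the "OK" line; the exit status is 0 only on a match.
  cd "$1" && md5sum -c --quiet "$2.md5"
)
```

For example, `verify_fsimage /data/var/hadoop/hdfs/namenode/current fsimage_0000000000000008921` should return 0 only if the image still matches the checksum recorded when it was written.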