Restore/recover the fsimage files

Question

We have a cluster (an Ambari cluster with 3 master machines and 3 worker machines).

The NameNode is installed on master01 and master03, which are Red Hat Linux 7.3 machines.

After a cluster restart (machine reboot), we noticed that we can't start the NameNode services on either machine (master01 and master03).

So we started to investigate the issue.

Surprisingly, we found that there are no fsimage files at all on either master01 or master03!

The files should be under the folder /data/var/hadoop/hdfs/namenode/current/.

So at this stage we are stuck without a working cluster.

My questions are:

How can we recover/restore the files (if they do not exist on our machines)?
What other alternatives do we have in order to recover the cluster?
The big question: how could these files have been deleted?
Are there any known commands, run as the HDFS user, that can delete the fsimage files or put them at risk?

One last, very important question: how can we avoid this happening a second time?

Background: what is the fsimage file?

fsimage – An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.

Example of fsimage files from another, working cluster:

# cd /data/var/hadoop/hdfs/namenode/current/
# du -sh * | grep fsimage
4.0K    fsimage_0000000000000000000
4.0K    fsimage_0000000000000000000.md5
12K     fsimage_0000000000000008921
4.0K    fsimage_0000000000000008921.md5
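
To make the naming in the listing above concrete, here is a minimal bash sketch that picks the fsimage file with the highest transaction ID in a directory. The function name `latest_fsimage` is made up for illustration, and the sketch assumes the files follow the zero-padded `fsimage_<transaction ID>` pattern shown above; it only reads file names and does not check whether the image itself is intact.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the fsimage file with the highest
# transaction ID found in the given directory.
# Assumption: files are named fsimage_<zero-padded txid>, as in the listing.
latest_fsimage() {
    local dir="$1" f latest=""
    for f in "$dir"/fsimage_*; do
        # Skip the .md5 checksum files, and the unexpanded glob pattern
        # that is left over when the directory contains no fsimage at all.
        [[ -f "$f" && "$f" != *.md5 ]] || continue
        # Zero-padded names of equal length sort lexicographically by txid.
        if [[ "$f" > "$latest" ]]; then
            latest="$f"
        fi
    done
    # A non-zero status signals that no fsimage file was found.
    [[ -n "$latest" ]] || return 1
    basename "$latest"
}
```

Run as `latest_fsimage /data/var/hadoop/hdfs/namenode/current`, this would print `fsimage_0000000000000008921` for the working cluster above, and return a non-zero status with no output on a machine where the fsimage files are missing.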

Don't you need ZooKeeper and a QJM to maintain your NameNode HA status? – OneCricketeer, Jan 25 '18 at 03:01
Also, if your NameNode data volume isn't mirrored to a separate disk, or using RAID, then you're just missing the files, and there's likely nothing to recover. – OneCricketeer, Jan 25 '18 at 03:04
About "ZooKeeper and a QJM to maintain your NameNode HA status": can you explain in more detail what you mean? ZooKeeper was already started on each machine before we started the NameNode. – enodmilvado, Jan 25 '18 at 05:14
My master machines are VMs with VMDK disks, so I don't think these VMDKs are mirrored. – enodmilvado, Jan 25 '18 at 05:15
Question: it's very strange that all these files were deleted. Do you know whether some command run as the HDFS user could remove these files by mistake? – enodmilvado, Jan 25 '18 at 05:17
ZooKeeper is used to maintain the primary NameNode leader election. The QJM is used by the JournalNodes to maintain edit logs... And no, I don't think there's a process that can remove the files, at least not by accident. But there are multiple reasons why an edit log wouldn't be made. The fsimage could also fail to be completely written to disk if you don't shut down the cluster correctly, for example if a VM is put to sleep or simply shut down. – OneCricketeer, Jan 25 '18 at 05:56
