
We have a cluster (an Ambari cluster with 3 master machines and 3 worker machines).

The NameNode is set up on master01 and master03, which are Linux Red Hat 7.3 machines.

We noticed that after a cluster restart (machine reboot), we can't start the NameNode service on either machine (master01 and master03).

So we started to investigate the issue.

Surprisingly, we saw that there are no fsimage files on either master01 or master03!

The files should be under the folder /data/var/hadoop/hdfs/namenode/current/

So at this stage we are stuck without a working cluster.

My questions are:

  1. How can we recover/restore the files (if they do not exist on our machines)?

  2. What other alternatives do we have in order to recover the cluster?

  3. The big question: how can it be that these files were deleted?

  4. Are there any known commands, run as the HDFS user, that can delete these files or put the fsimage files at risk?

Last, a very important question: how can we avoid this happening a second time?

Background: what is the fsimage file?

fsimage – An fsimage file contains the complete state of the file system at a point in time. Every file system modification is assigned a unique, monotonically increasing transaction ID. An fsimage file represents the file system state after all modifications up to a specific transaction ID.
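
To make the definition above concrete, here is a minimal toy sketch of the idea that a checkpoint is the namespace state after replaying every modification up to a given transaction ID. This is not HDFS code: the `Edit` class, the `checkpoint` function, and the "create"/"delete" operations are hypothetical names invented for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    txid: int   # unique, monotonically increasing transaction ID
    op: str     # "create" or "delete" (toy operations, not real HDFS opcodes)
    path: str


def checkpoint(edits, up_to_txid):
    """Return the set of paths that exist after applying all edits with txid <= up_to_txid."""
    namespace = set()
    for edit in sorted(edits, key=lambda e: e.txid):
        if edit.txid > up_to_txid:
            break
        if edit.op == "create":
            namespace.add(edit.path)
        elif edit.op == "delete":
            namespace.discard(edit.path)
    return namespace


edits = [
    Edit(1, "create", "/a"),
    Edit(2, "create", "/b"),
    Edit(3, "delete", "/a"),
]

print(sorted(checkpoint(edits, 2)))  # state as of transaction 2
print(sorted(checkpoint(edits, 3)))  # state as of transaction 3
```

In this toy model, a file named like fsimage_0000000000000008921 would correspond to the result of `checkpoint(edits, 8921)`.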

Example of fsimage files from another working cluster:

# cd /data/var/hadoop/hdfs/namenode/current/
# du -sh * | grep fsimage
4.0K    fsimage_0000000000000000000
4.0K    fsimage_0000000000000000000.md5
12K     fsimage_0000000000000008921
4.0K    fsimage_0000000000000008921.md5
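
The check shown in the listing above could also be scripted. The following is a rough sketch that assumes only the naming pattern visible in that listing (fsimage_<transaction id> plus a matching .md5 file). The function names `list_checkpoints` and `latest_txid` are hypothetical helpers, not part of Hadoop.

```python
import re
from pathlib import Path

# Matches names such as fsimage_0000000000000008921 (but not the .md5 companions).
FSIMAGE_RE = re.compile(r"fsimage_(\d+)")


def list_checkpoints(current_dir):
    """Return a sorted list of (txid, has_md5) for fsimage files found in current_dir."""
    directory = Path(current_dir)
    if not directory.is_dir():
        return []
    found = []
    for entry in directory.iterdir():
        match = FSIMAGE_RE.fullmatch(entry.name)
        if match:
            has_md5 = (directory / (entry.name + ".md5")).is_file()
            found.append((int(match.group(1)), has_md5))
    return sorted(found)


def latest_txid(current_dir):
    """Return the highest transaction ID that has an fsimage file, or None if there is none."""
    checkpoints = list_checkpoints(current_dir)
    return checkpoints[-1][0] if checkpoints else None
```

Calling `latest_txid("/data/var/hadoop/hdfs/namenode/current")` on each master would return `None` in the situation described in this question, since no fsimage files are present there.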
  • Don't you need Zookeeper and a QJM to maintain your namenode HA status? – OneCricketeer Jan 25 '18 at 03:01
  • Also, if your namenode data volume isn't mirrored to a separate disk, or using RAID, then you're just missing the files, and there's likely nothing to recover – OneCricketeer Jan 25 '18 at 03:04
  • About Zookeeper and a QJM to maintain the namenode HA status: can you explain more about what you mean? Zookeeper was already started on each machine before we started the name node. – enodmilvado Jan 25 '18 at 05:14
  • My master machines are VMs with VMDK disks, so I don't think these VMDKs have mirroring – enodmilvado Jan 25 '18 at 05:15
  • Question: it's very strange that all these files were deleted. Do you know if some command that runs as the HDFS user can remove these files by mistake? – enodmilvado Jan 25 '18 at 05:17
  • Zookeeper is used to maintain the primary namenode leader election. The QJM is used by the JournalNodes to maintain edit logs... And no, I don't think there's a process that can remove the files, at least not by accident. But there are multiple reasons why an edit log wouldn't be made. The fsimage could also not be completely written to disk if you don't shut down the cluster correctly, such as by sleeping or just shutting down a VM – OneCricketeer Jan 25 '18 at 05:56
