
I have a two-node GlusterFS setup with two replicas on each node. One of the systems got overloaded somehow, and then things started to go wrong. I currently have all applications shut down, and I'm short of ideas on how to bring it back. I can start the volume, but some files seem to be corrupted.

I ran gluster volume heal kvm1. Now gluster volume heal kvm1 info shows a long list of GFIDs, such as:

<gfid:57d68ac5-5ae7-4d14-a65e-9b6bbe0f83a3>
<gfid:c725a364-93c5-4d98-9887-bc970412f124>
<gfid:8178c200-4c9a-407b-8954-08042e45bfce>
<gfid:b28866fa-6d29-4d2d-9f71-571a7f0403bd>

I'm not sure it is actually 'healing' anything. The number of entries has been steady. How can I confirm the healing process is actually working?

# gluster volume heal kvm1 info|egrep 'Brick|entries'
Brick f24p:/data/glusterfs/kvm1/brick1/brick
Number of entries: 5
Brick f23p:/data/glusterfs/kvm1/brick1/brick
Number of entries: 216
Brick f23p:/bricks/brick1/kvm1
Number of entries: 6
Brick f24p:/bricks/brick2/kvm1
Number of entries: 1
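
For reference, this is how I have been watching whether those counts move at all (assuming watch is available on the nodes); if healing were progressing I would expect the numbers to shrink over time:

# watch -n 60 "gluster volume heal kvm1 info | egrep 'Brick|entries'"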

# gluster volume status
Status of volume: kvm1
Gluster process                                         Port    Online  Pid
------------------------------------------------------------------------------
Brick f24p:/data/glusterfs/kvm1/brick1/brick       49160   Y       5937
Brick f23p:/data/glusterfs/kvm1/brick1/brick       49153   Y       5766
Brick f23p:/bricks/brick1/kvm1                     49154   Y       5770
Brick f24p:/bricks/brick2/kvm1                     49161   Y       5941
NFS Server on localhost                            2049    Y       5785
Self-heal Daemon on localhost                      N/A     Y       5789
NFS Server on f24p                                 2049    Y       5919
Self-heal Daemon on f24p                           N/A     Y       5923

There are no active volume tasks
Billy K

2 Answers


I was in the same state:

  • 2 replicas
  • gluster volume heal myVolume info was showing GFIDs on one of the bricks

I found this script, which resolves a GFID into a file path: https://gist.github.com/semiosis/4392640

My interpretation is the following (using your first GFID as an example), on the node where the gluster command reports the GFID:

The file %yourBrickPath%/.glusterfs/57/d6/57d68ac5-5ae7-4d14-a65e-9b6bbe0f83a3 is a hard link pointing to an inode.

In a normal situation you would have a file (in your production directory) pointing to the same inode; for some reason, that hard link is no longer present.
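
Since the .glusterfs entry is a hard link (for regular files), you can try to find the matching production path, if it still exists, by searching the brick for another name on the same inode. A rough sketch, using your first GFID and %yourBrickPath% as a placeholder:

# find %yourBrickPath% -samefile %yourBrickPath%/.glusterfs/57/d6/57d68ac5-5ae7-4d14-a65e-9b6bbe0f83a3 -not -path "*/.glusterfs/*"

If this prints a path, that is the file the GFID refers to; if it prints nothing, the production hard link is gone, which is the situation described below.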

I see two solutions (a rough sketch of both follows below):

  • You recreate the missing hard link in your production directory (and make sure the other node ends up in the same state)
  • You have no way to find out what the file name was (that was my case, as nothing was on the other node), so you remove %yourBrickPath%/.glusterfs/57/d6/57d68ac5-5ae7-4d14-a65e-9b6bbe0f83a3
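
As a sketch only (the target name in option 1 is a placeholder you have to work out yourself, and I would back up the brick before touching anything):

Option 1, recreate the missing hard link in the production directory:

# ln %yourBrickPath%/.glusterfs/57/d6/57d68ac5-5ae7-4d14-a65e-9b6bbe0f83a3 %yourBrickPath%/path/to/original-file

Option 2, the name cannot be recovered, so remove the orphaned .glusterfs entry:

# rm %yourBrickPath%/.glusterfs/57/d6/57d68ac5-5ae7-4d14-a65e-9b6bbe0f83a3

After either option, I would re-run gluster volume heal kvm1 and check whether the entry disappears from the info output.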

Edit: The content of the file might help you identify it.

Amertine

You may have stumbled on this bug if you are running version < 3.7.7:

https://bugzilla.redhat.com/show_bug.cgi?id=1284863

Check if any of your glustershd logs show "Couldn't get xlator xl-0".
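
Assuming the default log location, something like this should be enough to check:

# grep "Couldn't get xlator" /var/log/glusterfs/glustershd.log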

The fix is in 3.7.7. However, a workaround for older versions would be welcome if anyone finds one.