
Background first, as always. I administer two HPC systems. Both run the same OS, in this case CentOS 7. They both mount an NFS share provided by a server running Debian 8, which also hosts a number of other shares. Until recently the share was using POSIX ACLs and both clients displayed them as they should.

I recently moved the data for the NFS share used by the HPCs to a new storage server. Both the old and new servers use ZFS and are on the same 10G network, so I moved the data via ZFS send/receive. The new server runs Debian 9 and has the same software loadout as the old one, apart from version numbers of course. The /etc/exports entries for the NFS share were copied over, with the only change being the directory name, as the ZFS pool has a different name. The new server also has connections to the HPCs' InfiniBand networks, and those are what I used to connect to the NFS share from each of them.

Once everything was set up on the new server I unmounted the share on each HPC, changed the fstab entries to point to it via the Infiniband networks, and mounted it fresh. This seemed to work with no issues. That is until the users tried to go back to work.

Eventually I found out that the ACLs that were previously showing up on the HPCs were no longer all there. No, that is not a typo. The ACLs that are shown, via the +, on the storage server are not all shown on the HPCs. Here are two examples, with user, group, and directory names changed.

First the storage server.

drwxrwxrwx+  2 user1      group1        22 Aug 23  2018 directory
drwxr-xr-x+  8 user2      group2         9 Apr 25  2019 user2-directory
drwxrwxr-x+ 13 user3      group3        21 Jan 17 14:08 user3-directory
drwxrwx---+ 11 user4      group3        11 Feb 14 12:49 shared-directory
drwxrwxr-x   6 user5      group4        10 Mar  4 08:40 user5-directory
drwxr-xr-x   8 user6      group3         8 Jul 16  2019 share2-directory

Next the same 6 directories as seen on the HPCs.

drwxrwxrwx+  2 user1      group1        22 Aug 23  2018 directory
drwxr-xr-x   8 user2      group2         9 Apr 25  2019 user2-directory
drwxrwxr-x  13 user3      group3        21 Jan 17 19:08 user3-directory
drwxrwx---  11 user4      group3        11 Feb 14 17:49 shared-directory
drwxrwxr-x   6 user5      group4        10 Mar  4 13:40 user5-directory
drwxr-xr-x   8 user6      group3         8 Jul 16  2019 share2-directory

I also noticed the time difference between the two listings (the timestamps on the HPCs are five hours ahead of those on the storage server) and am wondering if that is part of the issue. The really odd thing is that when I run getfacl on the HPCs, on the directories that should have an ACL, I get the same output that I see on the storage server. That said, my users all report that they still do not have the access that they should have as per the ACLs.
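For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two listings can be diffed for missing ACL markers. It assumes the `ls -l` output from the storage server and from an HPC has been saved to two files; the function name `missing_acl_marker` and the file names are hypothetical, and it assumes directory names contain no spaces.

```shell
# Hypothetical helper: print names that carry the "+" ACL marker in the
# first listing (storage server) but not in the second (HPC client).
# Both arguments are files containing saved `ls -l` output.
missing_acl_marker() {
    awk '
        # First file: remember every name whose mode string ends in "+".
        NR == FNR { if ($1 ~ /\+$/) server[$NF] = 1; next }
        # Second file: report names flagged on the server but not here.
        ($NF in server) && $1 !~ /\+$/ { print $NF }
    ' "$1" "$2"
}

# Usage (file names are placeholders):
#   missing_acl_marker server-listing.txt hpc-listing.txt
```

Run against the two listings above, this would report user2-directory, user3-directory, and shared-directory.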

As a final note, here are the /etc/exports entries for the share and the /etc/fstab entry for one of the HPCs, all names and IPs modified of course.

/zfspool/zfsfilesystem          192.168.4.0/22(rw,crossmnt,nohide,async,no_root_squash,no_subtree_check) 192.168.8.0/22(rw,crossmnt,nohide,async,no_root_squash,no_subtree_check)

The fstab entry.

192.168.4.108:/zfspool/zfsfilesystem        /mnt/zfsfilesystem        nfs          defaults         0 0

I have tried forcing NFSv3 by adding vers=3 to the fstab. No change was noticed. I can also verify that mounting via Ethernet instead of Infiniband makes no difference.
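Since the fstab entry only says `defaults`, the NFS version actually negotiated is not obvious from the configuration alone. Below is a small sketch of one way to read it from the client's mount table. The function name `nfs_vers` is hypothetical, and it assumes the kernel lists a `vers=` option for the mount in /proc/mounts, which is what I would expect on CentOS 7.

```shell
# Hypothetical helper: read mount-table lines (the /proc/mounts format)
# from stdin and print the vers= value for the given NFS mount point.
nfs_vers() {
    awk -v mp="$1" '
        $2 == mp && $3 ~ /^nfs/ {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^vers=/) {
                    sub(/^vers=/, "", opts[i])
                    print opts[i]
                }
        }
    '
}

# Usage on an HPC node:
#   nfs_vers /mnt/zfsfilesystem < /proc/mounts
```

Running `nfsstat -m` on the client shows the same mount options in a more readable form.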

As this is such an odd issue I was hoping someone could help.

Chris Woelkers