My client reported that, when using NFSv4, their MPI program sometimes fails with "cannot open file" or "file not found" errors.
I compiled a sample MPI-IO program and confirmed that it fails when the MPI processes on the compute nodes try to access the same file on the NFS share. After some investigation, it turned out that changing the NFS mount from v4.1 to v3 eliminates the problem.
I would still like to use NFSv4 because of its safety and potential speed boost, so I'd like to know which mount options I should add to make it work.
OS: CentOS 7.6 updated to latest, nfs-utils 1.3.0, kernel 3.10.0-957.12.2
Server export:
/home 10.0.214.0/24(rw,no_subtree_check,no_root_squash)
Client fstab:
ib-orion-io1:/home /home nfs defaults,rdma,port=20049,nodev,nosuid 0 2
NFSv4 client mount:
ib-orion-io1:/home on /home type nfs4 (rw,nosuid,nodev,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,clientaddr=10.0.214.11,local_lock=none,addr=10.0.214.5)
NFSv3 client mount:
ib-orion-io1:/home on /home type nfs (rw,nosuid,nodev,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=10.0.214.5,mountvers=3,mountproto=tcp,local_lock=none,addr=10.0.214.5)
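The fstab entry shown earlier does not pin an NFS version. As a sketch only, the two mounts above would correspond to fstab entries along these lines; the vers= values are taken from the mount output, but their placement in fstab is an assumption, not a copy of the actual configuration:

```
# NFSv4.1 (fails with the MPI-IO test)
ib-orion-io1:/home /home nfs defaults,rdma,port=20049,nodev,nosuid,vers=4.1 0 2
# NFSv3 (works with the MPI-IO test)
ib-orion-io1:/home /home nfs defaults,rdma,port=20049,nodev,nosuid,vers=3 0 2
```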
Error shown on NFSv4 client:
Testing simple MPIO program with 112 processes accessing file tttestfile
(Filename can be specified via program argument)
Proc 0: hostname=node001
Proc 0: MPI_File_open failed (Other I/O error , error stack:
ADIO_OPEN(219): open failed on a remote node)
Proc 66: MPI_File_open failed (File does not exist, error stack:
ADIOI_UFS_OPEN(39): File tttestfile does not exist)
Proc 1: MPI_File_open failed (Other I/O error , error stack:
ADIO_OPEN(219): open failed on a remote node)
Proc 84: MPI_File_open failed (File does not exist, error stack:
ADIOI_UFS_OPEN(39): File tttestfile does not exist)
The sample parallel MPI-IO program is taken from HDF5.
See the "==> Sample_mpio.c <==" section in https://support.hdfgroup.org/ftp/HDF5/current/src/unpacked/release_docs/INSTALL_parallel