I am running an analysis on a cluster, and internally I spawn some processes. Most of the time it works, but sometimes I get the following error:
mm_xpmem.c:135 UCX ERROR failed to attach xpmem apid 0x600005c0e offset 0x2b8cb9183000 length 12288: No such file or directory
mm_ep.c:172 UCX ERROR mm ep failed to connect to remote FIFO id 0x2b8cb9183000: Input/output error
The error occurs randomly. What causes it, and how can it be resolved?
OpenMPI: 4.0.5
mpi4py: 3.1.3