This is a Dell PowerEdge R750xa server with 4 NVIDIA A40 GPUs, intended for AI workloads. The GPUs work well individually, but any multi-GPU workload in which at least two GPUs have to exchange data fails, from multi-GPU training jobs down to the simpleIPC and conjugateGradientMultiDeviceCG CUDA samples (simpleIPC reports mismatching results, conjugateGradientMultiDeviceCG simply hangs).
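For reference, a minimal peer-access query along the following lines (a small sketch of my own, not one of the shipped samples; it only uses the standard cudaGetDeviceCount and cudaDeviceCanAccessPeer runtime calls) can at least show whether the runtime reports peer-to-peer capability between each pair of GPUs:

```
// p2p_check.cu -- minimal sketch: ask the CUDA runtime whether each
// GPU pair reports peer-to-peer access capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n",
                   i, j, canAccess ? "supported" : "NOT supported");
        }
    }
    return 0;
}
```

Compile with `nvcc p2p_check.cu -o p2p_check` and run it on the affected machine; note that even when peer access is reported as supported, the actual transfers can still go wrong if the IOMMU interferes with the P2P traffic.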
I have seen online discussions (1, 2, 3) claiming that the IOMMU must be turned off for GPU peer-to-peer transfers to work. I tried booting with the iommu=off and intel_iommu=off Linux kernel parameters, but it didn't help. I also checked the BIOS settings, but there is no option there to disable the IOMMU.
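For completeness, here is a quick way to confirm whether the IOMMU is actually disabled after rebooting (a minimal host-side sketch, assuming the standard /proc/cmdline and /sys/kernel/iommu_groups locations; it needs no CUDA):

```
// iommu_check.cpp -- minimal sketch: print the running kernel's boot
// parameters and count the IOMMU groups exposed under sysfs.
// Compile with: g++ -std=c++17 iommu_check.cpp -o iommu_check
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <system_error>

int main() {
    // Boot command line actually used by the running kernel.
    std::ifstream cmdline("/proc/cmdline");
    std::string line;
    std::getline(cmdline, line);
    std::cout << "Kernel cmdline: " << line << "\n";

    // When the IOMMU is active, /sys/kernel/iommu_groups contains one
    // subdirectory per group; when it is disabled, the directory is empty.
    std::size_t groups = 0;
    std::error_code ec;
    for (const auto& entry :
         std::filesystem::directory_iterator("/sys/kernel/iommu_groups", ec)) {
        (void)entry;
        ++groups;
    }
    std::cout << "IOMMU groups found: " << groups
              << (groups ? " (IOMMU appears active)" : " (IOMMU appears disabled)")
              << "\n";
    return 0;
}
```

A non-empty groups directory would mean the IOMMU is still active despite the kernel parameters, presumably because the firmware keeps it enabled.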