This is a Dell PowerEdge R750xa server with 4 NVIDIA A40 GPUs, intended for AI workloads. The GPUs work well individually, but any multi-GPU workload in which at least two GPUs have to exchange data fails, from multi-GPU training jobs down to the simpleIPC and conjugateGradientMultiDeviceCG CUDA samples (simpleIPC reports mismatching results, conjugateGradientMultiDeviceCG simply hangs).
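For reference, a minimal peer-access query along the following lines (a small sketch of my own, not one of the shipped samples; it only uses the standard cudaGetDeviceCount and cudaDeviceCanAccessPeer runtime calls) can at least show whether the runtime reports peer-to-peer capability between each pair of GPUs:

```
// p2p_check.cu -- minimal sketch: ask the CUDA runtime whether each
// GPU pair reports peer-to-peer access capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("Found %d CUDA device(s)\n", n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int canAccess = 0;
            cudaDeviceCanAccessPeer(&canAccess, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n",
                   i, j, canAccess ? "supported" : "NOT supported");
        }
    }
    return 0;
}
```

Compile with `nvcc p2p_check.cu -o p2p_check` and run it on the affected machine; note that even when peer access is reported as supported, the actual transfers can still go wrong if the IOMMU interferes with the P2P traffic.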
I have seen online discussions (1, 2, 3) claiming that the IOMMU must be turned off for GPU peer-to-peer transfers to work. I tried booting with the iommu=off and intel_iommu=off Linux kernel parameters, but it didn't help. I also checked the BIOS settings, but there is no option there to disable the IOMMU.
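For completeness, here is a quick way to confirm whether the IOMMU is actually disabled after rebooting (a minimal host-side sketch, assuming the standard /proc/cmdline and /sys/kernel/iommu_groups locations; it needs no CUDA):

```
// iommu_check.cpp -- minimal sketch: print the running kernel's boot
// parameters and count the IOMMU groups exposed under sysfs.
// Compile with: g++ -std=c++17 iommu_check.cpp -o iommu_check
#include <filesystem>
#include <fstream>
#include <iostream>
#include <string>
#include <system_error>

int main() {
    // Boot command line actually used by the running kernel.
    std::ifstream cmdline("/proc/cmdline");
    std::string line;
    std::getline(cmdline, line);
    std::cout << "Kernel cmdline: " << line << "\n";

    // When the IOMMU is active, /sys/kernel/iommu_groups contains one
    // subdirectory per group; when it is disabled, the directory is empty.
    std::size_t groups = 0;
    std::error_code ec;
    for (const auto& entry :
         std::filesystem::directory_iterator("/sys/kernel/iommu_groups", ec)) {
        (void)entry;
        ++groups;
    }
    std::cout << "IOMMU groups found: " << groups
              << (groups ? " (IOMMU appears active)" : " (IOMMU appears disabled)")
              << "\n";
    return 0;
}
```

A non-empty groups directory would mean the IOMMU is still active despite the kernel parameters, presumably because the firmware keeps it enabled.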