I have a question. Is it possible to identify which slot holds a broken GPU card from within the Ubuntu operating system? We have a Supermicro GPU server with about 8 GPU cards for AI computing. Every now and then we go to the server room after users or a department tell us that a card is no longer visible in the output of the 'nvidia-smi' command. These are generally hardware failures. We then have 7 cards working properly and have to identify the faulty one by trial and error, pulling cards from the server. This is terribly tedious and time-consuming, so I am wondering whether it is possible to unambiguously identify the slot where the faulty card is located.

Thank you in advance.

Herman

1 Answer

In general, if you can find out the PCI bus address of the card, you can locate the precise slot it occupies. Go through the dmidecode output (the system slot entries, for example from dmidecode -t slot) and find the slot whose bus address matches that PCI address.

However, this only helps if you can trust that the PCI slot numbering in the DMI data is predictable and corresponds to the actual physical slots on the motherboard. In servers from major vendors (HPE, Dell, etc.) this is often the case. If the motherboard comes from a less reputable manufacturer, its DMI data may be out of sync with the hardware. Nevertheless, it is worth trying.
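
The matching described above can be sketched in Python. This is only a rough illustration, not a tested tool: it assumes the text comes from dmidecode -t slot and contains "Designation:" and "Bus Address:" lines, and that the working GPUs' bus IDs come from nvidia-smi --query-gpu=pci.bus_id --format=csv,noheader. The function names are made up for this example.

```python
def parse_slots(dmi_text):
    """Map PCI bus address -> slot designation from `dmidecode -t slot` text.

    Assumes each slot entry has a 'Designation:' line followed later by a
    'Bus Address:' line; entries without a bus address are skipped.
    """
    slots = {}
    designation = None
    for line in dmi_text.splitlines():
        line = line.strip()
        if line.startswith("Designation:"):
            designation = line.split(":", 1)[1].strip()
        elif line.startswith("Bus Address:") and designation is not None:
            address = line.split(":", 1)[1].strip().lower()
            slots[address] = designation
            designation = None
    return slots


def normalize_bus_id(bus_id):
    """Turn an nvidia-smi style ID ('00000000:3B:00.0') into '0000:3b:00.0'."""
    domain, rest = bus_id.strip().lower().split(":", 1)
    return domain[-4:].rjust(4, "0") + ":" + rest


def slots_without_working_gpu(dmi_text, working_bus_ids):
    """Return slots whose bus address is not among the working GPUs.

    Note: this also lists empty slots and slots holding non-GPU cards,
    so the result is a list of candidates, not a definite answer.
    """
    working = {normalize_bus_id(b) for b in working_bus_ids}
    return {addr: name
            for addr, name in parse_slots(dmi_text).items()
            if addr not in working}
```

To use it, save the output of sudo dmidecode -t slot while all cards are healthy, then after a failure pass that text together with the bus IDs nvidia-smi still reports. Whether the designations match the silkscreen labels on the board still depends on the quality of the DMI data, as noted above.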

Nikita Kipriyanov