
I'd like a small function to detect whether a given computer has a CUDA-enabled GPU available, such as the following.

```c
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t cudaResultCode = cudaGetDeviceCount(&deviceCount);
    if (cudaResultCode != cudaSuccess)
        deviceCount = 0;   /* treat any CUDA error as "no devices" */
    printf("%d GPU CUDA device(s) found\n", deviceCount);
    return 0;
}
```

On a machine without a GPU plugged in (but with the CUDA libraries installed), this code triggers the driver to log a message to stderr.

```
$ ./a.out
FATAL: Error inserting nvidia (/lib/modules/2.6.32-504.16.2.el6.x86_64/extra/nvidia.ko): No such device
0 GPU CUDA device(s) found
```

Is there any way to prevent the driver from printing this message on machines with no GPU, without closing stderr or resorting to similar hacks?

Robert T. McGibbon
  • It's not clear why the driver is active on a machine that has no GPU. I have various RHEL 6 machines and they don't have `nvidia.ko` in the location you indicate. I compiled and ran your code on a machine that has never had the NVIDIA driver installed (but has CUDA installed, including the libraries obviously), and it reported the proper printout from the code with nothing sent to stderr. – Robert Crovella Jul 03 '15 at 23:00
  • Just to confirm what we're assuming, run your program with stdout and stderr redirected (to separate files), just to make sure that it's your process emitting the message. –  Jul 03 '15 at 23:09
  • 1. Yes, with both stdout and stderr redirected, the error message appears in the stderr file. – Robert T. McGibbon Jul 03 '15 at 23:22
  • 2. The machine is a large cluster with a shared filesystem: some of the nodes have GPUs and others don't, but they all share the same filesystem, so that nvidia.ko file is present regardless. – Robert T. McGibbon Jul 03 '15 at 23:23
  • Given that your Linux kernel is configured for a non-existent GPU, one option you might consider (not really an answer to your question) would be to use some other method to "detect whether a given computer has a CUDA-enabled GPU available". For example, you could do something equivalent to `lspci |grep -i nvidia` and see if there *actually is* an NVIDIA GPU in the system. If there is, then perhaps run the above test. The above test might become somewhat redundant in that case, but it could still catch some remaining errors and corner cases, such as a GPU driver/CUDA runtime mismatch. – Robert Crovella Jul 04 '15 at 01:14
  • If your cluster is managed by a job scheduler, the scheduler may also provide this information for you. For example, if a user requests GPU-enabled nodes in Torque, Torque can be configured [to report the specifics](http://docs.adaptivecomputing.com/torque/4-0-2/Content/topics/3-nodes/schedulingGPUs.htm) in `$PBS_GPUFILE`. – Robert Crovella Jul 04 '15 at 01:46
  • You can first [check whether the `nvidia` module is loaded](http://stackoverflow.com/q/12978794/929437) before calling `cudaGetDeviceCount`. But beware: if the correct kernel module is available but not currently loaded, this check will fail. – aland Jul 04 '15 at 07:29
  • I am 99% sure that message isn't coming from the NVIDIA driver but from the kernel itself during module insertion. The driver returns an error string to the kernel, which goes along with the error message, but the actual echoing of the error message to stderr comes from the kernel `init_module` call. You might want to see whether setting `MODPROBE_OPTIONS` to redirect to the syslog has any effect on this behaviour, but I doubt it will. – talonmies Jul 04 '15 at 10:38
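Robert Crovella's idea of checking for NVIDIA hardware before touching CUDA can be done in-process, without running `lspci`, by scanning sysfs. This is a minimal sketch under stated assumptions: it assumes a Linux kernel that exposes `/sys/bus/pci/devices/<address>/vendor` and `class`, uses NVIDIA's PCI vendor ID `0x10de`, and the names `count_nvidia_gpus` and `read_hex_attr` are hypothetical. It counts PCI functions whose class code is in the display-controller range (`0x03xxxx`, covering both VGA and 3D controllers); it says nothing about whether the driver or CUDA runtime actually works.

```c
#include <dirent.h>
#include <stdio.h>

#define NVIDIA_PCI_VENDOR_ID 0x10deUL  /* NVIDIA's PCI vendor ID */

/* Reads a sysfs attribute holding a hex number such as "0x10de".
   Returns 0 on success, -1 on failure. (Hypothetical helper.) */
static int read_hex_attr(const char *dev, const char *attr, unsigned long *out)
{
    char path[512];
    snprintf(path, sizeof path, "/sys/bus/pci/devices/%s/%s", dev, attr);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int ok = fscanf(f, "%lx", out) == 1;  /* %lx accepts the 0x prefix */
    fclose(f);
    return ok ? 0 : -1;
}

/* Counts NVIDIA PCI functions with a display-controller class (0x03xxxx).
   Returns -1 if /sys/bus/pci/devices cannot be opened. */
int count_nvidia_gpus(void)
{
    DIR *d = opendir("/sys/bus/pci/devices");
    if (d == NULL)
        return -1;
    int count = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.')          /* skip "." and ".." */
            continue;
        unsigned long vendor, cls;
        if (read_hex_attr(e->d_name, "vendor", &vendor) == 0 &&
            read_hex_attr(e->d_name, "class", &cls) == 0 &&
            vendor == NVIDIA_PCI_VENDOR_ID &&
            (cls >> 16) == 0x03)          /* base class 0x03: display */
            count++;
    }
    closedir(d);
    return count;
}
```

If `count_nvidia_gpus()` returns 0, the `cudaGetDeviceCount` call could be skipped entirely, which presumably avoids the module-load attempt on GPU-less nodes; a positive count would still be followed by the CUDA check to catch driver/runtime problems.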
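aland's check can be sketched by reading `/proc/modules`, where each line starts with a module name followed by a space. Both function names here are hypothetical, and the parsing is split out so it works on any stream. Matching the full name followed by a space avoids mistaking `nvidia_uvm` or `nvidia_modeset` for `nvidia`.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if some line of a /proc/modules-style stream begins with
   "<name> ", 0 otherwise. */
int modules_list_contains(FILE *f, const char *name)
{
    char line[512];
    size_t len = strlen(name);
    while (fgets(line, sizeof line, f) != NULL) {
        /* the module name is the first space-terminated field */
        if (strncmp(line, name, len) == 0 && line[len] == ' ')
            return 1;
    }
    return 0;
}

/* Returns 1 if the "nvidia" kernel module is currently loaded, 0 if not,
   and -1 if /proc/modules cannot be read (e.g. not running on Linux). */
int nvidia_module_loaded(void)
{
    FILE *f = fopen("/proc/modules", "r");
    if (f == NULL)
        return -1;
    int found = modules_list_contains(f, "nvidia");
    fclose(f);
    return found;
}
```

As aland points out, this reports 0 when the module is installed but simply not loaded yet, so on nodes where the driver is loaded on demand it would wrongly conclude there is no GPU.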
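For the Torque route, a library could read the file named by `$PBS_GPUFILE` instead of probing the hardware. This sketch assumes the file lists one allocated GPU per line (for example `node01-gpu0`); that format should be checked against the Torque documentation linked above. The function names are hypothetical, and an unset variable is reported as -1 ("unknown") rather than as zero GPUs, since jobs not launched through Torque would not have it.

```c
#include <stdio.h>
#include <stdlib.h>

/* Counts non-empty lines in a stream, including a final line that has
   no trailing newline. */
int count_nonempty_lines(FILE *f)
{
    int count = 0, in_line = 0, c;
    while ((c = fgetc(f)) != EOF) {
        if (c == '\n') {
            if (in_line)
                count++;
            in_line = 0;
        } else {
            in_line = 1;
        }
    }
    if (in_line)
        count++;
    return count;
}

/* Returns the number of GPUs listed in the file named by $PBS_GPUFILE,
   or -1 if the variable is unset or the file cannot be opened. */
int pbs_gpu_count(void)
{
    const char *path = getenv("PBS_GPUFILE");
    if (path == NULL)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    int n = count_nonempty_lines(f);
    fclose(f);
    return n;
}
```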

1 Answer


The easiest way to accomplish this is to redirect stderr to /dev/null (or to an error log file). Details here: http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html

Then the only thing printed will be your message to stdout.
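The redirection can also be done inside the library without forking a subprocess: temporarily point file descriptor 2 at `/dev/null` around the CUDA call using `dup`/`dup2`. This is a minimal POSIX sketch with hypothetical function names `silence_stderr` and `restore_stderr`. It assumes the message reaches the process's fd 2 (for instance via a `modprobe` child process, which inherits it). Note that this is process-wide: while it is active, stderr output from every thread is discarded too.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Points file descriptor 2 at /dev/null. Returns a duplicate of the
   original stderr to pass to restore_stderr(), or -1 if the redirection
   failed (in which case stderr is left untouched). */
int silence_stderr(void)
{
    fflush(stderr);                        /* don't drop anything already buffered */
    int saved = dup(STDERR_FILENO);        /* keep a handle on the real stderr */
    if (saved < 0)
        return -1;
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull < 0 || dup2(devnull, STDERR_FILENO) < 0) {
        if (devnull >= 0)
            close(devnull);
        close(saved);
        return -1;
    }
    close(devnull);                        /* fd 2 now refers to /dev/null */
    return saved;
}

/* Restores the stderr saved by silence_stderr(); does nothing for -1. */
void restore_stderr(int saved)
{
    if (saved < 0)
        return;
    fflush(stderr);
    dup2(saved, STDERR_FILENO);            /* fd 2 points at the original again */
    close(saved);
}
```

Typical use would bracket only the probe: `int saved = silence_stderr(); cudaError_t rc = cudaGetDeviceCount(&deviceCount); restore_stderr(saved);`. Keeping the window that small limits how much unrelated stderr output could be lost.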

It'sPete
  • Sorry, I can't use redirection. The actual problem involves having this code inside a much larger library, and if at all possible I want to avoid forking off a subprocess for this purpose. – Robert T. McGibbon Jul 03 '15 at 22:53