
I've installed CUDA and CUDNN on my machine (Ubuntu 16.04) alongside tensorflow-gpu.

Versions used: CUDA 10.0, CUDNN 7.6, Python 3.6, Tensorflow 1.14
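
As an extra sanity check (a minimal sketch, assuming the TF 1.x API), tf.test.is_gpu_available() also reports whether a CUDA device is usable:

import tensorflow as tf

# Quick TF 1.x check that a CUDA-enabled GPU can actually be used
print(tf.test.is_gpu_available(cuda_only=True))
print(tf.test.gpu_device_name())  # prints something like '/device:GPU:0' when a GPU is found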


This is the output from nvidia-smi, showing the video card configuration.

| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    On   | 00000000:02:00.0 Off |                  N/A |
| N/A   44C    P8    N/A /  N/A |    675MiB /  4046MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1502      G   /usr/lib/xorg/Xorg                           363MiB |
|    0      3281      G   compiz                                        96MiB |
|    0      4375      G   ...uest-channel-token=14359313252217012722    69MiB |
|    0      5157      C   ...felipe/proj/venv/bin/python3.6            141MiB |
+-----------------------------------------------------------------------------+

This is the output from device_lib.list_local_devices() (a TensorFlow helper method that shows which devices it can see), showing that my GPU is visible to TensorFlow:

[name: "/device:CPU:0"
  device_type: "CPU"
  memory_limit: 268435456
  locality {
  }
  incarnation: 5096693727819965430, 
name: "/device:XLA_GPU:0"
  device_type: "XLA_GPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 13415556283266501672
  physical_device_desc: "device: XLA_GPU device", 
name: "/device:XLA_CPU:0"
  device_type: "XLA_CPU"
  memory_limit: 17179869184
  locality {
  }
  incarnation: 14339781620792127180
  physical_device_desc: "device: XLA_CPU device", 
name: "/device:GPU:0"
  device_type: "GPU"
  memory_limit: 3464953856
  locality {
    bus_id: 1
    links {
    }
  }
  incarnation: 13743207545082600644
  physical_device_desc: "device: 0, name: GeForce GTX 960M, pci bus id: 0000:02:00.0, compute capability: 5.0"
]
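
For reference, that listing comes from a call like this (a minimal TF 1.x sketch):

from tensorflow.python.client import device_lib

# Lists every device TensorFlow can see (CPU, GPU and the XLA variants)
print(device_lib.list_local_devices())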

Now, as for actually using the GPU for computations: I used a small piece of code to run some dummy matrix multiplications on the CPU and on the GPU, to compare the performance:

import tensorflow as tf
from datetime import datetime

shapes = [(50, 50), (100, 100), (500, 500), (1000, 1000), (10000, 10000), (15000, 15000)]

devices = ['/device:CPU:0', '/device:XLA_GPU:0']

for device in devices:
    for shape in shapes:
        with tf.device(device):
            random_matrix = tf.random_uniform(shape=shape, minval=0, maxval=1)
            dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
            sum_operation = tf.reduce_sum(dot_operation)

        # Time the actual runtime of the operations
        start_time = datetime.now()
        with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
            result = session.run(sum_operation)
        elapsed_time = datetime.now() - start_time

        # Print elapsed time, shape and device used
        print("Input shape:", shape, "using Device:", device,
              "took: {:.2f}".format(elapsed_time.total_seconds()))

Here is the surprise: the first time I run the cell containing this block of code (I'm in a Jupyter notebook), the GPU computations take much longer than the CPU ones:

# output of first run: CPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.01
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.01
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.01
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.02
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.22
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 21.23
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 2.82
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.17
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.18
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.20
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 28.36
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 93.73
----------------------------------------

Surprise #2: When I rerun the cell containing the dummy matrix multiplication code, the GPU version is much faster (as expected):

# output of reruns: GPU is faster
----------------------------------------
Input shape: (50, 50) using Device: /device:CPU:0 took: 0.02
Input shape: (100, 100) using Device: /device:CPU:0 took: 0.02
Input shape: (500, 500) using Device: /device:CPU:0 took: 0.02
Input shape: (1000, 1000) using Device: /device:CPU:0 took: 0.04
Input shape: (10000, 10000) using Device: /device:CPU:0 took: 6.78
Input shape: (15000, 15000) using Device: /device:CPU:0 took: 24.65
----------------------------------------
Input shape: (50, 50) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (100, 100) using Device: /device:XLA_GPU:0 took: 0.12
Input shape: (500, 500) using Device: /device:XLA_GPU:0 took: 0.13
Input shape: (1000, 1000) using Device: /device:XLA_GPU:0 took: 0.14
Input shape: (10000, 10000) using Device: /device:XLA_GPU:0 took: 1.64
Input shape: (15000, 15000) using Device: /device:XLA_GPU:0 took: 5.29
----------------------------------------

So my question is: Why is it that only after I run the code once does GPU acceleration actually occur?

I can see the GPU is correctly set up (otherwise no acceleration would happen at all). Is it due to some sort of initial overhead? Do GPUs need to warm up before we can actually use them?

P.S.: On all runs (i.e. the one where the GPU was slower and the subsequent ones, where it was faster), I could see GPU usage was at 100%, so it was definitely being used.

P.P.S.: Only in the very first run does the GPU seem not to get picked up. If I then rerun the cell two, three, or more times, every run after the first behaves as expected (i.e. the GPU computation is faster).


1 Answer


Robert Crovella's comment made me look into the XLA devices, which helped me find the solution.

It turns out the GPU is mapped to a TensorFlow device in two ways: as an XLA device and as a normal GPU device.

This is why there were two devices, one named "/device:XLA_GPU:0" and the other "/device:GPU:0".

All I needed to do was target "/device:GPU:0" instead. Now the GPU gets picked up by TensorFlow immediately.
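
A minimal sketch of the change (assuming the same TF 1.x benchmark as above; only the device string differs):

import tensorflow as tf

# Place the ops on the plain CUDA device instead of the XLA one
# (i.e. '/device:GPU:0' rather than '/device:XLA_GPU:0')
with tf.device('/device:GPU:0'):
    random_matrix = tf.random_uniform(shape=(1000, 1000), minval=0, maxval=1)
    dot_operation = tf.matmul(random_matrix, tf.transpose(random_matrix))
    sum_operation = tf.reduce_sum(dot_operation)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as session:
    result = session.run(sum_operation)  # the placement log shows where each op ran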

  • It's true that there are two GPU devices and that comparing `CPU` to `XLA_GPU` might not be the best comparison. However, I also expect that the answer to your question (why does the first iteration take so much longer) is due to the **JIT** mechanism that is inherent in XLA. The JIT mechanism runs on first usage, and involves many additional processing steps. The code is still executing on the GPU, but the JIT process takes extra time. Thereafter, subsequent calls to the function do not incur the JIT overhead, and run much faster. Reasonably sure this is "expected" behavior. – Robert Crovella Jul 19 '19 at 19:34
  • @RobertCrovella you can write an answer with that and I'll mark it as the right answer – Felipe Jul 20 '19 at 20:16
  • @Felipe In your answer, you mention the GPU gets finally picked up immediately. Does it actually mean the first run is as fast as the subsequent ones? Aside the XLA JIT spot on, I wonder whether there is not also a slow start due to data moving to the GPU memory. – Eric Platon Sep 15 '22 at 02:35