
I'm learning to use CuPy, but I've run into a really confusing problem. CuPy performs well at the start of a program, but after it has been running for a while it seems to become much slower. Here is the code:

import cupy as np
from line_profiler import LineProfiler

def test(ary):
    for i in range(1000):
        ary**6

def main():
    rand=np.random.rand(1024,1024)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)

lp = LineProfiler()
lp_wrapper = lp(main)
lp_wrapper()
lp.print_stats()

and here are the timing results:

Timer unit: 2.85103e-07 s

Total time: 16.3308 s
File: E:\Desktop\test.py
Function: main at line 8

Line #      Hits         Time    Per Hit   % Time  Line Contents
==============================================================
     8                                             def main():
     9         1    1528817.0  1528817.0      2.7      rand=np.random.rand(1024,1024)
    10         1     111014.0   111014.0      0.2      test(rand)
    11         1      94528.0    94528.0      0.2      test(rand)
    12         1      95636.0    95636.0      0.2      test(rand)
    13         1      94892.0    94892.0      0.2      test(rand)
    14         1    7728318.0  7728318.0     13.5      test(rand)
    15         1   23872383.0 23872383.0     41.7      test(rand)
    16         1   23754666.0 23754666.0     41.5      test(rand)

After CuPy has completed about 5000 power operations, it becomes very slow.

I ran this code on Windows, and the CUDA version is 10.0.

Hoping for answers. Thank you very much!


Thanks for your answer! I printed CuPy's memory usage:

import cupy as np

def test(ary):
    # use the np alias here; the original code referenced the undefined name `cupy`
    mempool = np.get_default_memory_pool()
    pinned_mempool = np.get_default_pinned_memory_pool()
    for i in range(1000):
        ary**6
    print("used bytes: %s"%mempool.used_bytes())
    print("total bytes: %s\n"%mempool.total_bytes())

def main():
    rand=np.random.rand(1024,1024)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)
    test(rand)

main()

and here is the output:

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

used bytes: 8388608
total bytes: 16777216

It seems that the GPU memory usage remains the same across iterations.

By the way, is there any way to avoid this speed reduction?

Colin

1 Answer


This is an issue of the CUDA kernel queue.

CUDA kernel launches are asynchronous: CuPy enqueues each kernel and returns control to the host immediately, as long as the queue has room. The short execution times you observed for the first few calls were therefore misleading. Once the queue fills up, each new launch has to wait for a queued kernel to finish, so the host-side calls start to block.

The time reported for the last line reflects the actual performance.

Note: this was NOT an issue of memory allocation, as I originally suggested; I keep that first answer below for the record.


Original (incorrect) answer

It may be due to reallocation.

When you import CuPy, it allocates a certain amount of GPU memory up front. Once all of it is used, CuPy has to allocate more, and that allocation increases the execution time.

Yuki Hashimoto