A follow-up to this question: Mxnet - slow array copy to GPU

Question: mxnet GPU initialization takes around 20 seconds. How can I fix it?

I have the following code:

import mxnet as mx
import mxnet.ndarray as nd

from mxnet import profiler

profiler.set_config(aggregate_stats=True)

ctx = mx.gpu()

profiler.set_state('run')
nd.random.uniform(-1, 1, shape=(1, 1), ctx=ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))

And this is the profiler output:

Device Storage
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
Memory: gpu/0                           3           0.0080           0.0040           0.0120           0.0040

MXNET_C_API
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
MXNDArrayWaitAll                        1           2.0640           2.0640           2.0640           2.0640
MXNDArrayFree                           1           0.0010           0.0010           0.0010           0.0010
MXImperativeInvokeEx                    1       22197.0469       22197.0469       22197.0469       22197.0469
MXNet C API Concurrency                 6           0.0000           0.0000           0.0010           0.0005
MXNet C API Calls                       3           0.0030           0.0010           0.0030           0.0010

operator
=================
Name                          Total Count        Time (ms)    Min Time (ms)    Max Time (ms)    Avg Time (ms)
----                          -----------        ---------    -------------    -------------    -------------
_random_uniform                         2           1.6280           0.8130           0.8150           0.8140
DeleteVariable                          2           0.0130           0.0060           0.0070           0.0065
ResourceParallelRandomSetSeed              10          17.9840           0.4670           5.7260           1.7984

So the first call takes about 22 seconds (see the MXImperativeInvokeEx row). Every operation after that is very fast, but the first GPU operation always takes about 22 seconds, no matter which operation it is, so the time is most likely spent on initialization. How can I fix it?
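One way to confirm that only the first call pays this cost is to time the same operation several times. The following is a minimal sketch, not part of the original measurements: the helper names `time_calls` and `first_call_overhead` are hypothetical, and the MXNet usage is left as comments because it needs a GPU build.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def first_call_overhead(durations):
    """Estimate the one-off start-up cost: first duration minus the median of the rest."""
    rest = sorted(durations[1:])
    if not rest:
        return durations[0]
    steady_state = rest[len(rest) // 2]
    return durations[0] - steady_state


# Hypothetical usage with MXNet (assumes a working GPU build):
#
#   import mxnet as mx
#   ctx = mx.gpu()
#
#   def op():
#       mx.nd.random.uniform(-1, 1, shape=(1, 1), ctx=ctx)
#       mx.nd.waitall()  # block until the asynchronous op has finished
#
#   print(first_call_overhead(time_calls(op)))
```

If the estimated overhead is close to the whole first duration, the time is going into one-off setup rather than into the operation itself.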

I tried the approach from https://github.com/apache/incubator-mxnet/issues/3239:

export CUDA_CACHE_MAXSIZE=2147483647
export CUDA_CACHE_DISABLE=0
export CUDA_CACHE_PATH="my_home_path/.nv/ComputeCache"

but it doesn't work.
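For reference, the same three variables can also be set from Python. As far as I know the CUDA driver reads them when it initializes, so they should be set before the first GPU call (setting them before `import mxnet` is the safest). This is only a sketch: the helper names `configure_cuda_cache` and `cache_size_bytes` are made up for illustration, and the cache directory is whatever path you choose.

```python
import os


def configure_cuda_cache(cache_dir, max_bytes=2147483647):
    """Set the CUDA JIT cache variables for this process and create the cache directory."""
    cache_dir = os.path.abspath(os.path.expanduser(cache_dir))
    os.makedirs(cache_dir, exist_ok=True)
    settings = {
        "CUDA_CACHE_DISABLE": "0",             # 0 keeps the JIT cache enabled
        "CUDA_CACHE_MAXSIZE": str(max_bytes),  # cache size limit in bytes
        "CUDA_CACHE_PATH": cache_dir,
    }
    os.environ.update(settings)
    return settings


def cache_size_bytes(cache_dir):
    """Return the total size of all files under cache_dir, in bytes."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Hypothetical usage (call before importing mxnet):
#
#   configure_cuda_cache("~/.nv/ComputeCache")
#   import mxnet as mx
```

Checking `cache_size_bytes` after a run is a quick diagnostic: if the directory stays empty, the cache is not being written, which would explain why the settings make no difference.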

This page: https://github.com/apache/incubator-mxnet/issues/13040 also mentions SM_70 arch.wait update, but I wasn't able to understand what it means.

As I understand it, the problem is that some libraries have to be loaded onto the GPU, and that step cannot be skipped. The idea would be to cache these libraries so that they are not loaded again on later runs, but I don't know how to do this.

1 Answer

I installed CUDA 10.1 (previously 9.1) together with mxnet-cu101mkl, and the initialization time dropped to 2.5 seconds. Since I reinstalled almost every CUDA-related component, I don't know exactly which change helped.
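To check which MXNet build is actually installed after such a reinstall, something like the following sketch may help. It assumes Python 3.8+ (for `importlib.metadata`) and assumes the package naming convention in which a `cuNNN` tag encodes the CUDA version with the last digit as the minor version (so `cu101` means 10.1 and `cu91` means 9.1). Both helper names are hypothetical.

```python
import re
from importlib import metadata


def cuda_version_from_package(name):
    """Map a name like 'mxnet-cu101mkl' to '10.1'; return None if there is no cuNNN tag."""
    match = re.search(r"-cu(\d{2,3})", name.lower())
    if not match:
        return None
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"


def installed_mxnet_packages():
    """Return (name, version, cuda_version) for every installed package starting with 'mxnet'."""
    found = []
    for dist in metadata.distributions():
        name = dist.metadata.get("Name") or ""
        if name.lower().startswith("mxnet"):
            found.append((name, dist.version, cuda_version_from_package(name)))
    return sorted(found, key=lambda item: item[0])


if __name__ == "__main__":
    for name, version, cuda in installed_mxnet_packages():
        print(name, version, "CUDA", cuda)
```

If more than one `mxnet*` package shows up, or the CUDA tag does not match the installed toolkit, that mismatch is worth ruling out first.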