A follow-up to this question: Mxnet - slow array copy to GPU
Question: mxnet GPU initialization takes around 20 seconds. How can I fix it?
I have the following code:
import mxnet as mx
import mxnet.ndarray as nd
from mxnet import profiler
profiler.set_config(aggregate_stats=True)
ctx = mx.gpu()
profiler.set_state('run')
nd.random.uniform(-1, 1, shape=(1, 1), ctx=ctx)
nd.waitall()
profiler.set_state('stop')
print(profiler.dumps(reset=True))
And this is the profiler output:
Device Storage
=================
Name                                Total Count      Time (ms)  Min Time (ms)  Max Time (ms)  Avg Time (ms)
----                                -----------      ---------  -------------  -------------  -------------
Memory: gpu/0                                 3         0.0080         0.0040         0.0120         0.0040

MXNET_C_API
=================
Name                                Total Count      Time (ms)  Min Time (ms)  Max Time (ms)  Avg Time (ms)
----                                -----------      ---------  -------------  -------------  -------------
MXNDArrayWaitAll                              1         2.0640         2.0640         2.0640         2.0640
MXNDArrayFree                                 1         0.0010         0.0010         0.0010         0.0010
MXImperativeInvokeEx                          1     22197.0469     22197.0469     22197.0469     22197.0469
MXNet C API Concurrency                       6         0.0000         0.0000         0.0010         0.0005
MXNet C API Calls                             3         0.0030         0.0010         0.0030         0.0010

operator
=================
Name                                Total Count      Time (ms)  Min Time (ms)  Max Time (ms)  Avg Time (ms)
----                                -----------      ---------  -------------  -------------  -------------
_random_uniform                               2         1.6280         0.8130         0.8150         0.8140
DeleteVariable                                2         0.0130         0.0060         0.0070         0.0065
ResourceParallelRandomSetSeed                10        17.9840         0.4670         5.7260         1.7984
So the first call takes about 22 seconds (see MXImperativeInvokeEx in the output). Every operation after that is very fast, but the first GPU operation always takes about 22 seconds, no matter which operation it is. That makes me think the time is spent on initialization. How can I fix it?
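To show the "slow first call, fast afterwards" pattern without the profiler, here is a minimal sketch that times the same operation several times in one process. The helper names time_calls and gpu_op are made up for illustration; the MXNet calls are the same ones used in the code in this question.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn `repeats` times and return the wall-clock seconds of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def gpu_op():
    """One tiny GPU operation, same as in the question (hypothetical wrapper)."""
    # Imported lazily so time_calls can be used on machines without MXNet.
    import mxnet as mx

    mx.nd.random.uniform(-1, 1, shape=(1, 1), ctx=mx.gpu())
    mx.nd.waitall()  # block until the asynchronous GPU work has finished


def report_gpu_warmup(repeats=3):
    """Print the duration of each call; only the first one should be slow."""
    for i, seconds in enumerate(time_calls(gpu_op, repeats)):
        print(f"call {i}: {seconds:.3f} s")
```

Calling report_gpu_warmup() on a machine with MXNet and a GPU should print one line per call. If only call 0 is slow, the cost is a one-time, per-process setup cost rather than something tied to the operation itself.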
I tried this approach https://github.com/apache/incubator-mxnet/issues/3239:
export CUDA_CACHE_MAXSIZE=2147483647
export CUDA_CACHE_DISABLE=0
export CUDA_CACHE_PATH="my_home_path/.nv/ComputeCache"
but it doesn't work.
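One thing that can be checked is whether those variables are visible to the Python process at all, and whether anything is actually written to the cache directory. The sketch below assumes that the cache falls back to ~/.nv/ComputeCache on Linux when CUDA_CACHE_PATH is unset; the function name cuda_cache_report is made up for illustration.

```python
import os


def cuda_cache_report(env=None):
    """Summarize the CUDA cache settings and what is currently on disk."""
    env = os.environ if env is None else env
    # Assumption: default Linux location when CUDA_CACHE_PATH is not set.
    path = env.get("CUDA_CACHE_PATH") or os.path.expanduser("~/.nv/ComputeCache")
    n_files = 0
    total_bytes = 0
    # os.walk yields nothing if the directory does not exist.
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total_bytes += os.path.getsize(os.path.join(root, name))
                n_files += 1
            except OSError:
                pass  # skip files that vanish or cannot be read
    return {
        "disabled": env.get("CUDA_CACHE_DISABLE", "0") == "1",
        "max_size": env.get("CUDA_CACHE_MAXSIZE"),
        "path": path,
        "exists": os.path.isdir(path),
        "n_files": n_files,
        "total_bytes": total_bytes,
    }
```

Printing cuda_cache_report() before and after the slow script shows whether the cache grows. If the file count stays at zero, that would suggest the variables were not exported in the shell that launches Python, or that the directory is not writable.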
This page: https://github.com/apache/incubator-mxnet/issues/13040 also mentions the SM_70 arch, but I wasn't able to understand what it means.
As I understand it, the problem is that some libraries have to be loaded onto the GPU, and that simply has to be done. The idea is to cache these libraries so that they are not loaded again the second time, but I don't know how to do this.