Using only the driver API, I profiled a single process (below, using cuCtxCreate): the cuCtxCreate() overhead is nearly comparable to copying 300 MB of data to/from the GPU:
The CUDA documentation here says (for cuDevicePrimaryCtxRetain) that it "Retains the primary context on the device, creating it **if necessary**". Is this the expected behavior for repeated launches of the same process from the command line (such as running a process 1000 times to explicitly process 1000 different input images)? Does the device need CU_COMPUTEMODE_EXCLUSIVE_PROCESS for this to work as intended (re-using the same context when the process is run multiple times)?
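For reference, this is roughly how I would swap cuCtxCreate() for the primary-context calls (a minimal sketch, assuming device 0 and omitting error checks):

```cpp
#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // Retains (and creates, if this is the first retain) the primary context.
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);

    // ... module load, cuMemAlloc, kernel launch, copies ...

    // Matching release; the context is destroyed when the count reaches zero.
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```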
For now, the image above stays the same even if I run that process multiple times. Even without the profiler, timings show a completion time of around 1 second.
Edit: According to the documentation, the primary context is one per device per process. Does this mean there won't be a problem when using a multi-threaded single application?
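Here is a minimal sketch of what I mean by a multi-threaded single application sharing the one primary context (assuming std::thread, device 0, and no error checking):

```cpp
#include <cuda.h>
#include <thread>
#include <vector>

static CUcontext g_ctx;

void worker()
{
    // Each thread binds the same primary context to its own thread state.
    cuCtxSetCurrent(g_ctx);
    CUdeviceptr buf;
    cuMemAlloc(&buf, 1 << 20);   // allocation lives in the shared context
    cuMemFree(buf);
}

int main()
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&g_ctx, dev);   // one retain for the whole process

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
    for (auto &t : threads) t.join();

    cuDevicePrimaryCtxRelease(dev);          // matching release
    return 0;
}
```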
What is the re-use time limit for the primary context? Is 1 second between processes okay, or does it have to be milliseconds to keep the primary context alive?
I'm already caching PTX code in a file, so the only remaining overhead looks like cuMemAlloc(), malloc() and cuMemHostRegister(); re-using the latest context from the last run of the same process would improve timings nicely.
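The per-run allocation/registration part I'm referring to looks roughly like this (a sketch with a placeholder buffer size; it assumes a context is already current on the calling thread):

```cpp
#include <cuda.h>
#include <cstdlib>

void run_once(size_t bytes)
{
    // Host buffer from malloc(), then pinned via cuMemHostRegister().
    void *host = std::malloc(bytes);
    cuMemHostRegister(host, bytes, CU_MEMHOSTREGISTER_PORTABLE);

    CUdeviceptr dev_buf;
    cuMemAlloc(&dev_buf, bytes);

    // ... cuMemcpyHtoD / kernel launch / cuMemcpyDtoH ...

    cuMemFree(dev_buf);
    cuMemHostUnregister(host);
    std::free(host);
}
```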
Edit-2: The documentation for cuDevicePrimaryCtxRetain says "The caller must call cuDevicePrimaryCtxRelease() when done using the context." Is the "caller" here any process? Can I just call retain in the first launched process and call release in the last launched process of a list of hundreds of sequentially launched processes? Does the system need a reset if the last process couldn't be launched and cuDevicePrimaryCtxRelease was never called?
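As an experiment, I was thinking of having each newly launched process query the primary context state before retaining it, to see whether anything is actually being re-used (a small sketch, assuming device 0; whether the reported state reflects what other processes have done is exactly what I'm unsure about):

```cpp
#include <cuda.h>
#include <cstdio>

int main()
{
    CUdevice dev;
    unsigned int flags = 0;
    int active = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // Reports the flags and whether the primary context is currently active.
    cuDevicePrimaryCtxGetState(dev, &flags, &active);
    std::printf("primary context active: %d\n", active);
    return 0;
}
```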
Edit-3:
Is the primary context intended for this?
process-1: retain (creates)
process-2: retain (re-uses)
...
process-99: retain (re-uses)
process-100: 1 x retain and 100 x release (to bring the reference count to zero and unload at the end)
- Everything is compiled for sm_30 and the device is a GRID K520.
- The GPU was at boost frequency during cuCtxCreate().
- The project was compiled 64-bit (release mode) on Windows Server 2016, with the CUDA driver installed in Windows 7 compatibility mode (this was the only way that worked for the K520 + Windows Server 2016 combination).