Using only the driver API, I profiled a single process (below, using cuCtxCreate): the cuCtxCreate() overhead is nearly comparable to copying 300 MB of data to/from the GPU:
The CUDA documentation here says (for cuDevicePrimaryCtxRetain) that it "Retains the primary context on the device, creating it **if necessary**". Is this the expected behavior for repeated launches of the same process from the command line (such as running a process 1000 times to explicitly process 1000 different input images)? Does the device need CU_COMPUTEMODE_EXCLUSIVE_PROCESS for this to work as intended (re-using the same context when the process is run multiple times)?
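For reference, this is roughly how I would swap cuCtxCreate() for the primary-context calls (a minimal sketch, assuming device 0 and omitting error checks):

```cpp
#include <cuda.h>

int main()
{
    CUdevice dev;
    CUcontext ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // Retains (and creates, if this is the first retain) the primary context.
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);

    // ... module load, cuMemAlloc, kernel launch, copies ...

    // Matching release; the context is destroyed when the count reaches zero.
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```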
For now, the image above stays the same even if I run that process multiple times. Even without the profiler, timings show a completion time of around 1 second.
Edit: According to the documentation, the primary context is one per device per process. Does this mean there won't be a problem when using a multi-threaded single application?
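Here is a minimal sketch of what I mean by a multi-threaded single application sharing the one primary context (assuming std::thread, device 0, and no error checking):

```cpp
#include <cuda.h>
#include <thread>
#include <vector>

static CUcontext g_ctx;

void worker()
{
    // Each thread binds the same primary context to its own thread state.
    cuCtxSetCurrent(g_ctx);
    CUdeviceptr buf;
    cuMemAlloc(&buf, 1 << 20);   // allocation lives in the shared context
    cuMemFree(buf);
}

int main()
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&g_ctx, dev);   // one retain for the whole process

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i) threads.emplace_back(worker);
    for (auto &t : threads) t.join();

    cuDevicePrimaryCtxRelease(dev);          // matching release
    return 0;
}
```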
What is the re-use time limit for the primary context? Is 1 second between processes okay, or does it have to be milliseconds to keep the primary context alive?
I'm already caching PTX code in a file, so the only remaining overhead looks like cuMemAlloc(), malloc() and cuMemHostRegister(); re-using the latest context from the last run of the same process would improve timings nicely.
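The per-run allocation/registration part I'm referring to looks roughly like this (a sketch with a placeholder buffer size; it assumes a context is already current on the calling thread):

```cpp
#include <cuda.h>
#include <cstdlib>

void run_once(size_t bytes)
{
    // Host buffer from malloc(), then pinned via cuMemHostRegister().
    void *host = std::malloc(bytes);
    cuMemHostRegister(host, bytes, CU_MEMHOSTREGISTER_PORTABLE);

    CUdeviceptr dev_buf;
    cuMemAlloc(&dev_buf, bytes);

    // ... cuMemcpyHtoD / kernel launch / cuMemcpyDtoH ...

    cuMemFree(dev_buf);
    cuMemHostUnregister(host);
    std::free(host);
}
```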
Edit-2: The documentation for cuDevicePrimaryCtxRetain says "The caller must call cuDevicePrimaryCtxRelease() when done using the context." Is the "caller" here any process? Can I just call retain in the first launched process and call release in the last launched process of a list of hundreds of sequentially launched processes? Does the system need a reset if the last process couldn't be launched and cuDevicePrimaryCtxRelease was never called?
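As an experiment, I was thinking of having each newly launched process query the primary context state before retaining it, to see whether anything is actually being re-used (a small sketch, assuming device 0; whether the reported state reflects what other processes have done is exactly what I'm unsure about):

```cpp
#include <cuda.h>
#include <cstdio>

int main()
{
    CUdevice dev;
    unsigned int flags = 0;
    int active = 0;

    cuInit(0);
    cuDeviceGet(&dev, 0);

    // Reports the flags and whether the primary context is currently active.
    cuDevicePrimaryCtxGetState(dev, &flags, &active);
    std::printf("primary context active: %d\n", active);
    return 0;
}
```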
Edit-3:
Is the primary context intended for this?
process-1: retain (creates)
process-2: retain (re-uses)
...
process-99: retain (re-uses)
process-100: 1 x retain and 100 x release (to bring the reference count to zero and unload at the end)
- Everything is compiled for sm_30 and the device is a GRID K520.
- The GPU was at boost frequency during cuCtxCreate().
- The project was compiled 64-bit (release mode) on Windows Server 2016, with the CUDA driver installed in Windows 7 compatibility mode (this was the only way that worked for the K520 + Windows Server 2016 combination).