
Given a pool of GPUs, we check whether a GPU is free by checking that no process is currently running on it.
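
Concretely, the check is roughly along these lines (a minimal Python sketch; the exact `nvidia-smi` query and the device index 0 are just illustrative):

```python
import subprocess

def gpu_is_free(index: int) -> bool:
    """Return True if no compute process is currently running on the given GPU."""
    pids = subprocess.run(
        ["nvidia-smi", "-i", str(index),
         "--query-compute-apps=pid", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    return pids == ""  # no PIDs listed means no compute process on this GPU

print(gpu_is_free(0))  # e.g. check GPU 0
```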

The problem is that our process does not start using the GPU right away. During that window the GPU still looks free, so there is a chance that two processes get assigned the same GPU.

Is it possible to lock / reserve a GPU for a specific process? Via shell?

The GPU should only be usable by the running process until it finishes, then the GPU should be free again.

Spenhouet
  • This is a standard capability of many GPU-aware job schedulers such as SLURM. – Robert Crovella Jun 25 '20 at 15:57
  • @RobertCrovella Yes, I know. We are using Slurm on our cluster. The above is a special use-case in a special scenario. I will have to solve it via a shell script. Don't worry, I'm not going to reinvent resource scheduling. The above question still stands. – Spenhouet Jun 25 '20 at 16:49
  • The premise of the question makes no sense. You want to "reserve" a GPU for a process that doesn't yet exist. How could any existing process (like a shell) know the PID of a future process? – talonmies Jun 25 '20 at 17:42
  • @talonmies The process exists, it just has not started using the GPU yet. – Spenhouet Jun 25 '20 at 17:49
  • So just design your application properly so that the first thing it does is call cudaFree, and there is no problem to solve. – talonmies Jun 25 '20 at 17:55
  • @talonmies What do you mean by cudaFree? That seems to be a command from the C++ interface to free up memory for variables. It is not clear how that should help when there is no assigned variable yet, how that should block the GPU, or what that has to do with "proper application design" (quite the contrary, this looks like an anti-pattern). Could you give an example (shell or python command)? – Spenhouet Jun 26 '20 at 06:09
  • https://stackoverflow.com/q/10415204/681865 – talonmies Jun 26 '20 at 06:16
  • @talonmies The actual process that should get the context is a python script. I don't seem to have access to cudaFree. I can allocate some memory (with `torch.Tensor([0]).to('cuda:0')`) and that should do the trick, but this is definitely a hack and not good practice. – Spenhouet Jun 26 '20 at 07:43
  • Another detail you failed to mention. The driver can make a process-exclusive device selection, but that is triggered by context creation by the running process. Your application needs to create that context, and it needs to happen at the beginning of the lifecycle if you want that exclusivity to be triggered. How you choose to do that is your business. What you decide is or is not good practice is also your business. Several posters have been patiently explaining how this works. If you choose to reject that reality and replace it with your own, that is also your business. – talonmies Jun 26 '20 at 08:01
  • @talonmies I'm not sure where your anger is coming from and why you constantly feel the need to attack. Maybe don't assume bad intent. My question clearly defined that I want to lock the GPU for an (existing) process via shell. Talking about Slurm, cudaFree, exclusive compute mode and context initialization by the actual process is just not an answer to the question. The answer would have been: "That is not possible. There is no built-in tool to do that." I still learned that from the replies and will apply the above-mentioned hack as a workaround. – Spenhouet Jun 26 '20 at 08:52

1 Answer


Assuming the GPU is an Nvidia one (inferred from the tags), it's a similar answer to: https://stackoverflow.com/a/50056586/6857772

Answer to your question

sudo nvidia-smi -c 3

will put the device into exclusive-process compute mode, allowing only a single process to create a context on the device.
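
For completeness, the mode can be applied per device with `-i <index>`, checked, and reverted afterwards (a sketch wrapping standard `nvidia-smi` flags from Python; the index 0 is just an example, and changing the mode requires root):

```python
import subprocess

GPU_INDEX = "0"  # illustrative; use the index of the GPU you want to reserve

def query_compute_mode(index: str) -> str:
    """Return the current compute mode reported by nvidia-smi for one GPU."""
    return subprocess.run(
        ["nvidia-smi", "-i", index,
         "--query-gpu=compute_mode", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

# Setting the mode requires root, e.g.:
#   sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS   (same as -c 3)
#   sudo nvidia-smi -i 0 -c DEFAULT             (restore shared mode afterwards)
print(query_compute_mode(GPU_INDEX))
```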

Note that the process itself must actually create a context, ideally at startup, for this to be effective. How you do that depends on what the process itself is and what API family it uses to interface to CUDA (i.e. the runtime or driver API, or some level of abstraction built on top of the runtime or driver API). There is no way for another process to do this on the GPU process's behalf.
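
If the process in question is a Python script (as it is in the comments here), one way to do that is to force context creation at the very top of the script, for example with PyTorch, using the same snippet mentioned in the comments (the device index is illustrative):

```python
import torch

# Any CUDA allocation forces PyTorch to create a CUDA context on the device.
# With the GPU in exclusive-process compute mode, no other process can create
# a context on it until this process exits, so the device is effectively
# reserved from this point on.
_ = torch.Tensor([0]).to('cuda:0')

# ... the rest of the program runs with the GPU held by this process ...
```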

James Sharpe (edited by talonmies)
  • Sure, I was aware of the exclusive mode, but as long as there is no context the GPU is free for everyone. Is it possible to take the context without doing anything? If this is not possible, then the exclusive mode does not help. – Spenhouet Jun 25 '20 at 16:48
  • Restricted group ownership of the device nodes for the device could help, or cgroups, which I believe is how Slurm does it. – James Sharpe Jun 25 '20 at 16:55
  • I will have to look up groups / cgroups tomorrow. Do you know if they allow ownership on a process level? – Spenhouet Jun 25 '20 at 17:50
  • If you want ownership when the process starts, why not create the context when the process starts? – cwharris Jun 25 '20 at 19:57
  • Also, curious, what's the use case? – cwharris Jun 25 '20 at 19:58
  • @cwharris What do you mean with "create the context"? I'm not aware of such a command. Imaginary command: `nvidia-smi --create_context --process_id 123` – Spenhouet Jun 26 '20 at 06:02
  • @Spenhouet: The process *itself* must create the context – talonmies Jun 26 '20 at 06:24
  • I can initiate (fake) the context by allocating some memory with `torch.Tensor([0]).to('cuda:0')`. That is definitely a solution (just not a good one, but to be fair, my requirement / use-case is also not best practice to begin with). – Spenhouet Jun 26 '20 at 07:45
  • I marked this as answer based on the last sentence "There is no way for another process to [create a context] on the GPU processes behalf." added by @talonmies since this was the information I was looking for. – Spenhouet Jun 26 '20 at 14:32