I plan to build a checkpoint/restart library for CUDA applications on top of BLCR.

To do this, I have to tear down the CUDA side of the application completely while the process is still running.

This is because BLCR fails to run cr_checkpoint if the process still has state on the GPU. As a test, I called cudaDeviceReset() at some point and then called sleep(1000); during the sleep, I checkpointed the process with cr_checkpoint PID. In that case the context.PID file was created successfully, but restarting with cr_run context.PID failed. The error message is as follows:

-mmap(0, 200000000, 2700000000, ...) = 0xfffffffffffffff4 (failed) -thaw_threads returned error, aborting. -12 Restart failed: Cannot allocate memory

Does anyone have any idea what causes this? In summary:

  1. I plan to build a checkpoint/restart library for CUDA applications on top of BLCR.
  2. I tried calling cudaDeviceReset(), but the restart failed (the context.PID file was created, but the process could not be restarted from it).
  3. I want to know how to destroy or reset the CUDA state of an application completely while the process is still running.

I would appreciate any ideas.

user2779344

1 Answer

cudaDeviceReset() does destroy the device side of any CUDA application completely, including stopping the running code, resetting the GPU, and deleting any device memory allocations. It doesn't stop the host portion of the application, or affect it except for the allocations I mentioned.

I don't know that it destroys all CUDA contexts, however. (It may, I simply don't know.) I think you may have interpreted the failure of cr_run incorrectly. You may want to read this (unfortunately the paper is now behind a paywall). It's possible that you still have an extant CUDA context at the point of your cr_checkpoint.

You might want to use driver API functions to explicitly manage and destroy any CUDA contexts.
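As a minimal sketch of what explicit context management with the driver API can look like, the program below retains the device's primary context, releases it again, and reports the context state before and after. It assumes the CUDA toolkit headers are installed, that the program is linked with -lcuda, and that device 0 is the GPU in use. The helper `report_primary_ctx` is hypothetical. Whether releasing the context is enough for BLCR to checkpoint and restart the process is not something this sketch establishes.

```c
#include <cuda.h>   /* CUDA driver API; link with -lcuda */
#include <stdio.h>

/* Print whether the primary context on `dev` is currently active. */
static void report_primary_ctx(CUdevice dev, const char *when) {
    unsigned int flags = 0;
    int active = 0;
    if (cuDevicePrimaryCtxGetState(dev, &flags, &active) == CUDA_SUCCESS)
        printf("%s: primary context %s\n", when, active ? "active" : "inactive");
}

int main(void) {
    CUdevice dev;
    CUcontext ctx;

    if (cuInit(0) != CUDA_SUCCESS) return 1;
    if (cuDeviceGet(&dev, 0) != CUDA_SUCCESS) return 1;  /* assumes device 0 */

    /* Take an explicit reference on the device's primary context. */
    if (cuDevicePrimaryCtxRetain(&ctx, dev) != CUDA_SUCCESS) return 1;
    if (cuCtxSetCurrent(ctx) != CUDA_SUCCESS) return 1;
    report_primary_ctx(dev, "after retain");

    /* ... GPU work would go here ... */
    cuCtxSynchronize();

    /* Unbind the context and drop our reference before the checkpoint window. */
    cuCtxSetCurrent(NULL);
    cuDevicePrimaryCtxRelease(dev);
    report_primary_ctx(dev, "after release");

    return 0;
}
```

The primary context is the one the runtime API uses, so checking its state after cudaDeviceReset() is one way to test the guess that a context is still alive at checkpoint time.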

Robert Crovella
  • @Robert Crovella I'm having this exact problem, the website you pointed to for recommended reading now gives a 404 error. Might you still have the title of that resource or an updated link? It would be much appreciated. Thanks! – John Oct 01 '17 at 22:46
  • I've updated the link to point to the correct UPDAS paper (you can see the link [here](http://www.sc.isc.tohoku.ac.jp/~tacky/)). Unfortunately it is now behind a paywall. If you or your org has ACM membership, you should be able to access it. – Robert Crovella Oct 01 '17 at 22:53