
When I use Ray with PyTorch, I do not set any `num_gpus` flag for the remote class.

I get the following error:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. 

The flow is: I create a remote class and transfer a PyTorch model `state_dict()` (created in the main function) to it. In the main function, `torch.cuda.is_available()` is True, but in the remote function, `torch.cuda.is_available()` is False.
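
Roughly, the failing pattern looks like this (a minimal sketch with hypothetical names; the `state_dict()` tensors live on the GPU, so deserializing them in an actor that has no GPU assigned raises the error above):

import ray
import torch
import torch.nn as nn


@ray.remote  # no num_gpus, so torch.cuda.is_available() is False inside the actor
class ModelHolder(object):
    def set_weights(self, state_dict):
        # unpickling CUDA tensors here fails because the actor sees no GPU
        self.state_dict = state_dict


if __name__ == "__main__":
    ray.init()
    model = nn.Linear(4, 2).cuda()  # created in the main process, where CUDA works
    holder = ModelHolder.remote()
    ray.get(holder.set_weights.remote(model.state_dict()))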

I tried setting `num_gpus=1` and got a new issue: the program just gets stuck. Below is minimal example code to reproduce the issue. Thanks.

import ray


@ray.remote(num_gpus=1)
class Worker(object):
    def __init__(self, args):
        self.args = args
        self.gen_frames = 0

    def set_gen_frames(self, value):
        self.gen_frames = value
        return self.gen_frames

    def get_gen_num(self):
        return self.gen_frames


class Parameters:
    def __init__(self):
        self.is_cuda = False
        self.is_memory_cuda = True
        self.pop_size = 10


if __name__ == "__main__":
    ray.init()
    args = Parameters()
    workers = [Worker.remote(args) for _ in range(args.pop_size)]
    get_num_ids = [worker.get_gen_num.remote() for worker in workers]
    gen_nums = ray.get(get_num_ids)
    print(gen_nums)
Han Zheng
  • I'd suggest posting a minimal example that can be run. However, it sounds like you need to use `@ray.remote(num_gpus=1)`. Why are you not using this flag? – Robert Nishihara Jan 31 '19 at 00:01
  • Because the Ray tutorial says Ray will detect the available GPUs automatically. I tried setting `num_gpus=1` and hit another issue: the program just gets stuck. I will update my question with minimal code to reproduce the problem. – Han Zheng Jan 31 '19 at 00:29
  • The call to `ray.init()` should automatically detect that the *machine* has GPUs available, but tasks will not have GPUs reserved for them unless they explicitly require them in the `@ray.remote` decorator. – Robert Nishihara Feb 02 '19 at 02:08
  • Got it. Thanks. – Han Zheng Feb 02 '19 at 10:10

1 Answer


If you want to deploy the model on a GPU, you need to make sure that your actor or task actually has access to a GPU; with `@ray.remote(num_gpus=1)`, `torch.cuda.is_available()` will be true inside that remote function. If you want to deploy the model on a CPU, you need to specify that when loading it, see for example https://github.com/pytorch/pytorch/issues/9139.
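
For the CPU case, a minimal sketch of what that looks like (hypothetical model and file name; `map_location` is the mechanism described in the linked issue):

import torch
import torch.nn as nn

# option 1: remap CUDA tensors to CPU while loading a saved checkpoint
state_dict = torch.load("model.pt", map_location=torch.device("cpu"))

# option 2: if you pass a live state_dict through Ray instead of a file,
# move the tensors to CPU first so the actor never deserializes CUDA tensors
model = nn.Linear(4, 2)  # stand-in for the real model
cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}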

Philipp Moritz
  • I tried this and got a new issue, see my edited question. Thanks. – Han Zheng Jan 31 '19 at 00:34
  • Ah, it might be that our automatic GPU detection is not working for you. What is the output of `ls /proc/driver/nvidia/gpus` (and which platform are you on)? Also, can you try `ray.init(num_gpus=1)`? – Philipp Moritz Jan 31 '19 at 00:39
  • The output of `ls /proc/driver/nvidia/gpus` is `0000:03:00.0 0000:82:00.0`, and `ray.init(num_gpus=1)` still gives the same issue. – Han Zheng Jan 31 '19 at 00:45
  • My platform is Red Hat 7.3. – Han Zheng Jan 31 '19 at 00:49
  • It looks like you only have one GPU, but your program requires `args.pop_size` GPUs to run; I think that's why it is hanging. Does that sound correct? – Philipp Moritz Jan 31 '19 at 00:51
  • Thanks. When I set `args.pop_size=1`, it works. So each worker reserves `num_gpus` GPUs? I have two GPUs, and I want the workers to share those two GPUs (or one GPU) instead of one worker per GPU. Any suggestions? Thanks. – Han Zheng Jan 31 '19 at 01:05
  • Yes, you can also use fractional GPUs, i.e. `@ray.remote(num_gpus=0.2)`, to share two GPUs among ten workers (see the sketch after these comments). – Philipp Moritz Jan 31 '19 at 01:14
  • Thanks. Even though I have two GPUs, Ray seems to detect only one? When I set `num_gpus=0.2`, it hangs again. Do I need some extra setting? – Han Zheng Jan 31 '19 at 01:27
  • I checked again; Ray used two GPUs. Thanks again. – Han Zheng Jan 31 '19 at 04:34
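
For reference, a sketch of the fractional-GPU setup from the comments above, adapting the question's example so ten workers share two GPUs at 0.2 GPUs each (assumes Ray detects both GPUs):

import ray
import torch


@ray.remote(num_gpus=0.2)  # ten of these actors fit on two GPUs
class Worker(object):
    def __init__(self):
        # Ray sets CUDA_VISIBLE_DEVICES for the actor, so CUDA is available here
        self.cuda_ok = torch.cuda.is_available()

    def cuda_available(self):
        return self.cuda_ok


if __name__ == "__main__":
    ray.init(num_gpus=2)  # or plain ray.init() if auto-detection finds both GPUs
    workers = [Worker.remote() for _ in range(10)]
    print(ray.get([w.cuda_available.remote() for w in workers]))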