
When I use Ray with PyTorch, I do not set any `num_gpus` flag for the remote class.

I get the following error:

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. 

The flow is: I create a remote class and transfer a PyTorch model `state_dict()` (created in the main function) to it. In the main function, `torch.cuda.is_available()` is True, but in the remote function, `torch.cuda.is_available()` is False.
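
Roughly, the failing pattern looks like this (a minimal sketch with hypothetical names; the `state_dict()` tensors live on the GPU, so deserializing them in an actor that has no GPU assigned raises the error above):

import ray
import torch
import torch.nn as nn


@ray.remote  # no num_gpus, so torch.cuda.is_available() is False inside the actor
class ModelHolder(object):
    def set_weights(self, state_dict):
        # unpickling CUDA tensors here fails because the actor sees no GPU
        self.state_dict = state_dict


if __name__ == "__main__":
    ray.init()
    model = nn.Linear(4, 2).cuda()  # created in the main process, where CUDA works
    holder = ModelHolder.remote()
    ray.get(holder.set_weights.remote(model.state_dict()))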

I tried setting `num_gpus=1` and got a new issue: the program just gets stuck. Below is minimal example code to reproduce the issue. Thanks.

import ray


@ray.remote(num_gpus=1)
class Worker(object):
    def __init__(self, args):
        self.args = args
        self.gen_frames = 0

    def set_gen_frames(self, value):
        self.gen_frames = value
        return self.gen_frames

    def get_gen_num(self):
        return self.gen_frames


class Parameters:
    def __init__(self):
        self.is_cuda = False
        self.is_memory_cuda = True
        self.pop_size = 10


if __name__ == "__main__":
    ray.init()
    args = Parameters()
    workers = [Worker.remote(args) for _ in range(args.pop_size)]
    get_num_ids = [worker.get_gen_num.remote() for worker in workers]
    gen_nums = ray.get(get_num_ids)
    print(gen_nums)
Han Zheng
  • I'd suggest posting a minimal example that can be run. However, it sounds like you need to use `@ray.remote(num_gpus=1)`. Why are you not using this flag? – Robert Nishihara Jan 31 '19 at 00:01
  • Because the Ray tutorial says Ray will detect the available GPUs automatically. I tried setting `num_gpus=1` and hit another issue: the program just gets stuck. I will update my question with minimal code to reproduce the problem. – Han Zheng Jan 31 '19 at 00:29
  • The call to `ray.init()` should automatically detect that the *machine* has GPUs available, but tasks will not have GPUs reserved for them unless they explicitly require them in the `@ray.remote` decorator. – Robert Nishihara Feb 02 '19 at 02:08
  • Got it. Thanks. – Han Zheng Feb 02 '19 at 10:10

1 Answer


If you want to deploy the model on a GPU, you need to make sure that your actor or task actually has access to a GPU; with `@ray.remote(num_gpus=1)`, `torch.cuda.is_available()` will be true inside that remote function. If you want to deploy the model on a CPU, you need to specify that when loading it, see for example https://github.com/pytorch/pytorch/issues/9139.
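
For the CPU case, a minimal sketch of what that looks like (hypothetical model and file name; `map_location` is the mechanism described in the linked issue):

import torch
import torch.nn as nn

# option 1: remap CUDA tensors to CPU while loading a saved checkpoint
state_dict = torch.load("model.pt", map_location=torch.device("cpu"))

# option 2: if you pass a live state_dict through Ray instead of a file,
# move the tensors to CPU first so the actor never deserializes CUDA tensors
model = nn.Linear(4, 2)  # stand-in for the real model
cpu_state_dict = {k: v.cpu() for k, v in model.state_dict().items()}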

Philipp Moritz
  • I tried this and got a new issue, see my edited question. Thanks. – Han Zheng Jan 31 '19 at 00:34
  • Ah, it might be that our automatic GPU detection is not working for you. What is the output of `ls /proc/driver/nvidia/gpus` (and which platform are you on)? Also, can you try `ray.init(num_gpus=1)`? – Philipp Moritz Jan 31 '19 at 00:39
  • The output of `ls /proc/driver/nvidia/gpus` is `0000:03:00.0 0000:82:00.0`, and `ray.init(num_gpus=1)` still gives the same issue. – Han Zheng Jan 31 '19 at 00:45
  • My platform is Red Hat 7.3. – Han Zheng Jan 31 '19 at 00:49
  • It looks like you only have one GPU, but your program requires `args.pop_size` GPUs to run; I think that's why it is hanging. Does that sound correct? – Philipp Moritz Jan 31 '19 at 00:51
  • Thanks. When I set `args.pop_size=1`, it works. So each worker reserves `num_gpus` GPUs? I have two GPUs, and I want the workers to share those two GPUs (or one GPU) instead of one worker per GPU. Any suggestions? Thanks. – Han Zheng Jan 31 '19 at 01:05
  • Yes, you can also use fractional GPUs, i.e. `@ray.remote(num_gpus=0.2)`, to share two GPUs among ten workers (see the sketch after these comments). – Philipp Moritz Jan 31 '19 at 01:14
  • Thanks. Even though I have two GPUs, Ray seems to detect only one? When I set `num_gpus=0.2`, it hangs again. Do I need some extra setting? – Han Zheng Jan 31 '19 at 01:27
  • I checked again; Ray used two GPUs. Thanks again. – Han Zheng Jan 31 '19 at 04:34
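
For reference, a sketch of the fractional-GPU setup from the comments above, adapting the question's example so ten workers share two GPUs at 0.2 GPUs each (assumes Ray detects both GPUs):

import ray
import torch


@ray.remote(num_gpus=0.2)  # ten of these actors fit on two GPUs
class Worker(object):
    def __init__(self):
        # Ray sets CUDA_VISIBLE_DEVICES for the actor, so CUDA is available here
        self.cuda_ok = torch.cuda.is_available()

    def cuda_available(self):
        return self.cuda_ok


if __name__ == "__main__":
    ray.init(num_gpus=2)  # or plain ray.init() if auto-detection finds both GPUs
    workers = [Worker.remote() for _ in range(10)]
    print(ray.get([w.cuda_available.remote() for w in workers]))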