I want a remote class to hold a handle to another remote class so that it can invoke it later. The following code provides an example:

import ray

ray.init()
@ray.remote
class Worker:
    def __init__(self):
        self.a = 1
        self.l = None

    def set(self, learner):
        self.l = learner

    def step(self):
        x = ray.get(self.l.p.remote(self.a))
        return x

@ray.remote
class Learner:
    def __init__(self):
        self.a = 3

    def step(self, worker):
        print(ray.get(worker.step.remote()))

    def p(self, a):
        return a + self.a

l = Learner.remote()
w = Worker.remote()
w.set.remote(l)
ray.get(l.step.remote(w))
ray.shutdown()

However, this code does not work; it hangs without emitting any error. I know the problem comes from the step function in Worker, but I don't know why it is wrong or how to fix it.

Maybe

1 Answer

First, note that ray.get is a blocking call. Your program cannot move on to the next line of code until ray.get returns. (You can bound the wait by passing a timeout argument to ray.get.)

The hang happens because l is blocked until worker.step.remote() finishes (ray.get(worker.step.remote())). When worker.step runs, it calls l.p.remote, and w then blocks in ray.get(self.l.p.remote(self.a)) until l.p finishes. But l is still blocked inside l.step, and an actor like this one runs one method at a time, so l.p cannot start until l.step is done. The wait cycle looks like this:

driver waits on l.step -> l.step waits on w.step -> w.step waits on l.p -> l.p is queued behind l.step

Now both actors are blocked and l.step.remote will never finish. That means your driver (the Python script) is blocked as well.

As a result, the whole program hangs.
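The same wait cycle can be reproduced without Ray. The sketch below is only an analogy using the standard library: each single-threaded pool stands in for one actor that runs one task at a time, and the names learner_pool, worker_pool, learner_p, worker_step and learner_step are made up for illustration. A timeout is added so the example terminates instead of hanging.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

# Each single-threaded pool stands in for one actor: it runs one task at a time.
learner_pool = ThreadPoolExecutor(max_workers=1)
worker_pool = ThreadPoolExecutor(max_workers=1)


def learner_p(a):
    return a + 3


def worker_step():
    # Like ray.get(self.l.p.remote(self.a)): queue p on the learner and wait.
    future = learner_pool.submit(learner_p, 1)
    try:
        return future.result(timeout=0.5)
    except TimeoutError:
        # The learner's only thread is still busy inside learner_step.
        return "timed out: learner is busy in learner_step"


def learner_step():
    # Like ray.get(worker.step.remote()): holds the learner's only thread
    # for as long as the worker takes to answer.
    return worker_pool.submit(worker_step).result()


outcome = learner_pool.submit(learner_step).result()
print(outcome)

learner_pool.shutdown()
worker_pool.shutdown()
```

Without the timeout, worker_step would wait forever, which is the behaviour seen in the question.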

So how do you solve this problem?

First, I strongly discourage a pattern in which two actor classes wait for each other. It is generally a bad pattern in any kind of program. The deadlock can be avoided when the programs are multi-threaded or asynchronous.
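One possible way to remove the mutual wait, offered here as an assumption rather than something prescribed above, is to let the driver coordinate: it asks the worker for its value, then passes that value to the learner, so neither side ever waits on the other. The sketch uses the same standard-library analogy (one single-threaded pool per actor); worker_value and learner_p are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

# One single-threaded pool per "actor", as a stand-in for Ray actors.
learner_pool = ThreadPoolExecutor(max_workers=1)
worker_pool = ThreadPoolExecutor(max_workers=1)


def worker_value():
    # Plays the role of Worker.a in the question.
    return 1


def learner_p(a):
    # Plays the role of Learner.p in the question.
    return a + 3


# The driver is the only party that waits, so there is no cycle.
a = worker_pool.submit(worker_value).result()
total = learner_pool.submit(learner_p, a).result()
print(total)

learner_pool.shutdown()
worker_pool.shutdown()
```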

If you really need this pattern, you can use async actors. An async actor uses await instead of ray.get, and the actors are not blocked because their methods run as coroutines.

https://ray.readthedocs.io/en/latest/async_api.html

Example:

import ray

ray.init()
@ray.remote
class Worker:
    def __init__(self):
        self.a = 1
        self.l = None

    def set(self, learner):
        self.l = learner

    async def step(self):
        x = await self.l.p.remote(self.a)
        return x

@ray.remote
class Learner:
    def __init__(self):
        self.a = 3

    async def step(self, worker):
        print(await worker.step.remote())

    async def p(self, a):
        return a + self.a

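l = Learner.remote()
w = Worker.remote()
# Wait for set() to finish so the worker holds the learner handle before step() runs.
ray.get(w.set.remote(l))
# A plain script is not inside a coroutine, so the driver blocks with ray.get here.
# Where top-level await is available (e.g. a notebook), `await l.step.remote(w)` also works.
ray.get(l.step.remote(w))
ray.shutdown()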
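To see why await removes the deadlock, here is a toy model written with asyncio only. It is a sketch under stated assumptions and not how Ray implements async actors: the Actor class, its call and serve methods, and the run helper are all hypothetical. Each actor has a mailbox; in blocking mode it finishes one message before taking the next, while in concurrent mode it starts each message as a task and returns to the mailbox right away.

```python
import asyncio


class Actor:
    """Toy actor: a mailbox of (coroutine function, args, reply future)."""

    def __init__(self, concurrent):
        self.mailbox = asyncio.Queue()
        self.concurrent = concurrent
        self.tasks = set()  # keep references so running handlers are not collected

    def call(self, method, *args):
        # Enqueue a message and return a future that will hold the result.
        reply = asyncio.get_running_loop().create_future()
        self.mailbox.put_nowait((method, args, reply))
        return reply

    async def _handle(self, method, args, reply):
        result = await method(*args)
        if not reply.done():
            reply.set_result(result)

    async def serve(self):
        while True:
            method, args, reply = await self.mailbox.get()
            if self.concurrent:
                # Async style: start the message, then go back to the mailbox.
                task = asyncio.create_task(self._handle(method, args, reply))
                self.tasks.add(task)
                task.add_done_callback(self.tasks.discard)
            else:
                # Blocking style: finish this message before taking the next.
                await self._handle(method, args, reply)


async def run(concurrent):
    learner, worker = Actor(concurrent), Actor(concurrent)

    async def p(a):
        return a + 3

    async def worker_step():
        return await learner.call(p, 1)

    async def learner_step():
        return await worker.call(worker_step)

    servers = [asyncio.create_task(actor.serve()) for actor in (learner, worker)]
    try:
        # The timeout only exists so the blocking case terminates.
        return await asyncio.wait_for(learner.call(learner_step), timeout=0.5)
    except asyncio.TimeoutError:
        return "deadlock"
    finally:
        for server in servers:
            server.cancel()


blocking_result = asyncio.run(run(concurrent=False))
async_result = asyncio.run(run(concurrent=True))
print(blocking_result, async_result)
```

In blocking mode the learner never reaches the p message because it is still handling learner_step, so the call times out. In concurrent mode the learner's handler is suspended at its await, the mailbox loop picks up p, and the chain completes.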
Sang