
There seems to be a problem mixing PyTorch's autograd with joblib. I need to compute gradients in parallel for a lot of samples. joblib works fine with other parts of PyTorch; however, when mixed with autograd it raises errors. I made a very small example which shows that the serial version works fine but the parallel version crashes.

from joblib import Parallel, delayed
import numpy as np
import torch
from torch import autograd
torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    return autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    yi = xi * xi
    xs += [xi]
    ys += [yi]


Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

The error message is not very helpful either:

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Roy

2 Answers


The problem is that Parallel uses "loky" as its default backend; you should use "threading" as the backend, and then your code will run as intended. Refer to the joblib documentation for the Parallel class.

So editing your provided code as follows:

from joblib import Parallel, delayed
import numpy as np
import torch
torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    return torch.autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    yi = xi * xi
    xs += [xi]
    ys += [yi]


Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2, backend="threading")([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

will give the following results:

Grads_serial [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]
Grads_parallel [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]

I hope this response is helpful for you; have a good day.

mak13
  • Thanks, but threading has no performance gain due to the GIL, right? – Roy Apr 20 '22 at 22:02
  • I quote from the documentation: "“threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy)." I don't know for sure whether the PyTorch case is similar to the mentioned NumPy call; if it is, then using threading is useful. – mak13 Apr 24 '22 at 22:54

joblib does not copy the graph associated with the operations to the worker processes. One way to work around this is to perform the computation inside the job, so that the graph is created in the worker itself.

import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np
torch.autograd.set_detect_anomaly(False)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    # This will compute yi in the job, and thus will
    # create the graph here
    yi = Out[0](*Out[1])
    # now the differentiation works
    return autograd.grad(yi, X, create_graph=True, allow_unused=False)[0]

torch.set_num_threads(1)  # limit torch's intra-op threads
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    # yi is now a (function, args) pair: the multiplication is deferred until Grad runs in the worker
    yi = lambda xi: xi * xi, [xi]
    xs += [xi]
    ys += [yi]

Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

Edit

More philosophical questions are:

(1) Does it make sense to use joblib parallelism if you can simply vectorize your operations and let torch use its intra-operator parallelism? (A rough sketch of this option follows after point (2).)

(2) mak13 mentioned using the threading backend, and it is good that it fixes your example. But multiple threads will use only one CPU at a time; that makes sense for IO-bound jobs, like making HTTP requests, but not for CPU-bound operations.
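
For point (1), a rough sketch of the vectorized alternative (variable names are just illustrative) would keep all samples in a single tensor and let torch parallelize the elementwise operations:

# Rough sketch for point (1): vectorize instead of parallelizing with joblib
import torch

xs = torch.rand(10, requires_grad=True)   # all 10 samples in one tensor
ys = xs * xs                              # elementwise square, uses torch's intra-op parallelism
# d/dxi of sum(xi^2) is 2*xi, so one call recovers the per-sample gradients
grads = torch.autograd.grad(ys.sum(), xs, create_graph=True)[0]
print("Grads_vectorized", grads)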

Edit #2

The existence of torch.multiprocessing suggests that gradients require some special treatment; you could attempt to write a backend for joblib that uses torch.multiprocessing instead of multiprocessing or threading.
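
As a rough starting point, here is a sketch that uses torch.multiprocessing directly rather than a full joblib backend (the function name grad_of_square is just illustrative); as in the workaround above, the tensor and its graph are rebuilt inside each worker:

# Sketch: torch.multiprocessing used directly, graph built inside each worker
import torch
import torch.multiprocessing as mp

def grad_of_square(x_value):
    # Rebuild the tensor (and therefore the autograd graph) in the worker process
    x = torch.tensor(x_value, requires_grad=True)
    y = x * x
    return torch.autograd.grad(y, x)[0]

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        grads = pool.map(grad_of_square, [0.1 * i for i in range(10)])
    print("Grads_mp", grads)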

Here you can find an overview of how graphs are constructed in both frameworks:

https://www.tensorflow.org/guide/intro_to_graphs

https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

But I fear that, to give a definite answer as to why one works and the other does not, one would have to look into the implementations.

Bob
  • Thank you, this is closer to what I wanted, since threading does not provide a performance gain due to the GIL issue. But I don't understand what this is: Out[0](*Out[1])? Is it a function call? – Roy Apr 20 '22 at 22:06
  • Oh, I noticed you changed the yi definition. Could you also verify that the reason TensorFlow does not have this issue is that it copies the graph to each process? – Roy Apr 20 '22 at 22:14
  • TensorFlow's architecture is a little bit different: unless you are running in eager mode, the tensors are nothing more than nodes in the computation graph. – Bob Apr 21 '22 at 04:56
  • Could you please add a few lines (or references) to your answer explaining why this is an issue in PyTorch and why it doesn't occur in TensorFlow? I got some hints, and I think it would complete your answer so I could accept it. – Roy Apr 22 '22 at 02:19
  • Hi Roy, I tried to expand the answer a little bit more. – Bob Apr 25 '22 at 10:17