
There seems to be a problem mixing PyTorch's autograd with joblib. I need to compute gradients in parallel for a lot of samples. joblib works fine with other parts of PyTorch; however, when mixed with autograd it raises errors. I made a very small example which shows that the serial version works fine but the parallel version crashes.

from joblib import Parallel, delayed
import numpy as np
import torch
from torch import autograd
torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    return autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    yi = xi * xi
    xs += [xi]
    ys += [yi]


Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

The error message is not very helpful either:

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Roy

2 Answers


The problem is that Parallel uses "loky" as its default backend; you should use "threading" as the backend, and then your code will run as intended. Refer to the joblib documentation for the Parallel class.

So editing your provided code as follows:

from joblib import Parallel, delayed
import numpy as np
import torch
torch.autograd.set_detect_anomaly(True)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    return torch.autograd.grad(Out, X, create_graph=True, allow_unused=False)[0]

xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    yi = xi * xi
    xs += [xi]
    ys += [yi]


Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2, backend="threading")([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

will give the following results:

Grads_serial [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]
Grads_parallel [tensor(0.6083, grad_fn=<AddBackward0>), tensor(0.0944, grad_fn=<AddBackward0>), tensor(1.8791, grad_fn=<AddBackward0>), tensor(1.5986, grad_fn=<AddBackward0>), tensor(0.4832, grad_fn=<AddBackward0>), tensor(1.3194, grad_fn=<AddBackward0>), tensor(0.4672, grad_fn=<AddBackward0>), tensor(1.0045, grad_fn=<AddBackward0>), tensor(1.8631, grad_fn=<AddBackward0>), tensor(0.2853, grad_fn=<AddBackward0>)]

I hope this response is helpful for you; have a good day.

mak13
  • Thanks, but threading has no performance gain due to the GIL, right? – Roy Apr 20 '22 at 22:02
  • I quote from the documentation: "“threading” is a very low-overhead backend but it suffers from the Python Global Interpreter Lock if the called function relies a lot on Python objects. “threading” is mostly useful when the execution bottleneck is a compiled extension that explicitly releases the GIL (for instance a Cython loop wrapped in a “with nogil” block or an expensive call to a library such as NumPy)." I don't know for sure whether the PyTorch case is similar to the mentioned NumPy call; if it is, then using threading is useful. – mak13 Apr 24 '22 at 22:54

joblib does not copy the graph associated with the operations to the worker processes. One way to work around this is to perform the computation inside the job, so that the graph is created in the worker itself.

import torch
from torch import autograd
from joblib import Parallel, delayed
import numpy as np
torch.autograd.set_detect_anomaly(False)
tt = lambda x, grad=True: torch.tensor(x, requires_grad=grad)

def Grad(X, Out):
    # This will compute yi in the job, and thus will
    # create the graph here
    yi = Out[0](*Out[1])
    # now the differentiation works
    return autograd.grad(yi, X, create_graph=True, allow_unused=False)[0]

torch.set_num_threads(1)  # limit torch's intra-op threads
xs, ys = [], []
for i in range(10):
    xi = tt(np.random.rand()).float()    
    # yi is now a (function, args) pair: the multiplication is deferred until Grad runs in the worker
    yi = lambda xi: xi * xi, [xi]
    xs += [xi]
    ys += [yi]

Grads_serial = [Grad(x, y) for x, y in zip(xs, ys)]
print("Grads_serial", Grads_serial)
Grads_parallel = Parallel(n_jobs=2)([delayed(Grad)(x, y) for x, y in zip(xs, ys)])
print("Grads_parallel", Grads_parallel)

Edit

More philosophical questions are:

(1) Does it make sense to use joblib parallelism if you can simply vectorize your operations and let torch use its intra-operator parallelism? (A rough sketch of this option follows after point (2).)

(2) mak13 mentioned using the threading backend, and it is good that it fixes your example. But multiple threads will use only one CPU at a time; that makes sense for IO-bound jobs, like making HTTP requests, but not for CPU-bound operations.
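
For point (1), a rough sketch of the vectorized alternative (variable names are just illustrative) would keep all samples in a single tensor and let torch parallelize the elementwise operations:

# Rough sketch for point (1): vectorize instead of parallelizing with joblib
import torch

xs = torch.rand(10, requires_grad=True)   # all 10 samples in one tensor
ys = xs * xs                              # elementwise square, uses torch's intra-op parallelism
# d/dxi of sum(xi^2) is 2*xi, so one call recovers the per-sample gradients
grads = torch.autograd.grad(ys.sum(), xs, create_graph=True)[0]
print("Grads_vectorized", grads)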

Edit #2

The existence of torch.multiprocessing suggests that gradients require some special treatment; you could attempt to write a backend for joblib that uses torch.multiprocessing instead of multiprocessing or threading.
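
As a rough starting point, here is a sketch that uses torch.multiprocessing directly rather than a full joblib backend (the function name grad_of_square is just illustrative); as in the workaround above, the tensor and its graph are rebuilt inside each worker:

# Sketch: torch.multiprocessing used directly, graph built inside each worker
import torch
import torch.multiprocessing as mp

def grad_of_square(x_value):
    # Rebuild the tensor (and therefore the autograd graph) in the worker process
    x = torch.tensor(x_value, requires_grad=True)
    y = x * x
    return torch.autograd.grad(y, x)[0]

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)
    with mp.Pool(2) as pool:
        grads = pool.map(grad_of_square, [0.1 * i for i in range(10)])
    print("Grads_mp", grads)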

Here you can find an overview of how graphs are constructed in both frameworks:

https://www.tensorflow.org/guide/intro_to_graphs

https://pytorch.org/blog/computational-graphs-constructed-in-pytorch/

But I fear that, to give a definite answer as to why one works and the other does not, one would have to look into the implementations.

Bob
  • Thank you, this is closer to what I wanted, since threading does not provide a performance gain due to the GIL issue. But I don't understand what this is: Out[0](*Out[1])? Is it a function call? – Roy Apr 20 '22 at 22:06
  • Oh, I noticed you changed the yi definition. Could you also verify that the reason TensorFlow does not have this issue is that it copies the graph to each process? – Roy Apr 20 '22 at 22:14
  • TensorFlow's architecture is a little bit different: unless you are running in eager mode, the tensors are nothing more than nodes in the computation graph. – Bob Apr 21 '22 at 04:56
  • Could you please add a few lines (or references) to your answer explaining why this is an issue in PyTorch and why it doesn't occur in TensorFlow? I got some hints, and I think it would complete your answer so I could accept it. – Roy Apr 22 '22 at 02:19
  • Hi Roy, I tried to expand the answer a little bit more. – Bob Apr 25 '22 at 10:17