I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, clip each of the L gradients separately, average them together, and finally perform a (noisy) gradient descent step.
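To be concrete, the update I have in mind is the following (C is the clipping norm, sigma the noise multiplier; this is my reading of the standard DP-SGD formulation, written in LaTeX):
\bar{g}_i = g_i \cdot \min\left(1, \frac{C}{\lVert g_i \rVert_2}\right), \qquad
\tilde{g} = \frac{1}{L}\left( \sum_{i=1}^{L} \bar{g}_i + \mathcal{N}\left(0, \sigma^2 C^2 I\right) \right), \qquad
\theta \leftarrow \theta - \eta \, \tilde{g}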
What is the best way to do this in PyTorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L (per-example loss, e.g. reduction='none')
loss.backward()  # stores L distinct gradients in each param.grad, magically
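The closest thing I am aware of is the per-sample-gradient pattern built on torch.func; here is a minimal sketch, assuming PyTorch >= 2.0 and reusing the model / loss_func / x / y names from above:

from torch.func import functional_call, vmap, grad

params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def per_sample_loss(params, buffers, sample, target):
    # run the model on a single example by adding a batch dimension of 1
    out = functional_call(model, (params, buffers), (sample.unsqueeze(0),))
    return loss_func(out, target.unsqueeze(0)).squeeze()

# vmap over the batch dimension of x and y: each entry of per_sample_grads
# then has shape (L, *param.shape), i.e. one gradient per example
per_sample_grads = vmap(grad(per_sample_loss), in_dims=(None, None, 0, 0))(params, buffers, x, y)

Each row of those gradients could then be clipped and averaged by hand, but I don't know whether this is the intended approach.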
Failing that, I could compute each gradient separately and clip its norm before accumulating it. However, the following
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_size)
accumulates the i-th gradient into param.grad and then clips, rather than clipping each per-example gradient before it is accumulated. What's the best way to get around this issue?
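In case it clarifies what I mean by clipping before accumulating, here is the workaround I have in mind: zero the gradients on every iteration, clip the single example's gradient by hand, and add it into a separate accumulator. A rough sketch (the accum buffer and the 1e-6 fudge factor are my own; it assumes every parameter receives a gradient):

accum = [torch.zeros_like(p) for p in model.parameters()]
for i in range(loss.size(0)):
    model.zero_grad()
    loss[i].backward(retain_graph=True)
    # total norm of this single example's gradient across all parameters
    total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
    clip_coef = torch.clamp(clip_size / (total_norm + 1e-6), max=1.0)
    for acc, p in zip(accum, model.parameters()):
        acc.add_(p.grad * clip_coef)
# average the clipped per-example gradients (noise would also be added here)
# and write the result back so that optimizer.step() uses it
for p, acc in zip(model.parameters(), accum):
    p.grad = acc / loss.size(0)

This needs L separate backward passes per batch, which seems wasteful, so I'd prefer something closer to the vectorized version above if it exists.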