I'm trying to implement a version of differentially private stochastic gradient descent (e.g., this), which goes as follows:
Compute the gradient with respect to each point in the batch of size L, clip each of the L gradients separately, average them together, and finally perform a (noisy) gradient descent step.
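To be concrete, the update I have in mind is the following (C is the clipping norm, sigma the noise multiplier; this is my reading of the standard DP-SGD formulation, written in LaTeX):
\bar{g}_i = g_i \cdot \min\left(1, \frac{C}{\lVert g_i \rVert_2}\right), \qquad
\tilde{g} = \frac{1}{L}\left( \sum_{i=1}^{L} \bar{g}_i + \mathcal{N}\left(0, \sigma^2 C^2 I\right) \right), \qquad
\theta \leftarrow \theta - \eta \, \tilde{g}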
What is the best way to do this in PyTorch?
Preferably, there would be a way to simultaneously compute the gradients for each point in the batch:
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L (per-example loss, e.g. reduction='none')
loss.backward()  # stores L distinct gradients in each param.grad, magically
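The closest thing I am aware of is the per-sample-gradient pattern built on torch.func; here is a minimal sketch, assuming PyTorch >= 2.0 and reusing the model / loss_func / x / y names from above:

from torch.func import functional_call, vmap, grad

params = {k: v.detach() for k, v in model.named_parameters()}
buffers = {k: v.detach() for k, v in model.named_buffers()}

def per_sample_loss(params, buffers, sample, target):
    # run the model on a single example by adding a batch dimension of 1
    out = functional_call(model, (params, buffers), (sample.unsqueeze(0),))
    return loss_func(out, target.unsqueeze(0)).squeeze()

# vmap over the batch dimension of x and y: each entry of per_sample_grads
# then has shape (L, *param.shape), i.e. one gradient per example
per_sample_grads = vmap(grad(per_sample_loss), in_dims=(None, None, 0, 0))(params, buffers, x, y)

Each row of those gradients could then be clipped and averaged by hand, but I don't know whether this is the intended approach.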
Failing that, I could compute each gradient separately and clip its norm before accumulating it. However, the following
x  # inputs with batch size L
y  # true labels
y_output = model(x)
loss = loss_func(y_output, y)  # vector of length L
for i in range(loss.size(0)):
    loss[i].backward(retain_graph=True)
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_size)
accumulates the i-th gradient into param.grad and then clips, rather than clipping each per-example gradient before it is accumulated. What's the best way to get around this issue?
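In case it clarifies what I mean by clipping before accumulating, here is the workaround I have in mind: zero the gradients on every iteration, clip the single example's gradient by hand, and add it into a separate accumulator. A rough sketch (the accum buffer and the 1e-6 fudge factor are my own; it assumes every parameter receives a gradient):

accum = [torch.zeros_like(p) for p in model.parameters()]
for i in range(loss.size(0)):
    model.zero_grad()
    loss[i].backward(retain_graph=True)
    # total norm of this single example's gradient across all parameters
    total_norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
    clip_coef = torch.clamp(clip_size / (total_norm + 1e-6), max=1.0)
    for acc, p in zip(accum, model.parameters()):
        acc.add_(p.grad * clip_coef)
# average the clipped per-example gradients (noise would also be added here)
# and write the result back so that optimizer.step() uses it
for p, acc in zip(model.parameters(), accum):
    p.grad = acc / loss.size(0)

This needs L separate backward passes per batch, which seems wasteful, so I'd prefer something closer to the vectorized version above if it exists.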