I'm training an image classification model with PyTorch Lightning on a machine with more than one GPU, so I use the recommended distributed backend for best performance, ddp (DistributedDataParallel). This naturally splits up the dataset, so each GPU will only ever see one part of the data.
However, for validation, I would like to compute metrics like accuracy on the entire validation set and not just on a part. How would I do that? I found some hints in the official documentation, but they do not work as expected or are confusing to me. What's happening is that validation_epoch_end is called num_gpus times, with 1/num_gpus of the validation data each time. I would like to aggregate all results and run validation_epoch_end only once.
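For reference, this is roughly how I launch training (the gpus/distributed_backend argument names are from the Lightning version I'm on and may have changed in newer releases):

import pytorch_lightning as pl

# two GPUs with the DistributedDataParallel backend;
# the exact argument name may differ between Lightning versions
trainer = pl.Trainer(gpus=2, distributed_backend="ddp")
trainer.fit(model)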
In this section they state that when using dp/ddp2 you can add an additional function, like this:
def validation_step(self, batch, batch_idx):
    loss, x, y, y_hat = self.step(batch)
    return {"val_loss": loss, "y": y, "y_hat": y_hat}

def validation_step_end(self, *args, **kwargs):
    # do something here, I'm not sure what,
    # as it gets called in ddp directly after validation_step with the exact same values
    return args[0]
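For context, my validation_epoch_end currently looks roughly like this (simplified; torch is imported at module level, and accuracy is just a stand-in for the metrics I actually want over the whole set):

def validation_epoch_end(self, outputs):
    # `outputs` is the list of dicts returned by validation_step;
    # under ddp it only covers this GPU's shard of the validation set
    y = torch.cat([o["y"] for o in outputs])
    y_hat = torch.cat([o["y_hat"] for o in outputs])
    acc = (y_hat.argmax(dim=1) == y).float().mean()
    self.log("val_acc", acc)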
However, the results are not being aggregated and validation_epoch_end is still called num_gpus times. Is this kind of behavior not available for ddp? Is there some other way to achieve this aggregation behavior?
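Conceptually, what I'm after is something like the sketch below: gather every process's shard before computing the metric, so it is computed over the full validation set exactly once. This is only my attempt with raw torch.distributed.all_gather (which assumes every rank holds a tensor of identical shape, as DistributedSampler's padding should ensure); I don't know if this is the idiomatic Lightning way, hence the question:

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # stack this process's predictions and targets
    y_hat = torch.cat([o["y_hat"] for o in outputs])
    y = torch.cat([o["y"] for o in outputs])

    if dist.is_available() and dist.is_initialized():
        # collect each rank's shard so every process sees the full val set
        world_size = dist.get_world_size()
        gathered_y_hat = [torch.zeros_like(y_hat) for _ in range(world_size)]
        gathered_y = [torch.zeros_like(y) for _ in range(world_size)]
        dist.all_gather(gathered_y_hat, y_hat)
        dist.all_gather(gathered_y, y)
        y_hat = torch.cat(gathered_y_hat)
        y = torch.cat(gathered_y)

    acc = (y_hat.argmax(dim=1) == y).float().mean()
    self.log("val_acc", acc)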