7

I'm training an image classification model with PyTorch Lightning on a machine with more than one GPU, so I use the recommended distributed backend for best performance, ddp (DistributedDataParallel). This naturally splits up the dataset, so each GPU will only ever see one part of the data.

However, for validation, I would like to compute metrics like accuracy on the entire validation set and not just on a part of it. How would I do that? I found some hints in the official documentation, but they do not work as expected or are confusing to me. What's happening is that validation_epoch_end is called num_gpus times, with 1/num_gpus of the validation data each time. I would like to aggregate all results and only run validation_epoch_end once.

In this section of the documentation they state that when using dp/ddp2 you can add an additional hook, validation_step_end, like this:

def validation_step(self, batch, batch_idx):
    loss, x, y, y_hat = self.step(batch)
    return {"val_loss": loss, 'y': y, 'y_hat': y_hat}

def validation_step_end(self, *args, **kwargs):
    # do something here, I'm not sure what, 
    # as it gets called in ddp directly after validation_step with the exact same values
    return args[0]

However, the results are not being aggregated and validation_epoch_end is still called num_gpus times. Is this kind of behavior not available for ddp? Is there some other way to achieve this aggregation behavior?

Alexander Pacha

2 Answers

5

training_epoch_end() and validation_epoch_end() receive data that is aggregated from all training / validation batches of the particular process. They simply receive a list of what you returned in each training or validation step.

When using the DDP backend, there's a separate process running for every GPU. There's no simple way to access the data that another process is processing, but there's a mechanism for synchronizing a particular tensor between the processes.
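For illustration, here is a minimal sketch of that low-level mechanism using torch.distributed directly. It assumes each validation_step returns hypothetical "correct" and "total" count tensors; these key names are made up for this example:

import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # each DDP process only sees its own shard, so sum the per-process counts first
    correct = torch.stack([out["correct"] for out in outputs]).sum()
    total = torch.stack([out["total"] for out in outputs]).sum()
    # then reduce across all processes; DDP has already initialized the process group
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(correct, op=dist.ReduceOp.SUM)
        dist.all_reduce(total, op=dist.ReduceOp.SUM)
    self.log("val_acc", correct.float() / total)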

The easiest approach for computing a metric on the entire validation set is to calculate the metric in pieces and then synchronize the resulting tensor, for example by taking the average. self.log() calls will automatically synchronize the value between GPUs when you use sync_dist=True. How the value is synchronized is determined by the reduce_fx argument, which by default is torch.mean.
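As a concrete sketch (the loss and accuracy computation here is just illustrative, and it assumes import torch.nn.functional as F at module level):

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = F.cross_entropy(y_hat, y)
    acc = (y_hat.argmax(dim=1) == y).float().mean()
    # sync_dist=True reduces the logged values across all DDP processes,
    # using reduce_fx (torch.mean by default)
    self.log("val_loss", loss, sync_dist=True)
    self.log("val_acc", acc, sync_dist=True)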

If you're happy with averaging the metric over batches too, you don't need to override training_epoch_end() or validation_epoch_end(); self.log() will do the averaging for you.

If the metric cannot be calculated separately for each GPU and then averaged, it can get a bit more challenging. It's possible to update some state variables at each step, and then synchronize the state variables at the end of an epoch and calculate the metric. The recommended way is to create a class that derives from the Metric class from the TorchMetrics project. Add the state variables in the constructor using add_state() and override the update() and compute() methods. The API will take care of synchronizing the state variables between the GPU processes.
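A minimal sketch of such a metric, following the TorchMetrics API (the class name and state names are just for illustration):

import torch
from torchmetrics import Metric

class MyAccuracy(Metric):
    def __init__(self):
        super().__init__()
        # dist_reduce_fx tells TorchMetrics how to synchronize each state across processes
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        preds = preds.argmax(dim=1)
        self.correct += (preds == target).sum()
        self.total += target.numel()

    def compute(self):
        return self.correct.float() / self.total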

There's already an accuracy metric in TorchMetrics and the source code is a good example of how to use the API.
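For instance, a sketch of using the built-in metric inside a LightningModule (logging the metric object itself lets Lightning call compute() and reset it at the end of the epoch; the exact constructor arguments may differ between TorchMetrics versions):

import pytorch_lightning as pl
from torchmetrics import Accuracy

class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.val_accuracy = Accuracy()

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self(x)  # assumes forward() is defined elsewhere in the module
        self.val_accuracy(y_hat, y)  # updates the metric state for this process
        self.log("val_acc", self.val_accuracy, on_step=False, on_epoch=True)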

Seppo Enarvi

0

I think you are looking for training_step_end/validation_step_end.

...So, when Lightning calls any of the training_step, validation_step, test_step you will only be operating on one of those pieces. (...) For most metrics, this doesn’t really matter. However, if you want to add something to your computational graph (like softmax) using all batch parts you can use the training_step_end step.
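For reference, a rough sketch of what that could look like under dp/ddp2, where validation_step_end receives the batch parts from all GPUs of the node gathered back together (the computation and key names mirror the question and assume import torch.nn.functional as F at module level):

def validation_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    return {"y": y, "y_hat": y_hat}

def validation_step_end(self, outputs):
    # under dp/ddp2, `outputs` holds the gathered parts from all GPUs of this node
    loss = F.cross_entropy(outputs["y_hat"], outputs["y"])
    self.log("val_loss", loss)
    return loss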

5Ke

  • Yes, these functions [SHOULD](https://github.com/PyTorchLightning/pytorch-lightning/blob/404af43cde6696d04bb1899da2bb7e334e49716d/pytorch_lightning/accelerators/dp_accelerator.py#L140) do what I want, but as I stated in the question: I've tried these two functions and they are still called `num_gpu` times with the respective splits instead of the aggregated results. Maybe they forgot to [implement that](https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/accelerators/ddp_accelerator.py#L47)? – Alexander Pacha Nov 25 '20 at 15:46
  • Sorry, I read your question too quickly. I don't think you want to have this function with ddp, as this would go against what makes ddp so fast. By design (from the manual): "Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset." – 5Ke Nov 26 '20 at 07:59
  • 1.) I only want it for validation - there it does make sense, as stated in the documentation. I understand and accept the performance implications. 2.) This behavior is available for ddp2, so why wouldn't I want that in ddp? – Alexander Pacha Nov 26 '20 at 12:57
  • What would be the advantage of ddp with that kind of behaviour over using ddp2? – 5Ke Nov 26 '20 at 13:37
  • @AlexanderPacha did you solve this, I find the same behavior on test_step. I agree, we want this behavior in train_step, but I can't follow the docs to gather on test. They allude to it here, but not clear. https://github.com/PyTorchLightning/pytorch-lightning/issues/1166 – bw4sz Mar 11 '21 at 22:12
  • Check out https://github.com/PyTorchLightning/pytorch-lightning/issues/4853 – Alexander Pacha Mar 13 '21 at 16:20