
What is your question?

I am trying to implement a metric which needs access to the whole data. So instead of updating the metric in the *_step() methods, I am trying to collect the outputs in the *_epoch_end() methods. However, the outputs contain only the output for the partition of the data that each device gets. Basically, if there are n devices, then each device gets 1/n of the total outputs.

What's your environment?

OS: ubuntu
Packaging: conda
Version: 1.0.4
PyTorch: 1.6.0
pseudo_teetotaler
  • I am also facing a similar problem, but instead of a metric I am trying to return the predictions for the test dataset. It returns only chunks. They advertise an easy multi-GPU setup but it is nothing like that, at least for me. – mrb May 06 '21 at 17:12

2 Answers


See the pytorch-lightning manual. I think you are looking for training_step_end/validation_step_end (assuming you are using DP/DDP2).

...So, when Lightning calls any of the training_step, validation_step, test_step you will only be operating on one of those pieces. (...) For most metrics, this doesn’t really matter. However, if you want to add something to your computational graph (like softmax) using all batch parts you can use the training_step_end step.
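For illustration, here is a rough sketch of the pattern the manual describes under DP/DDP2, where the partial outputs returned from training_step on each GPU are gathered and handed to training_step_end. The module, its layer sizes, and the dict keys are made up for the example:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class LitModel(pl.LightningModule):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)                   # runs on this GPU's slice of the batch
        return {"logits": logits, "y": y}  # partial outputs from this device

    def training_step_end(self, batch_parts):
        # With DP/DDP2, Lightning gathers the per-GPU dicts before calling this,
        # so these tensors should cover the full batch.
        logits, y = batch_parts["logits"], batch_parts["y"]
        return F.cross_entropy(logits, y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```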

5Ke
  • Yep. That's exactly the reason I am implementing my logic in `training_step_end` method instead of `training_step`. However, I am getting only partial data in there. – pseudo_teetotaler Nov 25 '20 at 23:10
  • Are you maybe using ddp? According to the manual, with ddp, by design: "Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset." – 5Ke Nov 26 '20 at 08:01
  • That's for the `validation_step` or `training_step` methods. The `*_step_end` methods are supposed to be called after the outputs are aggregated from each machine. – pseudo_teetotaler Nov 26 '20 at 23:08

When using the DDP backend, there's a separate process running for every GPU. They don't have access to each other's data, but there are a few special operations (reduce, all_reduce, gather, all_gather) that make the processes synchronize. When you use such an operation on a tensor, the processes wait for each other to reach the same point and combine their values in some way, for example by taking the sum over every process.
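To make that concrete, here is a rough sketch (not specific to Lightning) of the two most relevant collectives, assuming a torch.distributed process group has already been initialized, which Lightning does for you under DDP:

```python
import torch
import torch.distributed as dist

# Each process starts with its own local value.
local_value = torch.tensor([float(dist.get_rank())], device="cuda")

# all_reduce: every process ends up with the same combined result (here the sum).
summed = local_value.clone()
dist.all_reduce(summed, op=dist.ReduceOp.SUM)

# all_gather: every process receives a copy of every other process's tensor.
gathered = [torch.zeros_like(local_value) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, local_value)
full_data = torch.cat(gathered)  # ordered by process rank
```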

In theory it's possible to gather all data from all processes and then calculate the metric in one process, but this is slow and prone to problems, so you want to minimize the data that you transfer. The easiest approach is to calculate the metric in pieces and then for example take the average. self.log() calls will do this automatically when you use sync_dist=True.
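For example, a minimal sketch (the module, its layer, and the metric name are placeholders) of logging a per-process value and letting Lightning average it across GPUs:

```python
import torch
import pytorch_lightning as pl


class LitClassifier(pl.LightningModule):  # hypothetical module for illustration
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 10)

    def forward(self, x):
        return self.layer(x)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        acc = (logits.argmax(dim=-1) == y).float().mean()  # accuracy on this GPU's shard
        # sync_dist=True reduces the logged value across processes
        # (mean by default) instead of reporting a per-GPU value.
        self.log("val_acc", acc, sync_dist=True)
```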

If you don't want to take the average over the GPU processes, it's also possible to update some state variables at each step, and after the epoch synchronize the state variables and calculate your metric from those values. The recommended way is to create a class that uses the Metrics API, which recently moved from PyTorch Lightning to the TorchMetrics project.
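As a sketch of that approach (the class name is made up, the pattern follows the TorchMetrics base class), an accuracy-style metric only needs two scalar states that are summed across processes:

```python
import torch
from torchmetrics import Metric


class MyAccuracy(Metric):  # illustrative name
    def __init__(self):
        super().__init__()
        # Scalar states; dist_reduce_fx="sum" adds them up across processes.
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds, target):
        self.correct += (preds == target).sum()
        self.total += target.numel()

    def compute(self):
        # Runs after the states have been synchronized, so this is the
        # metric over the whole dataset.
        return self.correct.float() / self.total
```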

If it's not enough to store a set of state variables, you can try to make your metric gather all data from all the processes. Derive your own metric from the Metric base class, overriding the update() and compute() methods. Use add_state("data", default=[], dist_reduce_fx="cat") to create a list where you collect the data that you need for calculating the metric. dist_reduce_fx="cat" will cause the data from different processes to be combined with torch.cat(). Internally it uses torch.distributed.all_gather. The tricky part here is that it assumes that all processes create identically-sized tensors. If the sizes don't match, syncing will hang indefinitely.
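A rough sketch of that gather-everything variant; the class name is a placeholder, and the final computation (a median absolute error, which genuinely needs all values) just stands in for whatever metric you need:

```python
import torch
from torchmetrics import Metric


class FullDataMetric(Metric):  # illustrative name
    def __init__(self):
        super().__init__()
        # List states; dist_reduce_fx="cat" concatenates them across processes
        # (via torch.distributed.all_gather, so the per-step tensor sizes must match).
        self.add_state("preds", default=[], dist_reduce_fx="cat")
        self.add_state("target", default=[], dist_reduce_fx="cat")

    def update(self, preds, target):
        self.preds.append(preds)
        self.target.append(target)

    def compute(self):
        # Before syncing the states are Python lists; after syncing they have
        # already been concatenated into single tensors.
        preds = torch.cat(self.preds) if isinstance(self.preds, list) else self.preds
        target = torch.cat(self.target) if isinstance(self.target, list) else self.target
        # Placeholder computation over the full data.
        return (preds - target).abs().median()
```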

Seppo Enarvi