I have an array of bin edges and need the sum of the values that fall inside each bin. My current code looks like this:

output = torch.zeros((16, 10))  # 10 is the number of bins

for l in range(10):
    # Sum the columns that lie between two consecutive bin edges
    output[:, l] = data[:, bin_edges[l]:bin_edges[l+1]].sum(axis=-1)

Is it possible to avoid loops and improve the performance?

alvas
  • I would be surprised to see it performed with PyTorch operators since this involves non-contiguous slices... so it is very unlikely. – Ivan May 03 '22 at 08:29

1 Answer

Vectorizing code usually means building a single large tensor and computing the result on it in one operation. Here, however, the bins can have different lengths, so the slices cannot be stacked directly into one tensor.
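To make the obstacle concrete, here is a minimal sketch. The `data` tensor and the `bin_edges` list are made-up values for illustration, not the asker's data. It slices one chunk per bin and shows that `torch.stack` rejects chunks of unequal width.

```python
import torch

# Hypothetical data: 2 rows, 6 columns
data = torch.arange(12.0).reshape(2, 6)
# Hypothetical edges giving bins of width 1, 3 and 2
bin_edges = [0, 1, 4, 6]

# One chunk per bin, cut along the last dimension
chunks = [data[:, s:e] for s, e in zip(bin_edges[:-1], bin_edges[1:])]
chunk_shapes = [tuple(c.shape) for c in chunks]

try:
    torch.stack(chunks)  # requires all tensors to have the same shape
    stacked_ok = True
except RuntimeError:
    stacked_ok = False

print(chunk_shapes, stacked_ok)
```

The chunks only become stackable once they share a common width, which is what padding provides.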

Variable-length data like this is common in time-series processing, though, so PyTorch provides utilities for it, such as torch.nn.utils.rnn.pad_sequence.
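As a small illustration of what `pad_sequence` does, the sketch below pads three 1-D tensors of different lengths (made-up values) and sums over the padded dimension. It assumes the default arguments, that is, `batch_first=False` and zero padding.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three "bins" of length 1, 3 and 2 (illustrative values only)
seqs = [torch.ones(1), torch.ones(3), torch.ones(2)]

# Result has shape (max_len, num_sequences); short sequences are zero-padded
padded = pad_sequence(seqs)

# Padding zeros do not contribute to a sum, so this yields one total per bin
sums = padded.sum(dim=0)
print(padded.shape, sums)
```

Zero padding is harmless for sums, but it would distort other reductions such as a mean or a minimum, so this trick is specific to summation.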

Using that utility I was able to speed the function up somewhat, but the gain depends on the data shape and on the number and length of the bins, and in some cases performance even gets worse.

Please note that pad_sequence pads along the first dimension of each tensor, whereas you take bins along the last dimension. The binned dimension therefore has to be moved to the front first, so the optimization would work better if you can organize your data that way from the start.
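The layout issue can be followed shape by shape. This is only a sketch under assumed inputs: the tensor size `(10, 20, 30)` and the `edges` list are illustrative choices.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

x = torch.rand(10, 20, 30)  # bins are cut from the last dimension (size 30)
edges = [0, 5, 12, 30]      # hypothetical edges: bin widths 5, 7 and 18

# Bring the binned dimension to the front, where pad_sequence expects it
xt = x.movedim(-1, 0)                      # shape (30, 10, 20)
xbins = [xt[s:e] for s, e in zip(edges[:-1], edges[1:])]

padded = pad_sequence(xbins)               # shape (18, 3, 10, 20)
out = padded.sum(dim=0).movedim(0, -1)     # shape (10, 20, 3)
print(xt.shape, padded.shape, out.shape)
```

If the data were stored with the binned dimension first, the two `movedim` calls would not be needed.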

Code

Implementations

from itertools import pairwise  # requires Python 3.10+
import random
import torch
from torch.nn.utils.rnn import pad_sequence


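def bins_sum(x, edges):
    """ Your function (generalized to any number of leading dims) """
    # Add the outer edges so the bins cover the whole last dimension
    edges = [0, *edges, x.shape[-1]]
    num_bins = len(edges) - 1
    # Match the input's dtype and device so the sums can be assigned directly
    output = torch.zeros(*x.shape[:-1], num_bins, dtype=x.dtype, device=x.device)

    for bin_idx, (start, end) in enumerate(pairwise(edges)):
        output[..., bin_idx] = x[..., start:end].sum(dim=-1)
    return output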


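def bins_sum_opti(x, edges):
    """ Trying to optimize using torch.nn.utils.rnn.pad_sequence """
    # pad_sequence pads along dim 0, so move the binned dimension to the front
    x = x.movedim(-1, 0)
    edges = [0, *edges, x.shape[0]]
    xbins = [x[start:end] for start, end in pairwise(edges)]
    # Zero-pad every bin to the longest one: shape (max_bin_len, num_bins, ...)
    xbins_padded = pad_sequence(xbins)
    # The zeros do not change the sums; move the bin dimension back to the end
    return xbins_padded.sum(dim=0).movedim(0, -1)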


def get_data_bin_edges(data_shape, num_edges):
    data = torch.rand(*data_shape)
    bin_edges = sorted(random.sample(range(3, data_shape[-1] - 3), k=num_edges))
    return data, bin_edges

Results

Assert that both functions are equivalent:

data, bin_edges = get_data_bin_edges(data_shape=(10, 20), num_edges=7)

res1 = bins_sum(data, bin_edges)
res2 = bins_sum_opti(data, bin_edges)

assert torch.allclose(res1, res2)

Time for different shapes and edges:

>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20), num_edges=3)
>>> %timeit bins_sum(data, bin_edges)
>>> %timeit bins_sum_opti(data, bin_edges)
35.8 µs ± 531 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
27.6 µs ± 546 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20), num_edges=7)
>>> %timeit bins_sum(data, bin_edges)
>>> %timeit bins_sum_opti(data, bin_edges)
67.4 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
41.1 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20, 30), num_edges=3)
>>> %timeit bins_sum(data, bin_edges)
>>> %timeit bins_sum_opti(data, bin_edges)
43 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
33 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20, 30), num_edges=7)
>>> %timeit bins_sum(data, bin_edges)
>>> %timeit bins_sum_opti(data, bin_edges)
90.5 µs ± 583 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
48.1 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
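The timings above rely on IPython's `%timeit` magic. Outside IPython, a rough equivalent can be sketched with the standard-library `timeit` module. The helper name `time_call` and its defaults are hypothetical, and it reports the best run rather than the mean and standard deviation that `%timeit` prints.

```python
import timeit
import torch


def time_call(fn, *args, repeat=5, number=100):
    """Best time per call in microseconds over `repeat` runs (hypothetical helper)."""
    timings = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(timings) / number * 1e6


# Illustrative use on a built-in reduction; any callable can be passed instead
data = torch.rand(10, 20)
us = time_call(torch.sum, data)
print(f"{us:.1f} µs per call")
```

For tensors on a GPU, CUDA kernels run asynchronously, so the timed callable would also need to call `torch.cuda.synchronize()` for the numbers to be meaningful.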
paime