Normally, to optimize code by vectorization, you want to build a single big tensor and compute the result on it in a single operation.
But here your bins may have different lengths, so you cannot stack them into one tensor directly.
This is a common situation in time-series processing, though, so PyTorch provides utilities to work around it, such as `torch.nn.utils.rnn.pad_sequence`.
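To illustrate the problem and what `pad_sequence` does about it, here is a small sketch with made-up toy tensors. `torch.stack` refuses bins of unequal length. `pad_sequence` instead pads every sequence along its first dimension to the longest length and stacks them along a new dimension:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Two toy "bins" of different lengths along dim 0, each with 3 features
short_bin = torch.ones(2, 3)
long_bin = torch.ones(4, 3)

# torch.stack needs equal shapes, so bins of different lengths cannot be stacked
try:
    torch.stack([short_bin, long_bin])
    stack_failed = False
except RuntimeError:
    stack_failed = True

# Default batch_first=False: result has shape (max_len, num_sequences, *rest)
padded = pad_sequence([short_bin, long_bin])
print(padded.shape)  # torch.Size([4, 2, 3])

# Missing rows of the shorter bin are filled with padding_value (0.0 by default)
print(padded[:, 0, 0].tolist())  # [1.0, 1.0, 0.0, 0.0]
```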
Using that utility I was able to speed up the function somewhat, but the gain depends on the data shape and on the number and length of the bins, and in some cases performance even gets worse.
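If the padding overhead dominates, one padding-free alternative (a different technique from the one in this answer, and not benchmarked here) is to compute a bin index for every position with `torch.bucketize` and accumulate with `Tensor.index_add_`. A minimal sketch, assuming the same `edges` convention (sorted interior edges, bins taken along the last dimension); the name `bins_sum_index_add` is hypothetical:

```python
import torch


def bins_sum_index_add(x, edges):
    """Hypothetical padding-free variant: add every element into its bin."""
    n = x.shape[-1]
    boundaries = torch.tensor(edges, dtype=torch.long, device=x.device)
    positions = torch.arange(n, device=x.device)
    # right=True returns, for each position, the number of edges <= position,
    # i.e. the index of the bin [edges[i-1], edges[i]) that contains it
    bin_ids = torch.bucketize(positions, boundaries, right=True)
    num_bins = len(edges) + 1
    output = torch.zeros(*x.shape[:-1], num_bins, dtype=x.dtype, device=x.device)
    # Add x[..., j] into output[..., bin_ids[j]] for every j in a single call
    output.index_add_(x.dim() - 1, bin_ids, x)
    return output


x = torch.arange(10.0)
print(bins_sum_index_add(x, [3, 7]).tolist())  # [3.0, 18.0, 24.0]
```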
Please note that `pad_sequence` assumes the sequences run along the first dimension of your data, whereas you make bins along the last dimension, so the optimization would likely work better if you can reorganize your data accordingly.
Code
Implementations
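```python
from itertools import pairwise  # Python 3.10+
import random

import torch
from torch.nn.utils.rnn import pad_sequence


def bins_sum(x, edges):
    """ Your function (generalized a bit) """
    # Add the outer boundaries so every element falls into exactly one bin
    edges = [0, *edges, x.shape[-1]]
    num_bins = len(edges) - 1
    # Match the input's dtype and device so the assignment below never casts
    output = torch.zeros(*x.shape[:-1], num_bins, dtype=x.dtype, device=x.device)
    # One slice-and-sum per bin: this Python loop is what we want to get rid of
    for bin_idx, (start, end) in enumerate(pairwise(edges)):
        output[..., bin_idx] = x[..., start:end].sum(dim=-1)
    return output


def bins_sum_opti(x, edges):
    """ Trying to optimize using torch.nn.utils.rnn """
    # pad_sequence stacks along dim 0, so move the binned (last) dim to the front
    x = x.movedim(-1, 0)
    edges = [0, *edges, x.shape[0]]
    xbins = [x[start:end] for start, end in pairwise(edges)]
    # Shape (max_bin_len, num_bins, *rest); shorter bins are zero-padded,
    # which does not change their sums
    xbins_padded = pad_sequence(xbins)
    return xbins_padded.sum(dim=0).movedim(0, -1)


def get_data_bin_edges(data_shape, num_edges):
    data = torch.rand(*data_shape)
    # Distinct, sorted interior edges, so every bin is non-empty
    bin_edges = sorted(random.sample(range(3, data_shape[-1] - 3), k=num_edges))
    return data, bin_edges
```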
Results
Assert that both functions are equivalent:
```python
data, bin_edges = get_data_bin_edges(data_shape=(10, 20), num_edges=7)
res1 = bins_sum(data, bin_edges)
res2 = bins_sum_opti(data, bin_edges)
assert torch.allclose(res1, res2)
```
Timings for different data shapes and numbers of edges:
```
>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20), num_edges=3)
>>> %timeit bins_sum(data, bin_edges)
35.8 µs ± 531 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit bins_sum_opti(data, bin_edges)
27.6 µs ± 546 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20), num_edges=7)
>>> %timeit bins_sum(data, bin_edges)
67.4 µs ± 1.12 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit bins_sum_opti(data, bin_edges)
41.1 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20, 30), num_edges=3)
>>> %timeit bins_sum(data, bin_edges)
43 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit bins_sum_opti(data, bin_edges)
33 µs ± 314 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

>>> data, bin_edges = get_data_bin_edges(data_shape=(10, 20, 30), num_edges=7)
>>> %timeit bins_sum(data, bin_edges)
90.5 µs ± 583 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit bins_sum_opti(data, bin_edges)
48.1 µs ± 134 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
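`%timeit` is only available in IPython/Jupyter. Outside a notebook, `torch.utils.benchmark.Timer` can take comparable measurements; below is a minimal sketch with a hypothetical helper `time_per_call`. Note that `Timer` runs with a single thread by default, so its numbers may not match `%timeit` exactly.

```python
import torch
import torch.utils.benchmark as benchmark


def time_per_call(fn, *args, number=1000):
    """Hypothetical helper: mean seconds per call of fn(*args)."""
    timer = benchmark.Timer(stmt="fn(*args)", globals={"fn": fn, "args": args})
    # timeit(number) runs the statement `number` times and returns a Measurement
    return timer.timeit(number).mean


# Quick sanity run on a built-in op
print(f"torch.sum: {time_per_call(torch.sum, torch.rand(1000)) * 1e6:.2f} µs")

# Usage, assuming bins_sum, bins_sum_opti and get_data_bin_edges from the code above:
# data, bin_edges = get_data_bin_edges(data_shape=(10, 20, 30), num_edges=7)
# for fn in (bins_sum, bins_sum_opti):
#     print(fn.__name__, f"{time_per_call(fn, data, bin_edges) * 1e6:.1f} µs")
```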