
I know that usually the batch dimension is axis zero, and I imagine this has a reason: The underlying memory for each item in the batch is contiguous.

My model calls a function that becomes simpler if I have another dimension in the first axis, so that I can use x[k] instead of x[:, k].

Results of arithmetic operations seem to keep the same memory layout:

import torch

x = torch.ones(2, 3, 4).transpose(0, 1)
y = torch.ones_like(x)
u = x + 1
v = x + y
print(x.stride(), u.stride(), v.stride())
# prints (4, 12, 1) (4, 12, 1) (4, 12, 1): the results keep x's transposed layout

When I create additional tensors I create them with torch.zeros and then transpose, so that the largest stride goes to axis 1 as well.

e.g.

a, b, c = torch.zeros(
    (3, x.shape[1], ADDITIONAL_DIM, x.shape[0]) + x.shape[2:]
).transpose(1, 2)

This will create three tensors with the same batch size x.shape[1]. In terms of memory locality, would it make any difference to have

a, b, c = torch.zeros(
    (x.shape[1], 3, ADDITIONAL_DIM, x.shape[0]) + x.shape[2:]
).permute(1, 2, 0, ...)  # '...' standing for the remaining axes, in order

instead.
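
For concreteness, here is a small sketch of the two constructions side by side (ADDITIONAL_DIM is given a placeholder value, and the '...' in the permute above is written out explicitly for this number of dimensions), so the resulting strides can be inspected directly:

import torch

ADDITIONAL_DIM = 5                        # placeholder value, just for illustration
x = torch.ones(2, 3, 4).transpose(0, 1)   # shape (3, 2, 4); the batch lives on axis 1

# Variant 1: leading 3, then swap axes 1 and 2
a1, b1, c1 = torch.zeros(
    (3, x.shape[1], ADDITIONAL_DIM, x.shape[0]) + x.shape[2:]
).transpose(1, 2)

# Variant 2: batch first, then bring the leading 3 and ADDITIONAL_DIM forward
# (permute written out for 5-d tensors instead of the '...' shorthand)
a2, b2, c2 = torch.zeros(
    (x.shape[1], 3, ADDITIONAL_DIM, x.shape[0]) + x.shape[2:]
).permute(1, 2, 0, 3, 4)

# Same shapes, but different strides, i.e. different placement in the underlying buffer
print(a1.shape, a1.stride())
print(a2.shape, a2.stride())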

Should I care about this at all?

  • So your question is whether there is any difference between doing a permutation then working with `x[k]`; and working with `x[:, k]` without needing a permutation right? – Ivan Aug 25 '21 at 08:11
  • Exactly, and also for the internally constructed arrays – Bob Aug 25 '21 at 08:23

1 Answer

TL;DR: Slices may seem to contain less information... but in fact they share the same storage buffer with the original tensor. Since permute doesn't affect the underlying memory layout, both approaches are essentially equivalent.


The two approaches are essentially the same: the underlying data storage buffer is kept the same, and only the metadata, i.e. how you interact with that buffer (shape and strides), changes.

Let us look at a simple example:

>>> x = torch.ones(2,3,4).transpose(0,1)
>>> x_ptr = x.data_ptr()

>>> x.shape, x.stride(), x_ptr
(3, 2, 4), (4, 12, 1), 94674451667072

We have kept the data pointer for our 'base' tensor in x_ptr:

  1. Slicing on the second axis:

    >>> y = x[:, 0]
    
    >>> y.shape, y.stride(), x_ptr == y.data_ptr()
    (3, 4), (4, 1), True
    

    As you can see, x and x[:, k] share the same storage.

  2. Permuting the first two axes then slicing on the first one:

    >>> z = x.permute(1, 0, 2)[0]
    
    >>> z.shape, z.stride(), x_ptr == z.data_ptr()
    (3, 4), (4, 1), True
    

    Here again, you can see that x.data_ptr() is the same as z.data_ptr().
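
In fact, for k = 0 the two slices above are not just backed by the same buffer: y and z are identical views, with the same shape, strides, and data pointer:

>>> y.shape == z.shape, y.stride() == z.stride(), y.data_ptr() == z.data_ptr()
(True, True, True)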


In fact, you can even go from y to x's representation using torch.as_strided:

>>> torch.as_strided(y, size=x.shape, stride=x.stride())
tensor([[[1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.]],

        [[1., 1., 1., 1.],
         [1., 1., 1., 1.]]])

Same with z:

>>> torch.as_strided(z, size=x.shape, stride=x.stride())

Both return the content of x because torch.as_strided creates a view over the existing storage, reinterpreting the same buffer with new metadata rather than allocating memory for a new tensor. These two lines were just to illustrate that we can still 'get back' to x from a slice of x: we can recover the apparent content by changing the tensor's metadata.
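
For instance, reusing x, x_ptr, and y from above, you can verify that the as_strided result still points at the very same buffer:

>>> w = torch.as_strided(y, size=x.shape, stride=x.stride())
>>> w.data_ptr() == x_ptr, torch.equal(w, x)
(True, True)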

  • Yes, I know all of that; my question is about performance. I am especially interested in how PyTorch will parallelize the operations and in the performance of the memory transfers with different memory layouts. – Bob Aug 25 '21 at 09:51
  • Well, I know this at a surface level; I want to know about performance in distributed computing, e.g. on a GPU, without having to write the two implementations and benchmark them. – Bob Aug 25 '21 at 10:05
  • With either method, the operations performed will use the same memory storage layout, so yes, the performance is the same. – Ivan Aug 25 '21 at 10:14
  • And which is better: to allocate and assign using slices, `a[:] = x + y`, or to use `a = (x + y).type(a.dtype)`? – Bob Aug 25 '21 at 10:31
  • Why not just `a = x + y`? Anyhow both are identical. – Ivan Aug 26 '21 at 10:36
  • They are not identical. This does not answer my question; I will accept this answer as is for lack of a better answer and in consideration of your effort. When I have time I will run some tests on my own. Thanks. – Bob Aug 26 '21 at 12:38
  • Why are you saying that? `x[:] = a` will make a copy of `a`, yes. However, `x[:] = x + y` won't, because `x + y` is a temporary variable. – Ivan Aug 26 '21 at 12:54
  • Because `a[:] = x + y` does not create a new variable and does not discard the reference to an existing tensor. If you have some high-level material about how temporary variables are represented, I would like to see more about it as well. Sometimes I don't know exactly whether an operation will produce a physical tensor or some sort of view that can exist without being stored; also, if the computation is deferred, it must keep a reference to the objects it depends on, hindering the GC's work. – Bob Aug 26 '21 at 15:24
  • I don't quite understand your first point: *does not create a new variable* which variable do you refer to here? – Ivan Aug 26 '21 at 15:34
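
A minimal sketch of the difference discussed in the last few comments, using freshly created example tensors: slice assignment writes the result into a's existing buffer, while plain assignment rebinds the name a to a newly allocated tensor.

import torch

x = torch.ones(3, 2, 4)
y = torch.ones_like(x)
a = torch.zeros_like(x)
ptr = a.data_ptr()

a[:] = x + y                   # x + y is materialized as a temporary, then copied into a's buffer
print(a.data_ptr() == ptr)     # True: a keeps its original storage

a = x + y                      # the name a is rebound to the freshly allocated result tensor
print(a.data_ptr() == ptr)     # False: a now refers to a different buffer

In eager PyTorch the expression x + y is evaluated immediately into a new tensor in both cases; the two assignments only differ in whether that result is copied into a's storage or becomes the object that a refers to.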