57

Letting them compete three times (a million pops/dels each time):

from timeit import timeit

for _ in range(3):
    # timeit's default number is 1000000, so each call does a million operations
    t1 = timeit('b.pop(0)', 'b = bytearray(1000000)')
    t2 = timeit('del b[0]', 'b = bytearray(1000000)')
    print(t1 / t2)

Time ratios (Try it online!):

274.6037053753368
219.38099365582403
252.08691226683823

Why is pop that much slower at doing the same thing?

Kelly Bundy
    There's at least an assignment involved in `.pop()` but the difference is quite extraordinary – roganjosh Feb 05 '23 at 18:07
    Probably has to do with the fact that `pop()` has to return the value, hence it has to do reference counts and also [converting the return value](https://github.com/python/cpython/blob/cf89c16486a4cc297413e17d32082ec4f389d725/Objects/bytearrayobject.c#L1827). And Python also takes a lot of time in attribute access(`b.pop` --> find the attribute, and then calling it is expensive but can be improved a bit by doing `pop = b.pop`) and it doesn't have to worry about all that with `del` being a single bytecode instruction. – Ashwini Chaudhary Feb 05 '23 at 18:35
    @AshwiniChaudhary No, those things don't make that much of a difference. Even `pop(0)` vs `b[0]; del b[0]; b.pop` was still [over 100](https://tio.run/##dYxBDgIhDEX3c4rugMQozGyME09ijIHIKIlA03TD6RGFhRv/pv3ty8PCz5yWI1KtG@UIHKIPDCFiJh5tmrZMcIOQgGx6eLmo0wQtbOA8GCkwo9RK7EC4dnWFvSWyRRr9jVqhEZ/Pvk2humD@EbiLvq5w9y/oWwf/@oYCKSSWbA48q1rf). – Kelly Bundy Feb 05 '23 at 18:52
  • this probably goes without saying but if performance is a concern you should either reverse the array order or get a deque – Tornado547 Feb 07 '23 at 03:46
  • @Tornado547 No. Reversing makes it unnatural, confusing, inconvenient, and takes extra time. And `collections.deque` (if that's what you mean) is far **less** performant at jobs for bytearrays and is lacking features like substring search. – Kelly Bundy Feb 07 '23 at 03:58
  • I'm shocked you can do either of these on a `bytearray`. You'd have thought that's fixed-length... – user541686 Feb 07 '23 at 06:12
  • This is only for bytearray. And those time ratios quoted are for 3.8. What are the numbers for 3.9, 3.10, 3.11? – smci Mar 06 '23 at 01:52
  • @smci Yes, only for bytearray. Not sure why you're saying that. The title already says "for bytearray". I had also checked 3.10 and got similar ratios, and I knew the reason for the speed difference and thus had no doubt that others would see similar ratios on all recent versions. I just chose to link to TIO because it caches the results. – Kelly Bundy Mar 06 '23 at 10:01
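The control experiment described in the comments above (doing the indexing, deletion, and attribute lookup as separate steps, so only the value-returning machinery of `pop` is missing) can be sketched roughly like this; sizes are scaled down so it runs quickly, and exact ratios are machine-dependent:

```python
from timeit import timeit

# Rough sketch of the control experiment from the comments: index,
# delete, and look up the attribute separately, so only pop's
# "return the removed value" machinery differs from plain del.
setup = 'b = bytearray(100000)'
t_pop = timeit('b.pop(0)', setup, number=100000)
t_ctrl = timeit('b[0]; del b[0]; b.pop', setup, number=100000)
print(t_pop / t_ctrl)  # pop(0) remains much slower
```

This supports the point that reference counting and attribute access are not what dominate the difference.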

2 Answers

82

When you run `b.pop(0)`, Python moves all the remaining elements back by one, as you might expect. This takes O(n) time.

When you run `del b[0]`, Python simply advances the start pointer of the object by 1.

In both cases, `PyByteArray_Resize` is called to adjust the size. When the new size is smaller than half the allocated size, the allocated memory is shrunk. In the `del b[0]` case, this shrink is the only point where the data is copied. As a result, this case takes amortized O(1) time.
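The two complexities can be seen directly by scaling the array: the per-call cost of `b.pop(0)` grows with n, while `del b[0]` stays flat. A small sketch (sizes reduced so it runs quickly; absolute numbers vary by machine):

```python
from timeit import timeit

# Empty the whole array each time: n pops vs n dels. The pop loop does
# O(n^2) total work, the del loop O(n), so the ratio grows with n.
ratios = []
for n in (1_000, 10_000, 100_000):
    t_pop = timeit('b.pop(0)', f'b = bytearray({n})', number=n)
    t_del = timeit('del b[0]', f'b = bytearray({n})', number=n)
    ratios.append(t_pop / t_del)
    print(n, round(ratios[-1], 1))
```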

Relevant code:

The `bytearray_pop_impl` function always calls

memmove(buf + index, buf + index + 1, n - index);

The `bytearray_setslice_linear` function is called for `del b[0]` with `lo == 0`, `hi == 1`, `bytes_len == 0`. It reaches this code (with `growth == -1`):

if (lo == 0) {
    /* Shrink the buffer by advancing its logical start */
    self->ob_start -= growth;
    /*
      0   lo               hi             old_size
      |   |<----avail----->|<-----tail------>|
      |      |<-bytes_len->|<-----tail------>|
      0    new_lo         new_hi          new_size
    */
}
else {
    /*
      0   lo               hi               old_size
      |   |<----avail----->|<-----tomove------>|
      |   |<-bytes_len->|<-----tomove------>|
      0   lo         new_hi              new_size
    */
    memmove(buf + lo + bytes_len, buf + hi,
            Py_SIZE(self) - hi);
}
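For illustration, here is a hypothetical pure-Python model of the same trick (`FrontTrimBuffer` and its compaction threshold are my own invention, not CPython code): advance a logical start offset on each front-deletion, and only copy data once at least half the allocation is dead space, loosely mirroring the shrink threshold described above.

```python
# Hypothetical sketch: front-deletion via a logical start offset
# instead of shifting the tail on every removal.
class FrontTrimBuffer:
    def __init__(self, data):
        self._buf = bytearray(data)
        self._start = 0          # logical start, like ob_start in CPython

    def popleft(self):
        b = self._buf[self._start]
        self._start += 1         # O(1): no memmove of the tail
        # Compact once at least half the allocation is dead space,
        # loosely mirroring the shrink threshold.
        if self._start * 2 >= len(self._buf):
            del self._buf[:self._start]
            self._start = 0
        return b

    def __len__(self):
        return len(self._buf) - self._start

buf = FrontTrimBuffer(b'hello')
print(buf.popleft(), len(buf))  # 104 4  (ord('h') == 104)
```

Each surviving byte is copied at most a constant number of times across all compactions, which is where the amortized O(1) per deletion comes from.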
Michael M.
interjay
    Good answer. Do you know why developers did not choose to increment the pointer also for `b.pop(0)`? – Jérôme Richard Feb 05 '23 at 20:53
  • @JérômeRichard Probably an oversight, I don't see a reason not to apply this optimization there as well. – interjay Feb 06 '23 at 00:49
    @JérômeRichard I've seen it many times. They don't want to optimize uncommon cases. Both to avoid slowing down *common* cases even slightly (might be a net loss overall!), and to avoid having extra code that needs to be written, tested, reviewed, maintained, read, may introduce errors, and may complicate things. See the original [discussion of the slicing optimization](https://github.com/python/cpython/issues/63287) for a perfect example of all that. – Kelly Bundy Feb 06 '23 at 07:18
    Someone had use cases for the *slicing* optimization (FIFO buffers), and it was somewhat of a fight. Amusingly at some point they spoke of "popping" but weren't talking about pop() but about removing a slice. They never spoke about popping single bytes, and I do believe that *is* really uncommon for bytearrays. Even from the end. Especially when the array is as huge as mine. And my speed ratio is also much smaller when the array is much smaller. – Kelly Bundy Feb 06 '23 at 07:18
    So I suspect nobody ever even *desired* bytearray's `pop(0)` to be optimized. I didn't, either, I have no real use for it, I only stumbled upon this when looking for something else. And if I *did* have a use for it, I could most likely use `del` instead and would just do that. – Kelly Bundy Feb 06 '23 at 07:30
  • @KellyBundy Here's my *guess* at the motivating logic. When you use `del` you're explicitly saying to make the array smaller, so doing it with a pointer offset is perfectly safe. OTOH, when using `.pop`, you're most likely using the array as a queue, so you probably don't want to shrink the array on every item you pop. Of course, that logic doesn't work so well when the array is huge, due to the large cost of moving subsequent items each time you pop. – PM 2Ring Feb 06 '23 at 09:28
    @PM2Ring Hmm, I don't think so. We *know* why *slice* deletion got implemented, that's in the linked original discussion. I *suspect* (didn't check) that `del` with single index got optimized as a side effect. And optimizing `pop` was never even mentioned, at least in that discussion. – Kelly Bundy Feb 06 '23 at 09:46
    @PM2Ring [The diff](https://github.com/python/cpython/commit/5df8a8a1fd6cc6f4469dc7d3994d06e2aea24c52), in case someone wants to check whether anything was done explicitly to optimize simple Index deletion. – Kelly Bundy Feb 06 '23 at 09:48
    @PM2Ring For simple index deletion, "[Fall through to slice assignment](https://github.com/python/cpython/blame/79903240480429a6e545177416a7b782b0e5b9bd/Objects/bytearrayobject.c#L620-L626)" predates the slice optimization by years, so indeed it was optimized as side effect. – Kelly Bundy Feb 07 '23 at 10:21
  • It might also be worth mentioning that the element-moving approach can turn out to be a net gain in some situations, as it prevents fragmentation into memory regions with unrecovered "holes", thus preventing an overall deterioration of CPU cache locality. – Will Feb 07 '23 at 11:03
  • @Will Not sure what you mean. Does overallocation at the end lead to less fragmentation than overallocation at the start? How so? Note the optimization discussion briefly brought up fragmentation, and they said it can **reduce** it (but I didn't quite understand it). – Kelly Bundy Feb 07 '23 at 11:25
  • @KellyBundy surely `x.pop(0)` is way more common than `del x[0]` ? – theonlygusti Feb 07 '23 at 15:15
  • @theonlygusti I don't know. For `deque` and probably `list` I would agree, but for `bytearray`? I have no idea. With `bytearray`, I don't think I've ever used either of them before, or ever seen anyone else use them. Have you? Under the question, a 200k reputation user even commented they're "shocked" you *can* do either of these on a `bytearray`. The point really isn't which one of these is *more* commonly used, but that **both** of them are very rarely if ever used at all. And so **neither** of them were considered for optimization (as far as I know). – Kelly Bundy Feb 07 '23 at 16:54
  • I got confused reading your answer. The individual move following `pop()` is O(n) and the individual resize after `del` is amortized O(1). However, in the problem, using `pop()` there are n-1 moves and log_2(n) resizes. So the actual memory shuffle costs of using `b.pop(0)` and `del b[0]` are O(n*n) and O(log(n)), respectively. I was missing this information while reading. – user1129682 Feb 07 '23 at 17:17
  • @user1129682 No, for `del` the total shuffle costs are O(n). The amortized O(1) isn't per resize but per `del`. – Kelly Bundy Feb 07 '23 at 17:24
  • @KellyBundy You are correct I didn't count increasing the pointer, only the resize. Thanks for clarifying – user1129682 Feb 07 '23 at 17:29
  • @user1129682 Actually I meant the memory moves during the resizes. In my test with a million initial elements and a million dels, after deleting 500000 elements, the remaining 500000 get moved (unless the system cheats, I guess, fake-reallocating in place). After the next 250000 dels, the remaining 250000 get moved. Etc. The sum is (almost) a million. – Kelly Bundy Feb 07 '23 at 17:45
  • @user1129682 And the answer also meant the resize costs there. You can tell by the "amortized" (and by the explanation leading to it). The counter increase during each `del`, that's O(1). Not just amortized O(1). – Kelly Bundy Feb 07 '23 at 18:21
  • @KellyBundy Fragmentation occurs when arrays are not aligned along the page (chunk / bucket / cache line) sizes corresponding to the allocator's granularity. By only moving the array boundary left/right on one side of the array, one can assure that the other side remains aligned. In most scenarios this is only a minor consideration that doesn't outweigh the operational overhead, but in the case of many small dynamic arrays it can be worth avoiding moving the start pointer to unaligned offsets. – Will Feb 08 '23 at 00:07
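The arithmetic from the comment thread above about resize costs can be checked directly: with repeated halving, the bytes copied by the shrink-reallocations form a geometric series summing to just under n, which is why the `del` loop stays O(n) overall.

```python
# Sum the bytes copied by successive halving shrinks of an n-byte
# array: each shrink copies the surviving half, and the halves form
# a geometric series that totals (almost) n.
n = 1_000_000
moved = 0
remaining = n
while remaining > 1:
    remaining //= 2
    moved += remaining   # a shrink copies the surviving portion
print(moved)  # 999993, just under n
```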
27

I have to admit, I was very surprised by the timings myself. After convincing myself that they were in fact correct, I took a dive into the CPython source code, and I think I found the answer: CPython optimizes `del bytearr[0:x]` by just incrementing the pointer to the start of the array:

    if (growth < 0) {
        if (!_canresize(self))
            return -1;

        if (lo == 0) {
            /* Shrink the buffer by advancing its logical start */
            self->ob_start -= growth;

You can find the `del bytearray[...]` logic here (implemented via `bytearray_setslice`, with `values` being NULL), which in turn calls `bytearray_setslice_linear`, which contains the above optimization.

For comparison, `bytearray.pop` does NOT implement this optimization; see here in the source code.
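Note that the `lo == 0` branch quoted above is exactly what makes front deletion cheap; deleting at any other position still falls into the `memmove` branch and scales with the array size. A quick sketch (scaled down; timings are machine-dependent):

```python
from timeit import timeit

# del b[0] hits the pointer-advance fast path; del b[1] must memmove
# the entire tail on every call, so it behaves like pop(0).
setup = 'b = bytearray(100000)'
t_front = timeit('del b[0]', setup, number=10000)
t_mid = timeit('del b[1]', setup, number=10000)
print(t_mid / t_front)  # mid-deletion is far slower
```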

Dillon Davis