4

(Extracted from another question.) Removing this set's 200,000 elements one by one like this takes 30 seconds (Attempt This Online!):

s = set(range(200000))
while s:
    for x in s:
        s.remove(x)
        break

Why is that so slow? Removing a set element is supposed to be fast.

Kelly Bundy
  • How much slower is it than just looping through the set without doing anything? – Barmar Apr 27 '23 at 17:46
  • what's strange to me is that this isn't raising an error on the first iteration for trying to change the size of a set while iterating – juanpa.arrivillaga Apr 27 '23 at 17:46
  • @Barmar significantly slower. – juanpa.arrivillaga Apr 27 '23 at 17:46
  • @juanpa.arrivillaga Well there's the `break`, so no continued iteration. – Kelly Bundy Apr 27 '23 at 17:47
  • You don't need the `while` loop. Just `for x in s: s.remove(x)` – Barmar Apr 27 '23 at 17:48
  • @KellyBundy so the error isn't raised until the next `__next__` or something? – juanpa.arrivillaga Apr 27 '23 at 17:48
  • @Barmar that will raise an error though – juanpa.arrivillaga Apr 27 '23 at 17:48
  • @Barmar About 1000x slower. Pure iteration takes ~6 ms. [Demo](https://ato.pxeger.com/run?1=bY87DsJADET7PYXL3SIIOoSUs6AFeckK9iPbIeEsNGngTrkN-RAKYMrxvLHm_sw3qVLsukctrtj2vaMUQHxA8CEnEshIbn9MdRQksDzdlGIogVE02XhCvVmPMmbwxQ5MOaW0US4RtOAj8E7BoGyZVSYfRc8JKGBCftGm8hdcuO-eUbwiDOmKujUf70Boz_8fzAPfO5e9Lw). – Kelly Bundy Apr 27 '23 at 17:50
  • I think the slowness is from `for x in s:`. If you do `for i in range(100000): s.remove(i)` it's fast. – Barmar Apr 27 '23 at 17:51
  • @Barmar The code structure including the `while` comes from that other question and is crucial. – Kelly Bundy Apr 27 '23 at 17:51
  • So the expense is from creating 100,000 set iterators. But `for _ in range(100000): i = iter(s)` doesn't take any time, either. Not sure where the cost really is. – Barmar Apr 27 '23 at 17:55
  • How much work does the garbage collector have to do from disposing the iterator after every broken loop? – slothrop Apr 27 '23 at 17:55
  • @slothrop Not much, that's not it. – Kelly Bundy Apr 27 '23 at 18:01
  • so when the set iterator advances, it goes through the hashtable checking each position until it finds an occupied one. With this "broken loop" style, it probably does quite a lot of work repeatedly checking and skipping the same empty positions on each new for-loop? https://github.com/python/cpython/blob/0b7fd8ffc5df187edf8b5d926cee359924462df5/Objects/setobject.c#L815 – slothrop Apr 27 '23 at 18:12
  • @slothrop Please remember the comment field's instruction *"Avoid answering questions in comments"* (some previous comments were borderline, but at least they were wrong). – Kelly Bundy Apr 27 '23 at 18:14
  • 2
    @slothrop to clarify, the number of empty positions it has to iterate over *increases* each time, hence the quadratic time behavior. See my answer for my proposal of why this behavior occurs. – juanpa.arrivillaga Apr 27 '23 at 18:21
  • @juanpa.arrivillaga *"so the error isn't raised until the next `__next__` or something?"* - Yes, it's the iterator's `__next__` method that *would* complain about the change during iteration. It's just not called again (for the same iterator), thanks to the break. – Kelly Bundy Apr 27 '23 at 19:42

1 Answer

7

I think this is happening because you are removing the first element of the set every time. This leaves the set increasingly empty at the front on each iteration, so each time you create a new iterator and call `__next__`, it has to search further and further to find an occupied entry.

So, here is the source code for the iterator's `__next__`.

It has to find the next entry like this:

while (i <= mask && (entry[i].key == NULL || entry[i].key == dummy))
    i++;

That is, the iterator's `__next__` advances `i` past empty and dummy slots until it reaches the first occupied entry.

So, say we have something like:

entries = [null, 1, null, 2, null, 3, null, 4, null, 5]

Then on each iteration of the while loop, you get:

entries = [null, 1, null, 2, null, 3, null, 4, null, 5]
entries = [null, DUMMY, null, 2, null, 3, null, 4, null, 5]
entries = [null, DUMMY, null, DUMMY, null, 3, null, 4, null, 5]
entries = [null, DUMMY, null, DUMMY, null, DUMMY, null, 4, null, 5]
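This scan can be modeled in pure Python. The `None`/sentinel representation below is a simplified stand-in for the C internals, not the real set layout:

```python
DUMMY = object()  # stand-in for CPython's internal dummy marker

def next_occupied(entries):
    """Scan from index 0, skipping empty (None) and dummy slots,
    the way a freshly created set iterator does."""
    i = 0
    while i < len(entries) and (entries[i] is None or entries[i] is DUMMY):
        i += 1
    return i

# Values interleaved with empty slots, as in the example above.
entries = [None, 1, None, 2, None, 3, None, 4, None, 5]

steps = []
while True:
    i = next_occupied(entries)  # a new iterator always starts at index 0
    if i == len(entries):
        break
    steps.append(i)             # how far this scan had to walk
    entries[i] = DUMMY          # "remove" the element
print(steps)  # [1, 3, 5, 7, 9] -- each scan walks farther than the last
```

The walk lengths grow linearly, and summing them over all removals gives the quadratic total.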

So each time, the iterator has to search further and further away from the beginning of the entries, since each iteration of the while loop removes the first remaining element. Hence, the quadratic time behavior.
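A quick timing sketch makes the quadratic growth visible next to `set.pop`, which avoids it (the helper names here are mine, not from the question):

```python
import timeit

def drain_via_iterator(n):
    # The pattern from the question: a fresh iterator every pass,
    # each one rescanning the hash table from index 0.
    s = set(range(n))
    while s:
        for x in s:
            s.remove(x)
            break
    return s

def drain_via_pop(n):
    # pop() resumes searching from where the previous pop stopped,
    # so it never rescans already-emptied slots.
    s = set(range(n))
    while s:
        s.pop()
    return s

for n in (25_000, 50_000, 100_000):
    t_iter = timeit.timeit(lambda: drain_via_iterator(n), number=1)
    t_pop = timeit.timeit(lambda: drain_via_pop(n), number=1)
    print(f"n={n}: iterator {t_iter:.3f}s, pop {t_pop:.3f}s")
```

Doubling `n` should roughly quadruple the iterator version's time, while the `pop` version grows about linearly.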

juanpa.arrivillaga
  • Yep, that's it (well, together with iteration starting at index 0, which is set in the iterator creation. Maybe it could use the set's "finger" like `pop` does (which makes it fast), didn't really look into that). – Kelly Bundy Apr 27 '23 at 18:25
  • Such a [Shlemiel](https://www.joelonsoftware.com/2001/12/11/back-to-basics/). – Kelly Bundy Apr 27 '23 at 18:27
  • At least on CPython (tested Py3.10.8), ``sys.getsizeof`` also clearly shows the set does not shrink after removing items. (PyPy doesn't offer `getsizeof` but also does not seem to suffer from the slowdown problem.) – MisterMiyagi Apr 27 '23 at 18:27
  • @KellyBundy yeah, I'm not really sure how the finger works exactly, so not sure why they don't use it. – juanpa.arrivillaga Apr 27 '23 at 18:28
  • 1
    Even it it could be used, I suspect they'd say "You should use `pop` instead anyway". – Kelly Bundy Apr 27 '23 at 18:30
  • 1
    The finger is just an index, stored in the set's metadata. Exclusively used by pop. After popping a value it stores the *next* position as the finger so that the next pop can start searching from there instead of from index 0. Repeated pops are a common enough use case that it's worth avoiding the slowness demonstrated here. – Kelly Bundy Apr 27 '23 at 18:40