
I'm running the following code in an ipython session:

# This call is slow, but that is expected. (It loads 3 GB of data.)
In [3]: arc, arc_sub, upls, go = foo_mod.ready_set()

# This call is also slow, as `upls` is huge.
In [4]: upls = list(upls)

# This call is slow in meatspace, but `%timeit` doesn't notice!
In [5]: %timeit -n1 -r1 len(upls)
1 loops, best of 1: 954 ns per loop

%timeit is straight-up lying here. With or without %timeit, the command takes upwards of 10 s of wall-clock time to actually run. Only the first time, however; subsequent calls to len are quick.

Even time.time() sings a similar tune:

In [5]: import time

In [6]: s = time.time(); len_ = len(upls); e = time.time()

In [7]: e - s
Out[7]: 7.104873657226562e-05

But it took seconds in the real world for In [6] to actually complete. I just don't seem to be able to capture where the actual time is being spent!

There's nothing terribly unusual about the list, aside from the fact that it's huge: it's a real list, and it holds ~¼ billion bson.ObjectId objects. (Prior to the list() call it's a set object; that call is also slow, but that makes sense: list(<set instance>) is O(n), and my set is huge.)
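To make those asymptotics concrete, here is a toy sketch (plain ints stand in for the real bson.ObjectId values, and the sizes are nothing like ¼ billion; none of this is the actual code):

```python
import time

# Toy stand-in for the huge set: plain ints instead of bson.ObjectId.
s = set(range(2000000))

t0 = time.time()
lst = list(s)        # O(n): copies a reference to every element
t1 = time.time()

t2 = time.time()
n = len(lst)         # O(1): just reads the list's stored ob_size
t3 = time.time()

print("list(set): %.4fs" % (t1 - t0))
print("len(list): %.6fs" % (t3 - t2))
```

Even at this small scale, the list() conversion dwarfs the len() call.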

Edit re GC

If I run gc.set_debug(gc.DEBUG_STATS) just prior to ready_set, which itself is a slow call, I see tons of GC cycles. This is expected. gen3 grows:

gc: objects in each generation: 702 701 3289802
gc: done, 0.0000s elapsed.
gc: collecting generation 0...
gc: objects in each generation: 702 1402 3289802
gc: done, 0.0000s elapsed.
gc: collecting generation 0...
gc: objects in each generation: 702 2103 3289802

Unfortunately the console output makes the runtime of this impossibly slow. If I instead delay the gc.set_debug call until just after ready_set, I don't see any GC cycles, but gc.get_count() claims the generations are tiny:

In [6]: gc.get_count()
Out[6]: (43, 1, 193)

In [7]: len(upls)
Out[7]: 125636395

(But why/how does get_count report fewer objects than what's in the list? They're definitely all unique, since they just went through a set…) The fact that involving gc in the code makes len speedy leads me to believe I'm paused for a collect-the-world.
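One thing worth noting about gc.get_count(): it reports per-generation allocation counters since the last collection of that generation, not the number of live objects, which is why it can be far smaller than len(upls). A quick sketch:

```python
import gc

gc.collect()                  # full collection resets the counters
before = gc.get_count()       # (gen0, gen1, gen2) counters, not live objects

# Allocate a batch of container objects; only allocations since the last
# collection bump gen0, and a threshold-triggered gen0 sweep resets it
# again while incrementing gen1.
junk = [[i] for i in range(1000)]

after = gc.get_count()
print(before, after)          # the counters moved, but neither value is
                              # "how many objects exist"
```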

(Versions, just in case:

Python 2.7.6 (default, Mar 22 2014, 22:59:56)
IPython 3.2.0 -- An enhanced Interactive Python.

)

Thanatos
    That's bizarre. Does the same thing happen in the regular interactive interpreter? – user2357112 Feb 03 '16 at 23:11
    lists know their size. once it's in a list, len(thelist) is O(1) – Chad S. Feb 03 '16 at 23:15
  • @ChadS. For lists to know their size they have to calculate it at some point. I think that caching happens at first call to `len()`. – freakish Feb 03 '16 at 23:17
    No. they just have to keep track of when items are inserted and removed. It doesn't calculate it when you call len. – Chad S. Feb 03 '16 at 23:18
  • @ChadS. That would be insanely inefficient. And a huge waste if you never call `len()`. – freakish Feb 03 '16 at 23:18
    No it wouldn't, and knowing the len() of the list is needed for nearly ANY list operation. – Chad S. Feb 03 '16 at 23:19
  • @ChadS. Dude, it would. I know that people nowadays think that simple addition costs nothing but it does if you modify list all the time. Plus for what other operation needs it? – freakish Feb 03 '16 at 23:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/102530/discussion-between-chad-s-and-freakish). – Chad S. Feb 03 '16 at 23:20
    At this point, I've read the source code for Python 2.7. `list` knows its size intrinsically. The implementation of `len` for `list` is very simple. :-/ Unfortunately, it doesn't match the behavior I'm seeing. – Thanatos Feb 03 '16 at 23:20
  • Do you happen to be running in an IDE like Spyder? – CookieOfFortune Feb 08 '16 at 20:11

1 Answer


I will summarize the comments on your question into an answer.

As everyone said (and as you pointed out), Python's list object knows its size and simply returns the stored number:

static Py_ssize_t
list_length(PyListObject *a)
{
    return Py_SIZE(a);
}

where Py_SIZE is defined as:

Py_SIZE(o)

This macro is used to access the ob_size member of a Python object. It expands to: (((PyVarObject*)(o))->ob_size)
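That O(1) behaviour is easy to confirm in isolation; a sketch with toy lists (sizes nowhere near the real 125-million-element one):

```python
import timeit

small = list(range(10))
big = list(range(1000000))

# len() just reads ob_size, so the per-call cost is independent of length.
t_small = min(timeit.repeat(lambda: len(small), number=100000, repeat=3))
t_big = min(timeit.repeat(lambda: len(big), number=100000, repeat=3))

print("small: %.4fs  big: %.4fs" % (t_small, t_big))
```

Both timings come out essentially equal, regardless of list size.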

So I can conclude it should not do any calculations. The only suspect is the object you are trying to convert to a list. But if you are sure it is really a list, without any fake objects simulating its methods with lazy calculations, then that is not it.

So I assume that all the timeit methods are really showing the exact time spent calling the len function.

And the only time-wasting process left is… the garbage collector. At the end of your measurements it discovers that no one uses such a large piece of data any more and starts releasing the memory. Sure, that takes some seconds.
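If the pause really is a collect-the-world, it is straightforward to provoke one deliberately and to suppress the automatic ones (a sketch; note that gc.disable() only stops the automatic cyclic collector, while reference counting still frees memory as usual):

```python
import gc
import time

# Many small container objects for the collector to traverse.
objs = [[i] for i in range(1000000)]

t0 = time.time()
gc.collect()                 # force a full collection: walks every container
t1 = time.time()
print("full gc.collect(): %.3fs" % (t1 - t0))

# A common workaround while building or consuming a huge, cycle-free
# structure is to switch the automatic collector off around the hot path:
gc.disable()
# ... the len()/iteration-heavy work would go here ...
gc.enable()
```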

baldr