0

Iterable objects are those that implement the __iter__ method, which returns an iterator object, i.e. an object that provides __iter__ and __next__ and behaves correctly. Usually the size of an iterable is not known beforehand, and an iterable is not expected to know how long the iteration will last. However, in some cases knowing the length of the iterable is valuable, for example when creating an array. list(x for x in range(1000000)), for example, creates an initial array of small size, copies it once it is full, and repeats this many times, as explained here. Of course, it is not that important in this example, but it illustrates the point.
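The repeated growth can be observed directly. The following is only a rough sketch, assuming CPython (where sys.getsizeof reflects a list's allocated capacity); the helper name growth_points is made up for illustration, and it uses append rather than list(generator), which I assume over-allocates in a similar way:

```python
import sys

def growth_points(n):
    """Record (length, size in bytes) each time the list is reallocated.

    Hypothetical helper for illustration; the sizes are CPython-specific.
    """
    out = []
    points = []
    last_size = sys.getsizeof(out)
    for i in range(n):
        out.append(i)
        size = sys.getsizeof(out)
        if size != last_size:
            # The reported size changed, so the list was resized.
            points.append((len(out), size))
            last_size = size
    return points

growth = growth_points(1000)
print(len(growth), "reallocations while appending 1000 items")
```

The number of reallocations is much smaller than the number of appends, but each one may copy the existing references, which is the cost a known length could avoid.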

Is there a protocol in use for iterable objects that know their length beforehand? That is, is there a protocol extending Sized and Iterable but not Collection or Reversible? There seems to be no such protocol among the language features; is there one in well-known third-party libraries? How does this discussion relate to generators?

Fırat Kıyak
  • Have a look: https://stackoverflow.com/questions/5384570/whats-the-shortest-way-to-count-the-number-of-items-in-a-generator-iterator – PM 77-1 Jun 03 '21 at 20:49
  • @PM77-1 I do not want to know the length of an arbitrary generator. I am coding the generator and I know what its size will be. I want `len(generator)` to return that number. – Fırat Kıyak Jun 03 '21 at 20:53
  • Please explain what is "verbose" about that solution. You're defining a new class; you inherit from an existing class, or provide your own (trivial) iterator. Either way, you override the inherited (or default) `__len__` function. – Prune Jun 03 '21 at 20:56
  • @Prune Creating a class for a simple iteration procedure is verbose in the literal sense. There are too many words that complicate the code. Compare that with the syntax of a generator. – Fırat Kıyak Jun 03 '21 at 21:08
  • You asked only whether it was _possible_… – martineau Jun 03 '21 at 21:21
  • Are you comparing your new class with the syntax of *instantiating* a generator? Why would you expect those to be on the same order? Compare instead to the generator *definition* -- better yet, to any other single-method class extension. – Prune Jun 03 '21 at 21:40
  • @Prune You are totally right. I have improved the question thanks to your comments. – Fırat Kıyak Mar 10 '22 at 18:02

2 Answers

2

It sounds like you're asking about something like __length_hint__. Excerpts from PEP 424 – A method for exposing a length hint:

CPython currently defines a __length_hint__ method on several types, such as various iterators. This method is then used by various other functions (such as list) to presize lists based on the estimate returned by __length_hint__. Types which are not sized, and thus should not define __len__, can then define __length_hint__, to allow estimating or computing a size (such as many iterators).

Being able to pre-allocate lists based on the expected size, as estimated by __length_hint__, can be a significant optimization. CPython has been observed to run some code faster than PyPy, purely because of this optimization being present.

For example, range iterators support this:

it = iter(range(1000))
print(it.__length_hint__())     # prints 1000
next(it)
print(it.__length_hint__())     # prints 999

And list iterators even take list length changes into account:

a = [None] * 10
it = iter(a)
print(it.__length_hint__())     # prints 10
next(it)
print(it.__length_hint__())     # prints 9
a.pop()
print(it.__length_hint__())     # prints 8
a.append(None)
print(it.__length_hint__())     # prints 9
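
Instead of calling __length_hint__ directly, the hint can also be queried through operator.length_hint, which PEP 424 added in Python 3.4. A small sketch (the variable names are only illustrative):

```python
import operator

# Range iterators define __length_hint__, so the hint is exact here.
range_it = iter(range(1000))
hint_range = operator.length_hint(range_it)

# Generator iterators define neither __len__ nor __length_hint__,
# so the default is returned (0 unless another default is given).
gen = (x for x in range(1000))
hint_gen = operator.length_hint(gen)
hint_gen_default = operator.length_hint(gen, 64)

# For sized objects, length_hint simply uses len().
hint_list = operator.length_hint([1, 2, 3])

print(hint_range, hint_gen, hint_gen_default, hint_list)
```

operator.length_hint tries len() first, then __length_hint__, and finally falls back to the default, so a consumer can use it on any object without checking which of the two it supports.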

Generator iterators don't support it, but you can support it in other iterators you write. Here's a demo iterator that...

  • Produces 10,000 elements.
  • Hints at having 5,000 elements.
  • After every 1,000 elements it shows the memory size of the list being built.

import gc

beacon = object()

class MyIterator:
    def __init__(self):
        self.n = 10_000
    def __iter__(self):
        return self
    def __length_hint__(self):
        print('__length_hint__ called')
        return 5_000
    def __next__(self):
        if self.n == 0:
            raise StopIteration
        self.n -= 1
        if self.n % 1_000 == 0:
            for obj in gc.get_objects():
                if isinstance(obj, list) and obj and obj[0] is beacon:
                    print(obj.__sizeof__())
        return beacon

list(MyIterator())

Output:

__length_hint__ called
45088
45088
45088
45088
45088
50776
57168
64360
72456
81560

We see that list asks for a length hint and from the start pre-allocates enough memory for 5,000 references of 8 bytes each, plus 12.5% overallocation. After the first 5,000 elements, it doesn't ask for length hints anymore, and keeps increasing its size bit by bit.

If my __length_hint__ instead accurately returns 10,000, then list instead pre-allocates 90088 bytes and that remains until the end.
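
If the iteration logic is already written as a generator, one possible way to attach a hint is a thin wrapper. This is just a sketch under my own assumptions: the class name HintedIterator and its parameters are made up, and the caller is trusted to pass the correct length:

```python
import operator

class HintedIterator:
    """Wrap any iterable and report a caller-supplied remaining length.

    Hypothetical helper, not part of the standard library.
    """
    def __init__(self, iterable, length):
        self._it = iter(iterable)
        self._remaining = length

    def __iter__(self):
        return self

    def __next__(self):
        value = next(self._it)   # propagates StopIteration unchanged
        self._remaining -= 1
        return value

    def __length_hint__(self):
        # A hint must be non-negative, even if the given length was too small.
        return max(self._remaining, 0)

squares = HintedIterator((x * x for x in range(5)), 5)
hint_before = operator.length_hint(squares)
first = next(squares)
hint_after = operator.length_hint(squares)
rest = list(squares)   # list() can ask for the hint to presize itself
print(hint_before, first, hint_after, rest)
```

Since the hint is only an estimate, a wrong length does not break list(); it only makes the pre-allocation less useful.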

Kelly Bundy
  • Indeed! I compared the running times of converting two iterators of the same length, 10^8, to lists, where one of them supports __len__. It was around 10% faster. – Fırat Kıyak Mar 13 '22 at 10:51
  • @FıratKıyak Did you mean to say `__len__` or did you mean `__length_hint__`? Can you share your code? Tried it myself now but I'm actually having trouble producing any significant speed difference. I'm trying this with Python code, which is just too slow in comparison and dominates the runtime. – Kelly Bundy Mar 13 '22 at 15:06
  • I ran "%timeit list(X(10**8))" for 3 distinct classes for X. Every one of them is an iterator sharing the same code for __iter__, __next__, and __init__, which imitate the range iterator. One of them implements __len__ and gives the number of remaining terms in the iteration. Another one does the same, but with the __length_hint__ method instead of __len__. The third one implements neither of the two methods. Both classes implementing a length method outperformed the one that didn't, running about 5%-10% faster. A comment is too small for sharing the code. – Fırat Kıyak Mar 13 '22 at 19:03
  • @FıratKıyak That sounds like what I'm doing. Can you put it for example on [tio.run](https://tio.run/##1ZLBasMwDIbvfgod7RLKusAYhTzADnsG47ZKYxbLRlFgffrMicuWEdb7dLP0/b9k2ekmXaT6NfE0tRwDiA/oBXxIkQUYEzpRiqCBw9Nu96LUuXfDAO@3N0F2EvlwVJDjgi1Y68mLtXrAvjUlP8d83A9x5DNmH5@Fmh1dUZMxe2sJP7No7ZKJrQujjEyL2Yot4gfsvbE229Gf/83oP6Ie6U8N/eau0tnOkzzgVRsZLHiCcqv6Ds3pj2VbubR662q7vTkk7yZ40uW76N6F08UdofeD6MVGG1MBjeGE3GSTwjW1Md8WifOkWqrSdl6tC2htAUrRTNMX) or ideone.com or some pastebin? – Kelly Bundy Mar 13 '22 at 21:12
  • https://pastebin.com/7gFpZrCe – Fırat Kıyak Mar 13 '22 at 21:42
  • @FıratKıyak Thanks. I tried that on Google Colab, there it does seem to make a 5% difference. And with mine from my previous comment I see about a 10% difference there as well. – Kelly Bundy Mar 13 '22 at 22:44
1

If I now understand your question, you're still trying to combine two concepts that don't combine in quite this way. A generator is a subclass of iterator; it's a process. len applies to data objects -- in particular, to the iterable object, as opposed to the iterator that traverses the object.

Therefore, a generator doesn't really have a length of its own. It returns a sequence of values, and that sequence has a length (when the generator finishes). Can you describe the concept you have of "generator with length" -- if it differs from what I just described?

If you keep that distinction in mind, then yes, you can implement __len__ as an extension to your class. You can add anything you like -- say, a sqrt function (See Conway's surreal numbers for details).
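
As a rough sketch of that idea, the length can live on the iterable while __iter__ stays a plain generator function. The names SizedIterable and Squares below are hypothetical, and the protocol is my own illustration built with typing.Protocol, not something the standard library provides:

```python
import operator
from typing import Iterator, Protocol, TypeVar, runtime_checkable

T_co = TypeVar("T_co", covariant=True)

@runtime_checkable
class SizedIterable(Protocol[T_co]):
    """Hypothetical protocol: Sized plus Iterable, nothing more."""
    def __len__(self) -> int: ...
    def __iter__(self) -> Iterator[T_co]: ...

class Squares:
    """An iterable (not an iterator) that knows its length up front."""
    def __init__(self, n: int) -> None:
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __iter__(self) -> Iterator[int]:
        # Writing __iter__ as a generator function keeps the class short.
        for i in range(self.n):
            yield i * i

sq = Squares(4)
print(len(sq), list(sq))
print(isinstance(sq, SizedIterable))                       # structural check
print(isinstance((x for x in range(3)), SizedIterable))    # generators have no __len__
print(operator.length_hint(sq))                            # falls back to len()
```

Because operator.length_hint uses len() when it is available, consumers that ask for a hint can benefit from __len__ on the iterable without a separate __length_hint__.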

Prune
  • Even though there is a distinction between generator function and generator iterator: https://docs.python.org/3/glossary.html#term-generator , you are right either way. The generators are not iterable. I have realized generators obscure the question I am asking, and I edited the question greatly, hoping that what I am asking is now clear. – Fırat Kıyak Mar 11 '22 at 22:22