I would like your advice on the most Pythonic way to express the following function in Python with type hints:

I'd like to expose a function as part of a library that accepts an input argument and returns an output. The contract for the input argument should be that:

  • my function can iterate over it
  • it's ok if my function maintains a reference to the input (e.g. by returning an object that keeps that reference)
  • it's ok to iterate over the input more than once

An example could be a function that accepts a sequence of URLs and then issues requests to these URLs, potentially with some retry logic so I'd have to iterate the original sequence more than once. But my question is more generic than just this sample.

At first glance a suitable signature would be:

from typing import Iterable

def do_sth(input: Iterable[str]) -> SomeResult:
  ...

However this violates the third requirement, because in python there is no guarantee that you can iterate over an Iterable more than once, e.g. because iterators and generators are themselves iterables.
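To make the problem concrete, here is a minimal sketch of what happens when a function iterates its argument twice. The helper `count_twice` and the placeholder values are made up for illustration only.

```python
from typing import Iterable, List


def count_twice(items: Iterable[str]) -> List[int]:
    """Iterate over ``items`` twice and report how many elements each pass saw."""
    first_pass = sum(1 for _ in items)
    second_pass = sum(1 for _ in items)
    return [first_pass, second_pass]


urls = ["a", "b", "c"]  # placeholder values, not real URLs

# A list hands out a fresh iterator on every pass.
list_counts = count_twice(urls)

# A generator is its own iterator, so the second pass finds it exhausted.
generator_counts = count_twice(u for u in urls)
```

Both calls satisfy the `Iterable[str]` annotation, but only the list gives the same elements on the second pass.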

Another attempt might be:

from typing import Sequence

def do_sth(input: Sequence[str]) -> SomeResult:
  ...

But then the Sequence contract is more than my function requires, because it includes indexed access and also the knowledge of the length.

A solution that came to my mind is to use the Iterable signature and then make a copy of the input internally. But this seems to introduce a potential memory problem if the source sequence is large.

Is there a solution to this, i.e. does Python have the concept of an Iterable that returns a new iterator each time?

Carsten
  • I wouldn't overthink this. Points 2 and 3, for most practical purposes, mean you need a list, or are going to turn whatever you get into a list. Just type `input` as `List[str]` and let the caller worry about how to create one from their possibly non-reiterable value. – chepner Jul 26 '20 at 19:37
  • Consider `itertools.cycle`: `foo2 = cycle(foo)` *internally* caches the elements of `foo` so that it can repeat them *ad infinitum*, but `foo` and `foo2` aren't independent anymore; calling `next` on one consumes an item from the other. Another example is `itertools.tee`; the documentation explicitly says you can't use `foo` reliably after `f1, f2 = tee(foo)`; you have to use `f1` and `f2` instead. – chepner Jul 26 '20 at 19:41
  • "However this violates the third requirement, because in python there is no guarantee that you can iterate over an Iterable more than once, e.g. because iterators and generators are themselves iterables." I think this forces the input to be immutable, so a tuple would be the right choice then – Pablo Henkowski Jul 26 '20 at 20:09
  • @chepner thanks, I guess I'd require a `Tuple` in this case since I'd like to have it immutable. The trouble I have with that approach is that this would then take an implementation class in the interface instead of the abstraction. – Carsten Jul 27 '20 at 12:58
  • `Collection` may be the closest you get, better than `Sequence` since it doesn't require indexing. – levsa Jun 09 '21 at 13:58
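
The `Collection` suggestion from the comments could look roughly like the sketch below. The body of `do_sth` is a placeholder that just iterates twice, and the return type is an assumption standing in for `SomeResult`. Note that `Collection` still requires `__len__` and `__contains__` in addition to `__iter__`, so it is a narrower contract than "re-iterable", but it does exclude generators.

```python
from collections import abc
from typing import Collection, List


def do_sth(urls: Collection[str]) -> List[str]:
    """Placeholder body: iterate twice, as a retry loop might."""
    seen = [url for url in urls]
    seen += [url for url in urls]
    return seen


result = do_sth(("a", "b"))  # a tuple is a Collection

# Lists qualify; generators do not, because they lack __len__ and __contains__.
is_list_collection = isinstance(["a"], abc.Collection)
is_generator_collection = isinstance((u for u in ["a"]), abc.Collection)
```

A static type checker should reject a generator argument here for the same reason the `isinstance` check does.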

2 Answers

There are two natural ways of representing this that I can think of.

The first would be to use Iterable[str] and mention in the documentation that Iterator and Generator objects should not be passed, since the function may call __iter__ more than once. The whole point of Iterable is that you can get an iterator from it, and arguably it was a mistake to make Iterator a subtype of Iterable in the first place. It's not perfect, but it is simple, which is usually more "pythonic" than a technically more correct annotation that is very complicated.

You can add some runtime checking that will alert the user that there is a problem if they pass the wrong thing:

iter1 = iter(input)
for item in iter1:
    do_something(item)
iter2 = iter(input)
if iter2 is iter1:  # an iterator returns itself from iter()
    raise ValueError(f"Must pass an iterable that can be iterated multiple times. Got {input!r}.")
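
As a variation on the same idea, the check can be done up front, before any work has happened. This is only a sketch: `require_reiterable` is a hypothetical helper name, and the test only catches objects that return themselves from `iter()` (iterators and generators). A custom iterable that hands out distinct but shared-state iterators would still slip through.

```python
from typing import Iterable, TypeVar

T = TypeVar("T")


def require_reiterable(items: Iterable[T]) -> Iterable[T]:
    """Raise if ``items`` is an iterator, i.e. if ``iter(items)`` is ``items`` itself."""
    if iter(items) is items:
        raise ValueError(
            f"Must pass an iterable that can be iterated multiple times. Got {items!r}."
        )
    return items


checked = require_reiterable(["a", "b"])  # passes through unchanged

try:
    require_reiterable(u for u in ["a", "b"])
    generator_rejected = False
except ValueError:
    generator_rejected = True
```

Calling `iter()` on an iterator does not consume any elements, so the check itself is harmless for either kind of input.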

Or check if you got Iterator, and handle it with a memory penalty:

from collections.abc import Iterator
from warnings import warn

if isinstance(input, Iterator):
    input = list(input)  # or itertools.tee or whatever
    warn("This may eat up a lot of memory")

The other option is to use io.TextIOBase. A seekable stream can be iterated over multiple times by seeking back to the beginning before each pass. This depends on your use case, and may not be a good fit. If conceptually the input is some kind of chunked view on a sequence of characters, then io streams are a good fit, even if the iterators don't technically return lines of text. If it's conceptually a sequence of strings which aren't contiguous, then streams aren't a good fit.
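
A small sketch of the stream approach, assuming the stream is seekable (an in-memory `io.StringIO` is used here as a stand-in; `read_twice` is a made-up helper):

```python
import io
from typing import List


def read_twice(stream: io.TextIOBase) -> List[List[str]]:
    """Iterate over the lines of a seekable text stream twice."""
    passes = []
    for _ in range(2):
        stream.seek(0)  # rewind; only meaningful if stream.seekable() is true
        passes.append([line.rstrip("\n") for line in stream])
    return passes


result = read_twice(io.StringIO("first\nsecond\n"))
```

Non-seekable streams such as pipes or sockets would not work with this, so the function may want to check `stream.seekable()` first.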

Lucas Wiman

You could accept a function that takes no arguments and returns an iterable. In terms of type hints, you would use a Callable.

From the documentation, if you are unfamiliar with Callable:

Frameworks expecting callback functions of specific signatures might be type hinted using Callable[[Arg1Type, Arg2Type], ReturnType].

Solution:

from typing import Callable, Iterable

def do_sth(get_input: Callable[[], Iterable[str]]) -> SomeResult:
    # ...
    pass

def main():
    do_sth(lambda: (str(i) for i in range(10)))

my function can iterate over it

def do_sth(get_input: Callable[[], Iterable[str]]) -> SomeResult:
    for item in get_input():
        pass

it's ok if my function maintains a reference to the input (e.g. by returning an object that keeps that reference)

Don't see why not.

def do_sth(get_input: Callable[[], Iterable[str]]) -> SomeResult:
    return dict(reference=get_input)

it's ok to iterate over the input more than once

def do_sth(get_input: Callable[[], Iterable[str]]) -> SomeResult:
    for i in range(10**82):
        for item in get_input():
            pass
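
If you would rather keep an `Iterable` parameter in the public signature, the same idea can be wrapped in a small class. This is just a sketch under that assumption; `Reiterable` is a hypothetical name, not something from the standard library.

```python
from typing import Callable, Generic, Iterable, Iterator, TypeVar

T = TypeVar("T")


class Reiterable(Generic[T]):
    """Hypothetical helper: an iterable that calls ``factory`` on every ``__iter__``."""

    def __init__(self, factory: Callable[[], Iterable[T]]) -> None:
        self._factory = factory

    def __iter__(self) -> Iterator[T]:
        # Each pass asks the factory for a new iterable and iterates over that.
        return iter(self._factory())


numbers = Reiterable(lambda: (str(i) for i in range(3)))
first = list(numbers)
second = list(numbers)
```

Because the generator expression is rebuilt on each pass, both passes see the same elements without the whole sequence being held in memory.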
oglehb