Count runs lazily in Python translate from Haskell

Question

I'm trying to write a generator function (or achieve the equivalent) which takes an iterable xs in Python and counts the "runs". (This is a problem in Thinking Functionally with Haskell by Bird, which I want to translate into Python using Python's laziness features.) So

list(iter(count_runs(['a', 'a', 'b', 'c', 'a', 'd', 'd'])))
# => [(2, 'a'), (1, 'b'), (1, c'), (1, 'a'), (2, 'd')]

In Haskell it's

countRuns :: [a] -> [(Int, a)]
countRuns [] = []
countRuns x:xs = (1 + length us, x):countRuns vs
               where us, vs = span (==x) xs

In Python, I'd like to write somethin like

from itertools import takewhile, dropwhile

def count_runs(xs):
  # get first element x of xs, if it exists
  us, vs = (takewhile(lambda y: y==x, xs), 
            dropwhile(lambda y: y==x, xs))
  yield (1 + len(list(us)), x)
  yield from count_runs(vs)

But the problem is that vs is an iterator already, so I will run into trouble if I call takewhile and dropwhile on it in the next recursion. (When I call list(takewhile(..., xs)) in the next recursion, it will get rid of the first element of dropwhile(..., xs) as well, because they're both looking at the same iterator.

How can I fix this issue, and what is the correct way to get the first element in the second line?

You don't really need `takewhile` and `dropwhile`, because Python iterators are mutable. However, `takewhile` also isn't using pattern-matching, so it will consume a non-`x` value from `xs` in order to stop yielding values. — chepner, Oct 21 '21 at 14:19
`itertools.groupby` already does most of the work you need for this. — chepner, Oct 21 '21 at 14:21
I guess I'm a bit confused. If the problem is that doing `takewhile` will cause `dropwhile` to start where you want the next `takewhile` to start, then just... don't call `dropwhile` at all? `us = takewhile(lambda y: y==x, xs); yield (1+len(list(us)), x); yield from count_runs(xs)` (Not putting this in an answer because I don't know Python and so there could be a subtle reason not to do this.) — Daniel Wagner, Oct 21 '21 at 15:47

score 1 · Accepted Answer · edited Oct 27 '21 at 14:33

A significant difference between span and takewhile is that takewhile consumes the first non-x value in order to determine when to stop yielding values. As a result, you'll lose any singleton items in the input; in particular, takewhile loses the first b in producing the leading set of as. The iterator protocol has no way of peeking at the next element of an iterator, nor does it have a way to put back an element it consumes.

Instead, you'll need two independent iterators: one for takewhile to produce the desired prefix, and another from which to drop that prefix for the recursive call.

def count_runs(xs):
    try:
        x = next(xs)
    except StopIteration:
        return

    t1, t2 = tee(xs)

    us = list(takewhile(lambda y: y == x, t1))
    yield (1 + len(us), x)
    yield from count_runs(dropwhile(lambda y: y == x, t2))

(Note that the itertools documentation implements something similar to span in its recipe section as a before_and_after function. It doesn't use tee, but I refer you to the actual implementation for details).

def before_and_after(xs):
    ...

def count_runs(xs):
    try:
        x = next(xs)
    except StopIteration:
        return

    first, second = before_and_after(lambda y: y == x, xs)
    yield (1 + len(list(first)), x)
    yield from count_runs(second)

)

However, most of this work is already done for you by itertools.groupby.

def count_runs(xs):
    yield from ((len(list(v)), k) for k, v in groupby(xs))

If you want to avoid creating lists just to take their length, you could replace that bit with `sum(1 for _ in v)` — kindall, Oct 21 '21 at 16:31

Count runs lazily in Python translate from Haskell

1 Answers1