
I want to chunk an input stream for batch processing. Given an input list or generator,

x_in = [1, 2, 3, 4, 5, 6 ...]

I want a function that will return chunks of that input. Say, if chunk_size=4, then,

x_chunked = [[1, 2, 3, 4], [5, 6, ...], ...]

This is something I do over and over, and I was wondering if there is a more standard way than writing it myself. Am I missing something in itertools? (One could solve the problem with enumerate and groupby, but that feels clunky.) In case anyone wants to see an implementation, here it is:

def chunk_input_stream(input_stream, chunk_size):
    """partition a generator in a streaming fashion"""
    assert chunk_size >= 1
    accumulator = []
    for x in input_stream:
        accumulator.append(x)
        if len(accumulator) == chunk_size:
            yield accumulator
            accumulator = []
    if accumulator:
        yield accumulator
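For reference, the enumerate/groupby route mentioned above might look like the following (a sketch; the name `chunk_with_groupby` and the index-based grouping key are my reading of that idea, not code from the question):

```python
from itertools import groupby

def chunk_with_groupby(input_stream, chunk_size):
    # group consecutive items by index // chunk_size, then strip the indices
    grouped = groupby(enumerate(input_stream), key=lambda pair: pair[0] // chunk_size)
    for _, group in grouped:
        yield [x for _, x in group]
```

It works, but the index bookkeeping is arguably the clunkiness the question is complaining about.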

Edit

Inspired by kreativitea's answer, here's a solution with islice, which is straightforward & doesn't require post-filtering,

from itertools import islice

def chunk_input_stream(input_stream, chunk_size):
    # input_stream must be an iterator (wrap a list with iter()),
    # since islice would otherwise restart from the front on each call
    while True:
        chunk = list(islice(input_stream, chunk_size))
        if chunk:
            yield chunk
        else:
            return

# test it with list(chunk_input_stream(iter([1, 2, 3, 4]), 3))
gatoatigrado

3 Answers

6

The `grouper` recipe from the itertools documentation:

from itertools import izip_longest  # zip_longest on 3.x

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)
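As the comments below note, on 3.x this is `zip_longest`, and matching the question's no-fill behavior takes a post-filter. A sketch (the sentinel-based `chunk_no_fill` wrapper is my addition, not part of the recipe):

```python
from itertools import zip_longest  # izip_longest on 2.x

_FILL = object()  # private sentinel, so legitimate None values survive the filter

def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def chunk_no_fill(iterable, n):
    # drop the padding to reproduce the question's ragged final chunk
    for group in grouper(n, iterable, fillvalue=_FILL):
        yield [x for x in group if x is not _FILL]
```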
Jon Clements
  • And, naturally, note [the slight difference of ``zip_longest`` for 3.x](http://docs.python.org/3/library/itertools.html#itertools-recipes). – Gareth Latty Nov 05 '12 at 20:14
  • 1
    Can't you use `itertools.repeat` instead of `[]*n`? – jpm Nov 05 '12 at 20:14
  • 1
    This will require a little tweaking for the OP's case-- the OP's code doesn't fill, IIUC. – DSM Nov 05 '12 at 20:14
  • @DSM Then filter ``None``, or add a sentinel value if you need ``None``s. – Gareth Latty Nov 05 '12 at 20:15
  • 1
    @jpm As the whole thing will be exhausted straight away into the ``zip_longest()`` call as arguments, I imagine the overhead from the generator will make it slower than the list multiplication. This way is simpler and probably faster. – Gareth Latty Nov 05 '12 at 20:16
  • For my particular purpose, I'd need to filter out the None's, but that's not a big deal. This solution is nice. On the downside, however, it's a little "clever" -- it's not immediately apparent what's going on, and it's not clear that cleverness buys you anything. If grouper was part of `itertools`, that'd be wonderful; since it's not, I'm a little more hesitant about using it vs. my straightforward solution. – gatoatigrado Nov 05 '12 at 20:27
  • To elaborate on what seems "clever", it's really important to know that `[iter(iterable)] * n` != `[iter(iterable), iter(iterable), ...]`. This is not obvious at first glance. – gatoatigrado Nov 05 '12 at 20:31
  • @Lattyware Unless I'm just completely misreading the izip_longest (or zip_longest for 3.x), it doesn't appear to consume the iterables until the actual iteration takes place. (This is why you can call it with infinite iterables, as long as you don't later iterate infinitely.) – jpm Nov 06 '12 at 01:06
  • @gatoatigrado If you have it as a function then people can read the comment and see the function name. This is the common way of doing it, and while it's non-obvious, it's clear, simple and fast once you grasp the concept. – Gareth Latty Nov 06 '12 at 11:02
  • @jpm No, but the generator here (``itertools.repeat(iter(iterable))``) would be the arguments to ``izip_longest()``/``zip_longest``, and would be consumed by the ``*`` operator to produce the arguments for the function before it even ran. – Gareth Latty Nov 06 '12 at 11:04
  • @Lattyware, TBH your comment comes off as a little patronizing, and a little obtuse as I use a docstring in the question's example code. Taking it generously, I agree names & docstrings are good "go here first" documentation, but sometimes it's even faster for me to read code, if that code is super-clear. I'll echo what my boss said: it's nice to put the clever parts where you need them, and let everything else (like this helper function) be straightforward. Of course, copying a standard solution is usually OK, but in other cases, I think straightforward code is less buggy, too. – gatoatigrado Nov 07 '12 at 07:37
  • 1
    @gatoatigrado There was no intent to be patronizing. I really see no solution here that reads any better than this solution, to be honest. – Gareth Latty Nov 07 '12 at 10:57
  • @Lattyware, sorry for misinterpreting then. In case you didn't notice, I explained above why I didn't like this solution. I suppose my preference against this solution partly comes from having spent some time in FP languages, where it's safe to assume `repeat 3 x` is the same as [x, x, x]. Python has this too for immutable values. If the solution were written, `input_iterator = iter(iterable); args = [input_iter] * n`, I'd like it a little better. But why don't you like the first solution? I think it's even more elegant and answers the question precisely. – gatoatigrado Nov 07 '12 at 17:16
4

[Updated version thanks to the OP: I've been throwing yield from at everything in sight since I upgraded and it didn't even occur to me that I didn't need it here.]

Oh, what the heck:

from itertools import takewhile, islice, count

def chunk(stream, size):
    return takewhile(bool, (list(islice(stream, size)) for _ in count()))

which gives:

>>> list(chunk((i for i in range(3)), 3))
[[0, 1, 2]]
>>> list(chunk((i for i in range(6)), 3))
[[0, 1, 2], [3, 4, 5]]
>>> list(chunk((i for i in range(8)), 3))
[[0, 1, 2], [3, 4, 5], [6, 7]]

Warning: the above suffers the same problem as the OP's chunk_input_stream if the input is a list. You could get around this with an extra iter() wrap but that's less pretty. Conceptually, using repeat or cycle might make more sense than count() but I was character-counting for some reason. :^)
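Concretely, the extra iter() wrap would be:

```python
from itertools import takewhile, islice, count

def chunk(stream, size):
    it = iter(stream)  # the extra iter() makes list inputs safe too:
                       # all islice calls now consume one shared iterator
    return takewhile(bool, (list(islice(it, size)) for _ in count()))
```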

[FTR: no, I'm still not entirely serious about this, but hey-- it's a Monday.]

DSM
  • You don't need the `yield from`; this can work perfectly well in Python 2.x if you just `return takewhile...`. Make that edit and I'll mark this as the correct answer. Also, you might include the import line for completeness `from itertools import takewhile, islice, count`. Your solution is concise, actually pretty straightforward (see my comments as to why Jon's is not), and works -- Thank you!! – gatoatigrado Nov 05 '12 at 21:35
1

Is there any reason you're not using something like this?:

# data is your stream, n is your chunk length
[data[i:i+n] for i in xrange(0,len(data),n)]
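For instance, on Python 3 (range rather than xrange) this gives:

```python
data, n = [1, 2, 3, 4, 5, 6, 7], 3
chunks = [data[i:i+n] for i in range(0, len(data), n)]
# chunks == [[1, 2, 3], [4, 5, 6], [7]]
```

Note this only works for sequences that support len() and slicing, not for generators, which is the case the question asks about.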

edit:

Since people are making generators....

def grouper(data, n):
    results = [data[i:i+n] for i in xrange(0,len(data),n)]
    for result in results:
        yield result

edit 2:

I was thinking: if you have the input stream in memory as a deque, you can .popleft() very efficiently to yield n objects at a time.

from collections import deque
stream = deque(data)

def chunk(stream, n):
    """ Returns the next chunk from a data stream. """
    return [stream.popleft() for i in xrange(n)]

def chunks(stream, n, reps):
    """ If you want to yield more than one chunk. """
    for item in [chunk(stream, n) for i in xrange(reps)]:
        yield item
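One caveat: as written, chunk raises IndexError when fewer than n items remain, and chunks needs reps known up front. A guarded variant of the same deque idea might look like this (a sketch; names reused from above):

```python
from collections import deque

def chunk(stream, n):
    """Pop up to n items off the left of a deque; the final chunk may be short."""
    return [stream.popleft() for _ in range(min(n, len(stream)))]

def chunks(stream, n):
    """Yield chunks until the deque is exhausted, so no reps argument is needed."""
    while stream:
        yield chunk(stream, n)
```

Note that this consumes the deque as it goes, unlike the iterator-based solutions above.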
kreativitea