I need a function to split an iterable into chunks with the option of having an overlap between the chunks.

I wrote the following code, which gives the correct output but is quite slow. I can't figure out how to speed it up. Is there a better method?

def split_overlap(seq, size, overlap):
    '''(seq,int,int) => [[...],[...],...]
    Split a sequence into chunks of a specific size and overlap.
    Works also on strings! 

    Examples:
        >>> split_overlap(seq=list(range(10)),size=3,overlap=2)
        [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7], [6, 7, 8], [7, 8, 9]]

        >>> split_overlap(seq=range(10),size=3,overlap=2)
        [range(0, 3), range(1, 4), range(2, 5), range(3, 6), range(4, 7), range(5, 8), range(6, 9), range(7, 10)]

        >>> split_overlap(seq=list(range(10)),size=7,overlap=2)
        [[0, 1, 2, 3, 4, 5, 6], [5, 6, 7, 8, 9]]
    '''
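    # overlap >= size would never shorten seq, so the loop below would not terminate
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError('"size" must be an integer >= 1 and "overlap" must satisfy 0 <= overlap < size')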
    result = []
    while True:
        if len(seq) <= size:
            result.append(seq)
            return result
        else:
            result.append(seq[:size])
            seq = seq[size-overlap:]

Testing results so far:

l = list(range(10))
s = 4
o = 2
print(split_overlap(l,s,o))
print(list(split_overlap_jdehesa(l,s,o)))
print(list(nwise_overlap(l,s,o)))
print(list(split_overlap_Moinuddin(l,s,o)))
print(list(gen_split_overlap(l,s,o)))
print(list(itr_split_overlap(l,s,o)))

[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9), (8, 9, None, None)] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9], [8, 9]] #wrong
[[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]

%%timeit
split_overlap(l,7,2)
718 ns ± 2.36 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
4.02 µs ± 64.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.05 µs ± 102 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
3.89 µs ± 78.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
1.22 µs ± 13.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
3.41 µs ± 36.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With longer list as input:

l = list(range(100000))

%%timeit
split_overlap(l,7,2)
4.27 s ± 132 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
list(split_overlap_jdehesa(l,7,2))
31.1 ms ± 495 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit
list(nwise_overlap(l,7,2))
5.74 ms ± 66 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(split_overlap_Moinuddin(l,7,2))
16.9 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(gen_split_overlap(l,7,2))
4.54 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
list(itr_split_overlap(l,7,2))
19.1 ms ± 240 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

From other tests (not reported here), it turned out that for small lists (len(list) <= 100) my original implementation split_overlap() is the fastest, but for anything larger gen_split_overlap() is the most efficient solution so far.

alec_djinn
  • Do you actually need to build and return a list of lists? The pattern in `itertools` is to return an iterator of tuples, and evaluate lazily. Anyway, if this is **working code** that you think could be improved, maybe see [codereview.se]. – jonrsharpe Jan 22 '18 at 12:43
  • An iterator would be fine as well – alec_djinn Jan 22 '18 at 12:44
  • Actually it's not quite working, at least not as documented, because it only works on *sequences*, not iterables. – jonrsharpe Jan 22 '18 at 12:45
  • range() is an iterable – alec_djinn Jan 22 '18 at 12:46
  • That's true, but not all things that are iterable support `[:size]`. Sequences are iterable, but not all iterables are sequences. – jonrsharpe Jan 22 '18 at 12:46
  • ["... `range` is actually an immutable sequence type, ..."](https://docs.python.org/3/library/functions.html#func-range) – Ilja Everilä Jan 22 '18 at 12:47
  • @jonrsharpe Yep! Good point. One more thing to fix! – alec_djinn Jan 22 '18 at 12:47
  • This should probably be on Code Review, not SO. – internet_user Jan 22 '18 at 12:48
  • I'd recommend you spend some time looking at e.g. https://docs.python.org/3/library/itertools.html#itertools-recipes and think about how you could rewrite this to be more general. I don't think this is specific enough for SO, or working enough for CR. – jonrsharpe Jan 22 '18 at 12:49
  • I have corrected the description. Let's focus on sequences as input for now. It can return an iterable. – alec_djinn Jan 22 '18 at 12:50
  • `seq = seq[size-overlap:]` is going to spend a lot of time making copies (given large sequences). Perhaps use indices instead and copy just what you need to yield. – Ilja Everilä Jan 22 '18 at 12:51
  • When you say "quite inefficient," what do you mean? – Ned Batchelder Jan 22 '18 at 12:51
  • @jonrsharpe I thought it was not good enough for CR as well, that is why I am here. – alec_djinn Jan 22 '18 at 12:51
  • @NedBatchelder It takes too much time. I was looking for something faster. The memory usage may also be a problem in case of long lists (500k items or more). But those are my main inputs and I can't change it now. I can access a large quantity of memory (64 or 128 GB) so I wasn't worried about memory at first. – alec_djinn Jan 22 '18 at 12:54
  • @alec_djinn well, I must say I'm disappointed by the lack of performance of the complete `itertools` (nwise) solution. Never mind. – James Schinner Jan 22 '18 at 13:43
  • I'm glad I seem to be getting a good score, but just to be sure, did you wrap my function in a `list`? It is a generator, so just calling the function will only create the generator but not actually do the work. – jdehesa Jan 22 '18 at 13:44
  • @jdehesa good point. I didn't. I'll fix it and update the results – alec_djinn Jan 22 '18 at 13:46
  • You should push the limits and test on a list of at least 100000 items. Your original code crunches a million item list for ~2 minutes on this machine. – Ilja Everilä Jan 22 '18 at 14:07
  • @IljaEverilä running it now.. will update soon – alec_djinn Jan 22 '18 at 14:08
  • @alec_djinn you don't need to type-cast using `list` in my first solution, which will further improve its performance ;) (for small lists), *(whereas my second solution will work well for huge lists)* – Moinuddin Quadri Jan 22 '18 at 14:10
  • @MoinuddinQuadri I am quite puzzled by the results. Am I running the tests in the wrong way? – alec_djinn Jan 22 '18 at 14:15
  • @alec_djinn Not completely wrong. For my first solution you should time `split_overlap_Moinuddin(l,7,2)` instead of `list(split_overlap_Moinuddin(l,7,2))`, because it already returns a list; with a huge input, `list(...)` creates a new list and adds to the time. For my second solution, however, you do need to wrap it in `list(..)` because it is a generator. – Moinuddin Quadri Jan 22 '18 at 14:20
  • Wow, that's a nice unexpected result – James Schinner Jan 22 '18 at 14:38
  • @JamesSchinner Yes, really unexpected also for me. – alec_djinn Jan 22 '18 at 14:39
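
The copying cost mentioned in the comments can be estimated without timing anything. The sketch below only counts how many items get copied for a list input; the helper names `copied_by_reslicing` and `copied_by_indexing` are made up for illustration, and it assumes `overlap < size`.

```python
def copied_by_reslicing(n, size, overlap):
    # Items copied when each step does seq[:size] and seq = seq[size - overlap:].
    copied, remaining = 0, n
    while remaining > size:
        copied += size                 # seq[:size]
        remaining -= size - overlap
        copied += remaining            # seq[size - overlap:] copies the whole tail
    return copied


def copied_by_indexing(n, size, overlap):
    # Items copied when only seq[i:i + size] is taken for each start index.
    return sum(min(size, n - i) for i in range(0, n - overlap, size - overlap))


for n in (10, 1000, 100000):
    print(n, copied_by_reslicing(n, 7, 2), copied_by_indexing(n, 7, 2))
```

The first count grows roughly with the square of the input length and the second grows linearly, which fits the timings reported for the 100000-item list.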

4 Answers

6

Sometimes readability counts more than raw speed. A simple generator that iterates over start indices and yields slices gets the job done in reasonable time:

def gen_split_overlap(seq, size, overlap):
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError('size must be >= 1 and 0 <= overlap < size')

    # Walk over the start index of each chunk and slice it out of seq
    for i in range(0, len(seq) - overlap, size - overlap):
        yield seq[i:i + size]
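
As a sanity check, the index-based generator can be compared against the implementation from the question on a grid of small inputs. This is only a sketch: the grid bounds are arbitrary, and inputs no longer than the overlap are skipped on purpose, because there the generator yields nothing while the original returns the whole input as a single chunk.

```python
def split_overlap(seq, size, overlap):
    # Reference implementation from the question (re-slices the remainder).
    result = []
    while True:
        if len(seq) <= size:
            result.append(seq)
            return result
        result.append(seq[:size])
        seq = seq[size - overlap:]


def gen_split_overlap(seq, size, overlap):
    for i in range(0, len(seq) - overlap, size - overlap):
        yield seq[i:i + size]


# Compare both implementations on a grid of small inputs.
mismatches = []
for n in range(1, 30):
    for size in range(1, 8):
        for overlap in range(0, size):
            if n <= overlap:
                continue  # assumption: only inputs longer than the overlap are compared
            data = list(range(n))
            expected = split_overlap(data, size, overlap)
            if expected != list(gen_split_overlap(data, size, overlap)):
                mismatches.append((n, size, overlap))

print(mismatches)  # an empty list means the two agree on every tested case
```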

If you want to handle potentially infinite iterables, you just have to keep overlap items from the previous yield and slice size - overlap new items:

from itertools import islice

def itr_split_overlap(iterable, size, overlap):
    itr = iter(iterable)

    # initial slice, in case size exhausts iterable on the spot
    next_ = tuple(islice(itr, size))
    yield next_
    # overlap for initial iteration
    prev = next_[-overlap:] if overlap else ()

    # For long lists the repeated calls to a lambda are slow, but using
    # the 2-argument form of `iter()` is in general a nice trick.
    #for chunk in iter(lambda: tuple(islice(itr, size - overlap)), ()):

    while True:
        chunk = tuple(islice(itr, size - overlap))

        if not chunk:
            break

        next_ = (*prev, *chunk)
        yield next_

        # overlap == 0 is a special case
        if overlap:
            prev = next_[-overlap:]
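
A short usage sketch for the iterator version, assuming Python 3.5+ for the `(*prev, *chunk)` unpacking. The function body is repeated so the example runs on its own; `itertools.count()` stands in for an endless input, and `islice` limits how many chunks are pulled from it.

```python
from itertools import count, islice


def itr_split_overlap(iterable, size, overlap):
    itr = iter(iterable)
    next_ = tuple(islice(itr, size))
    yield next_
    prev = next_[-overlap:] if overlap else ()
    while True:
        chunk = tuple(islice(itr, size - overlap))
        if not chunk:
            break
        next_ = (*prev, *chunk)
        yield next_
        if overlap:
            prev = next_[-overlap:]


# count() never ends, so only the first three chunks are taken lazily.
first_chunks = list(islice(itr_split_overlap(count(), 4, 2), 3))
print(first_chunks)  # [(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7)]

# A generator has no len() and cannot be sliced, but it works here as well.
squares = (n * n for n in range(6))
print(list(itr_split_overlap(squares, 3, 1)))  # [(0, 1, 4), (4, 9, 16), (16, 25)]
```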
Ilja Everilä
4

If every chunk must have exactly the requested size (leftover items at the end that cannot fill a chunk are discarded)

You can write a custom function using zip and a list comprehension:

def split_overlap(seq, size, overlap):
    return list(zip(*[seq[i::size-overlap] for i in range(size)]))

Sample Run:

# Chunk size: 3
# Overlap: 2 
>>> split_overlap(list(range(10)), 3, 2)
[(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]

# Chunk size: 3
# Overlap: 1
>>> split_overlap(list(range(10)), 3, 1)
[(0, 1, 2), (2, 3, 4), (4, 5, 6), (6, 7, 8)]

# Chunk size: 4
# Overlap: 1
>>> split_overlap(list(range(10)), 4, 1)
[(0, 1, 2, 3), (3, 4, 5, 6), (6, 7, 8, 9)]

# Chunk size: 4
# Overlap: 2
>>> split_overlap(list(range(10)), 4, 2)
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]

# Chunk size: 4
# Overlap: 3
>>> split_overlap(list(range(10)), 4, 3)
[(0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (3, 4, 5, 6), (4, 5, 6, 7), (5, 6, 7, 8), (6, 7, 8, 9)]

If trailing chunks shorter than the chunk size should also be returned

If you want to keep the chunks even when they don't reach the requested chunk size, use itertools.zip_longest in Python 3.x (the equivalent of itertools.izip_longest in Python 2.x).

This variant is also a generator that yields the chunks lazily, which is more memory-efficient if you have a huge list:

try:
    from itertools import zip_longest as iterzip   # Python 3.x
except ImportError:
    from itertools import izip_longest as iterzip  # Python 2.x

# Generator function
def split_overlap(seq, size, overlap):
    for x in iterzip(*[seq[i::size-overlap] for i in range(size)]):
        # `iterzip` pads short chunks with `None`, so this assumes that
        # the input sequence does not itself contain `None`
        yield tuple(i for i in x if i is not None) if x[-1] is None else x

Sample Run:

#     v  type-cast to list in order to display the result, 
#     v  not required during iterations
>>> list(split_overlap(list(range(10)),7,2))
[(0, 1, 2, 3, 4, 5, 6), (5, 6, 7, 8, 9)]
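
If the input may contain None, one possible workaround is to pad with a private sentinel object instead. This is a sketch for Python 3 only, not part of the answer: the names `_MISSING` and `split_overlap_sentinel` are made up for illustration. Like the generator above, it can still emit a trailing chunk that consists only of items already seen in the previous chunk.

```python
from itertools import zip_longest

_MISSING = object()  # hypothetical private sentinel; cannot occur in user data


def split_overlap_sentinel(seq, size, overlap):
    # Same slicing trick as above, but short chunks are padded with _MISSING
    # instead of None, so None values in the input are preserved.
    columns = [seq[i::size - overlap] for i in range(size)]
    for chunk in zip_longest(*columns, fillvalue=_MISSING):
        if chunk[-1] is _MISSING:
            chunk = tuple(item for item in chunk if item is not _MISSING)
        yield chunk


print(list(split_overlap_sentinel([None, 1, None, 3, 4], 3, 1)))
# [(None, 1, None), (None, 3, 4), (4,)]
```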
Moinuddin Quadri
  • I like that one! – James Schinner Jan 22 '18 at 13:20
  • Though it doesn't return the expected result, with an uneven split. `list(split_overlap([0]*10,7,2)) == [(0, 0, 0, 0, 0, 0, 0)]` **!=** `[(0, 0, 0, 0, 0, 0, 0), (0, 0, 0, 0, 0, 0, 0)]` – James Schinner Jan 22 '18 at 13:33
  • @JamesSchinner Correct, because the next chunk doesn't meet the required size of `7` in this case. After the first chunk, `3` elements remain, and with the allowed overlap of `2`, `5` elements are eligible for the second chunk. Since the required chunk size is `7`, it is skipped – Moinuddin Quadri Jan 22 '18 at 13:39
  • I only mention it because, OP made the same comment about my answer. Easy fix though, `zip_longest` – James Schinner Jan 22 '18 at 13:40
  • @JamesSchinner In that case the OP is wrong in their comment. It is the expected behavior for the program based on the requirement mentioned in the question. – Moinuddin Quadri Jan 22 '18 at 13:43
1

Your approach is about as good as it will get: you need to poll the sequence/iterable and build the chunks. In any case, here is a lazy version that works with iterables and uses a deque for performance:

from collections import deque

def split_overlap(iterable, size, overlap=0):
    size = int(size)
    overlap = int(overlap)
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError('size must be >= 1 and 0 <= overlap < size')
    pops = size - overlap
    q = deque(maxlen=size)
    for elem in iterable:
        q.append(elem)
        if len(q) == size:
            yield tuple(q)
            for _ in range(pops):
                q.popleft()
    # Yield final incomplete tuple if necessary
    if len(q) > overlap:
        yield tuple(q)

>>> list(split_overlap(range(10), 4, 2))
[(0, 1, 2, 3), (2, 3, 4, 5), (4, 5, 6, 7), (6, 7, 8, 9)]
>>> list(split_overlap(range(10), 5, 2))
[(0, 1, 2, 3, 4), (3, 4, 5, 6, 7), (6, 7, 8, 9)]

Note: as written, the generator yields one last incomplete tuple if the input does not split into an exact number of chunks (see the second example). To avoid this, remove the final if len(q) > overlap: yield tuple(q).
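
Instead of deleting those lines, the choice could be exposed as an argument. The sketch below assumes a hypothetical `keep_partial` flag and the name `split_overlap_deque`; neither is part of the code above.

```python
from collections import deque


def split_overlap_deque(iterable, size, overlap=0, keep_partial=True):
    # keep_partial is a hypothetical flag added for illustration only.
    if size < 1 or overlap < 0 or overlap >= size:
        raise ValueError('need size >= 1 and 0 <= overlap < size')
    pops = size - overlap
    q = deque(maxlen=size)
    for elem in iterable:
        q.append(elem)
        if len(q) == size:
            yield tuple(q)
            for _ in range(pops):
                q.popleft()
    # Emit the leftover items only when asked to, and only if they hold
    # something beyond the overlap that was already yielded.
    if keep_partial and len(q) > overlap:
        yield tuple(q)


print(list(split_overlap_deque(range(10), 5, 2)))
# [(0, 1, 2, 3, 4), (3, 4, 5, 6, 7), (6, 7, 8, 9)]
print(list(split_overlap_deque(range(10), 5, 2, keep_partial=False)))
# [(0, 1, 2, 3, 4), (3, 4, 5, 6, 7)]
```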

jdehesa
0

You can try using

itertools.izip(...)

which is good for large lists because it returns an iterator instead of a list. Note that izip exists only in Python 2; in Python 3 the built-in zip is already lazy.

Like this:

import itertools

# itertools.izip only exists in Python 2; fall back to the built-in zip on Python 3
izip = getattr(itertools, 'izip', zip)

def split_overlap(iterable, size, overlap):
    '''(seq,int,int) => [(...),(...),...]
    Split a sequence into chunks of a specific size and overlap.
    Trailing items that do not fill a complete chunk are dropped.
    Works also on strings!

    Examples:
        >>> split_overlap(iterable=list(range(10)),size=3,overlap=2)
        [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]

        >>> split_overlap(iterable=range(10),size=3,overlap=2)
        [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8), (7, 8, 9)]
    '''
    if size < 1 or overlap < 0:
        raise ValueError('"size" must be an integer >= 1 while "overlap" must be >= 0')
    result = []
    for i in izip(*[iterable[i::size-overlap] for i in range(size)]):
        result.append(i)
    return result
Aviad Levy