9

I have a list, which is made up of the following elements,

list1 = [a1,a2,a3]

Where each element of this list can itself be a variable-size list, e.g.,

a1 = [x1,y1,z1], a2 = [w2,x2,y2,z2], a3 = [p3,r3,t3,n3]

It's straightforward for me to set up a generator that loops through list1 and yields the constituents of each element:

def my_generator(list1):
    array = []
    for i in list1:
        for j in i:
            array.append(j)
            yield array

However, is there a way of doing this so I can specify the size of array?

eg - batch size of two;

1st yield : [x1,y1]
2nd yield : [z1,w2]
3rd yield : [x2,y2]
4th yield : [z2,p3]
5th yield : [r3,t3]
6th yield : [n3]
7th yield : repeat 1st

or batch size of 4;

1st yield : [x1,y1,z1,w2]
2nd yield : [x2,y2,z2,p3]
3rd yield : [r3,t3,n3]
4th yield : repeat first

It seems non-trivial to carry this out when the outer list and each of the inner lists can all have different sizes.

obtmind

4 Answers

7

This is pretty easy, actually; use itertools:

>>> a1 = ['x1','y1','z1']; a2 = ['w2','x2','y2','z2']; a3 = ['p3','r3','t3','n3']
>>> list1 = [a1,a2,a3]
>>> from itertools import chain, islice
>>> flatten = chain.from_iterable
>>> def slicer(seq, n):
...     it = iter(seq)
...     return lambda: list(islice(it,n))
...
>>> def my_gen(seq_seq, batchsize):
...     for batch in iter(slicer(flatten(seq_seq), batchsize), []):
...         yield batch
...
>>> list(my_gen(list1, 2))
[['x1', 'y1'], ['z1', 'w2'], ['x2', 'y2'], ['z2', 'p3'], ['r3', 't3'], ['n3']]
>>> list(my_gen(list1, 4))
[['x1', 'y1', 'z1', 'w2'], ['x2', 'y2', 'z2', 'p3'], ['r3', 't3', 'n3']]
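As an aside (not from the answer above, just a note on the idiom it relies on): the two-argument form of iter(callable, sentinel) keeps calling the callable and yields each return value until one equals the sentinel, which is why slicer returns a zero-argument lambda and the empty list acts as the stop marker. A minimal standalone illustration:

>>> from itertools import islice
>>> it = iter(range(7))
>>> next_chunk = lambda: list(islice(it, 2))  # zero-argument callable returning the next chunk
>>> list(iter(next_chunk, []))                # stops once next_chunk() returns the sentinel []
[[0, 1], [2, 3], [4, 5], [6]]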

Note that we can use yield from in Python 3.3+:

>>> def my_gen(seq_seq, batchsize):
...   yield from iter(slicer(flatten(seq_seq), batchsize), [])
...
>>> list(my_gen(list1,2))
[['x1', 'y1'], ['z1', 'w2'], ['x2', 'y2'], ['z2', 'p3'], ['r3', 't3'], ['n3']]
>>> list(my_gen(list1,3))
[['x1', 'y1', 'z1'], ['w2', 'x2', 'y2'], ['z2', 'p3', 'r3'], ['t3', 'n3']]
>>> list(my_gen(list1,4))
[['x1', 'y1', 'z1', 'w2'], ['x2', 'y2', 'z2', 'p3'], ['r3', 't3', 'n3']]
>>>
juanpa.arrivillaga
  • How about if list1 is too big to fit in memory? In my case they are actually images that I am loading sequentially – obtmind Jul 17 '17 at 22:58
  • @obtmind um, that seems like a wholly unrelated question. What data-structure is holding that information? You are using *lists*? You probably want some sort of array instead. Lists are very, very memory heavy. – juanpa.arrivillaga Jul 17 '17 at 23:00
  • @obtmind anyway, as long as you can keep the intermediate ones in memory, it shouldn't be too hard to adapt the above to use some sort of generator of these images, and pass *that* generator as the `seq_seq` – juanpa.arrivillaga Jul 17 '17 at 23:01
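Following up on that comment thread, a rough sketch of what such an adaptation could look like; load_inner_sequence and sources are hypothetical placeholders (not a real API), and the point is only that my_gen can consume a lazy generator instead of an in-memory list of lists:

# Sketch only: load_inner_sequence() is a hypothetical placeholder for whatever
# loads one variable-length sequence (e.g. the images mentioned above).
def load_inner_sequence(src):
    ...

def lazy_seq_seq(sources):
    # Yields one inner sequence at a time, so the full data set is never in memory.
    for src in sources:
        yield load_inner_sequence(src)

# my_gen then only ever materialises the current batch:
# for batch in my_gen(lazy_seq_seq(sources), batchsize=2):
#     ...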
6

You could use itertools here; in your case I would use chain and islice:

import itertools
a1 = ['x1','y1','z1']
a2 = ['w2','x2','y2','z2'] 
a3 = ['p3','r3','t3','n3']
list1 = [a1,a2,a3]

def flatten_and_batch(lst, size):
    it = itertools.chain.from_iterable(lst)
    while True:
        res = list(itertools.islice(it, size))
        if not res:
            break
        else:
            yield res

list(flatten_and_batch(list1, 2))
# [['x1', 'y1'], ['z1', 'w2'], ['x2', 'y2'], ['z2', 'p3'], ['r3', 't3'], ['n3']]

list(flatten_and_batch(list1, 3))
# [['x1', 'y1', 'z1'], ['w2', 'x2', 'y2'], ['z2', 'p3', 'r3'], ['t3', 'n3']]
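The question also asks for the batches to start over from the first one after the short final batch has been yielded; that isn't covered above, but one simple way to get it (a sketch built on the flatten_and_batch generator just defined) is to restart it in an outer loop:

def repeat_batches(lst, size):
    # Endlessly restart flatten_and_batch, so after the short final batch
    # the sequence continues with the first batch again.
    while True:
        yield from flatten_and_batch(lst, size)

gen = repeat_batches(list1, 2)
[next(gen) for _ in range(8)]
# [['x1', 'y1'], ['z1', 'w2'], ['x2', 'y2'], ['z2', 'p3'], ['r3', 't3'], ['n3'],
#  ['x1', 'y1'], ['z1', 'w2']]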

If you don't mind an additional dependency, you could also use iteration_utilities.grouper here (although it returns tuples, not lists) [1]:

>>> from iteration_utilities import flatten, grouper, Iterable

>>> list(grouper(flatten(list1), 2))
[('x1', 'y1'), ('z1', 'w2'), ('x2', 'y2'), ('z2', 'p3'), ('r3', 't3'), ('n3',)]

>>> list(grouper(flatten(list1), 3))
[('x1', 'y1', 'z1'), ('w2', 'x2', 'y2'), ('z2', 'p3', 'r3'), ('t3', 'n3')]
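Since grouper yields tuples, wrapping the result in map(list, ...) gives the list-of-lists form from the question (as the comments below also note):

>>> list(map(list, grouper(flatten(list1), 2)))
[['x1', 'y1'], ['z1', 'w2'], ['x2', 'y2'], ['z2', 'p3'], ['r3', 't3'], ['n3']]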

or the iteration_utilities.Iterable:

>>> Iterable(list1).flatten().grouper(3).as_list()
[('x1', 'y1', 'z1'), ('w2', 'x2', 'y2'), ('z2', 'p3', 'r3'), ('t3', 'n3')]

>>> Iterable(list1).flatten().grouper(4).map(list).as_list()
[['x1', 'y1', 'z1', 'w2'], ['x2', 'y2', 'z2', 'p3'], ['r3', 't3', 'n3']]

[1] Disclaimer: I'm the author of that library.


Timings:

[Plot: benchmark timings of each approach vs. input size, log-log scale]

from itertools import chain, islice
flatten = chain.from_iterable
# imported under an alias so it doesn't shadow the flatten defined above
from iteration_utilities import flatten as iu_flatten, grouper, Iterable

def slicer(seq, n):
    it = iter(seq)
    return lambda: list(islice(it,n))

def my_gen(seq_seq, batchsize):
    for batch in iter(slicer(flatten(seq_seq), batchsize), []):
        yield batch

def flatten_and_batch(lst, size):
    it = flatten(lst)
    while True:
        res = list(islice(it, size))
        if not res:
            break
        else:
            yield res

def iteration_utilities_approach(seq, size):
    return grouper(iu_flatten(seq), size)

def partition(lst, c):
    all_elem = list(chain.from_iterable(lst))
    for k in range(0, len(all_elem), c):
        yield all_elem[k:k+c]


def juanpa(seq, size):
    return list(my_gen(seq, size))    
def mseifert1(seq, size):
    return list(flatten_and_batch(seq, size))   
def mseifert2(seq, size):
    return list(iteration_utilities_approach(seq, size))   
def JoelCornett(seq, size):
    return list(partition(seq, size))       

# Timing setup
timings = {juanpa: [], 
           mseifert1: [], 
           mseifert2: [], 
           JoelCornett: []}

sizes = [2**i for i in range(1, 18, 2)]

# Timing
for size in sizes:
    print(size)
    func_input = [['x1','y1','z1']]*size
    for func in timings:
        print(str(func))
        res = %timeit -o func(func_input, 3)
        timings[func].append(res)

%matplotlib notebook

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(1)
ax = plt.subplot(111)

for func in timings:
    ax.plot(sizes, 
            [time.best for time in timings[func]], 
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
MSeifert
  • Good test! I did not notice your second solution! – AGN Gazer Jul 17 '17 at 23:45
  • @AGNGazer But the second solution requires an external library and returns a list of tuples instead of a list of lists. That's why it's only solution 2. :) – MSeifert Jul 17 '17 at 23:51
  • Well, then you either rewrite your library to return lists or I will take my vote back :) – AGN Gazer Jul 17 '17 at 23:54
  • Now seriously, @JoelCornett's answer is truly a fast "general" solution. – AGN Gazer Jul 17 '17 at 23:56
  • The `iteration_utilities_approach` solution is still faster even if wrapped in a `map(list, ...)` (which returns the list of lists). However, there's one thing that's a bit of a "problem" with the `list(chain(...))` approach - it creates a complete list in memory while all other approaches are lazy. While that makes the solution very fast, it also makes it very memory-expensive. – MSeifert Jul 17 '17 at 23:59
3

It is relatively trivial if you break the task into two steps:

  1. Flatten the list.
  2. Emit chunks based on batch size.

Here is an example implementation:

from itertools import chain

def break_into_batches(items, batch_size):
    flattened = list(chain(*items))
    for i in range(0, len(flattened), batch_size):
        yield flattened[i:i+batch_size]
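For reference, a quick check against the question's batch-size-of-4 example, using the same string data as the other answers:

a1 = ['x1', 'y1', 'z1']
a2 = ['w2', 'x2', 'y2', 'z2']
a3 = ['p3', 'r3', 't3', 'n3']
list1 = [a1, a2, a3]

list(break_into_batches(list1, 4))
# [['x1', 'y1', 'z1', 'w2'], ['x2', 'y2', 'z2', 'p3'], ['r3', 't3', 'n3']]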
Joel Cornett
  • I would like to suggest that you add performance tests to your answer. – AGN Gazer Jul 17 '17 at 23:38
  • I am getting: `%timeit list(break_into_batches(arr, 4))`: 1000 loops, best of 3: **950 µs** per loop, `%timeit list(my_gen(arr, 4))`: 100 loops, best of 3: **3.4 ms** per loop, and `%timeit list(flatten_and_batch(arr, 4))`: 100 loops, best of 3: **3.12 ms** per loop. These tests were done using arr = `import random; [[random.randint(1, 100) for i in range(random.randint(1, 100))] for k in range(300)]` – AGN Gazer Jul 17 '17 at 23:40
1

Given the following objectives applied to a list:

  1. yield batches, each of a given size
  2. repeat this process some number of cycles

more_itertools can achieve these objectives as follows:

import more_itertools as mit


def batch(iterable, size=2, cycles=1):
    """Yield resized batches of an iterable."""
    iterable = mit.ncycles(iterable, cycles)
    return mit.chunked(mit.flatten(iterable), size)

list(batch(list1, 3))
# [["x1", "y1", "z1"], ["w2", "x2", "y2"], ["z2", "p3", "r3"], ["t3", "n3"]]


list(batch(list1, size=3, cycles=2))
# [["x1", "y1", "z1"], ["w2", "x2", "y2"], ["z2", "p3", "r3"],
#  ["t3", "n3", "x1"], ["y1", "z1", "w2"], ["x2", "y2", "z2"],
#  ["p3", "r3", "t3"], ["n3"]]

See the docs for details on each tool: ncycles, flatten and chunked.
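If the "repeat" in the question should go on indefinitely rather than for a fixed number of cycles, one possible variation (a sketch, not part of the answer above) is to feed an endless itertools.cycle into the same flatten/chunk pipeline; note that, like ncycles, this runs straight across the cycle boundary instead of restarting with a short batch:

import itertools
import more_itertools as mit

# Endless stream of batches; take only as many as needed.
endless = mit.chunked(mit.flatten(itertools.cycle(list1)), 3)
list(itertools.islice(endless, 5))
# [['x1', 'y1', 'z1'], ['w2', 'x2', 'y2'], ['z2', 'p3', 'r3'],
#  ['t3', 'n3', 'x1'], ['y1', 'z1', 'w2']]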

pylang