Context: The Dask documentation states clearly that Bag.take() only collects from the first partition. However, after a filter the first partition can be empty while others are not.

Question: Is it possible to use Bag.take() so that it collects from enough partitions to gather n items (or as many as are available, if there are fewer than n)?

JMann

1 Answer

You could do something like the following:

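from toolz import concat, take

n = 5  # number of items wanted; b is an existing dask.bag.Bag
per_partition = lambda seq: list(take(n, seq))          # first n items of one partition
aggregate = lambda parts: list(take(n, concat(parts)))  # flatten the partial lists, keep the first n
b.reduction(per_partition, aggregate).compute()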

This grabs the first n elements of each partition, collects them all together, and then takes the first n elements of the result.
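
To make the two steps concrete, here is a dependency-free sketch that mimics them with plain Python lists standing in for partitions. The names `take_n` and `take_across_partitions` and the sample data are made up for illustration; this is not Dask's implementation, only the same per-partition-then-aggregate idea.

```python
from itertools import chain, islice


def take_n(n, seq):
    """Return up to the first n items of seq as a list."""
    return list(islice(seq, n))


def take_across_partitions(n, partitions):
    """Mimic a per-partition take followed by a flattening aggregate take."""
    # Per-partition step: each partition contributes at most n items.
    partials = [take_n(n, part) for part in partitions]
    # Aggregate step: flatten the partial lists, then keep the first n overall.
    return take_n(n, chain.from_iterable(partials))


# Hypothetical partitions: the first one is empty, as can happen after a filter.
partitions = [[], [5, 6, 7], [10], [15, 16, 17, 18]]

result = take_across_partitions(4, partitions)
print(result)  # [5, 6, 7, 10]
```

Note that the aggregate step has to flatten before taking, since it receives one list per partition rather than a single stream of items. More recent Dask releases also accept an `npartitions` argument on `Bag.take` (with `-1` meaning all partitions), which may be simpler if your version has it.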

MRocklin
  • This didn't work! Using toolz version 0.8.0, dask 0.10, Python 3.5.2: `tb = db.from_sequence(range(20), npartitions=4)` followed by `tb.reduction(take(2), take(2))` gives PicklingError: Cannot pickle objects of type - I do appreciate your help. – JMann Jul 08 '16 at 00:15
  • Ah, pickling is odd. Perhaps use `f = lambda seq: list(take(n, seq))` and `b.reduction(f, f)` instead. – MRocklin Jul 08 '16 at 05:10
  • This did it. And yes, I've read elsewhere of the charms of pickle; alarmingly, it was in a post now several years old. – JMann Jul 11 '16 at 02:47