
I was wondering if somebody could help me understand the way Bag objects handle partitions. Put simply, I am trying to group items currently in a Bag so that each group is in its own partition. What's confusing me is that the Bag.groupby() method asks for a number of partitions. Shouldn't this be implied by the grouping function? E.g., two partitions if the grouping function returns a boolean?

>>> import dask.bag
>>> a = dask.bag.from_sequence(range(20), npartitions=1)
>>> a.npartitions
1
>>> b = a.groupby(lambda x: x % 2 == 0)
>>> b.npartitions
1

I'm obviously missing something here. Is there a way to group Bag items into separate partitions?

ajmazurie

1 Answer


Dask Bag may place several groups within a single partition; keys are hashed to partitions, so there is no one-to-one correspondence between groups and partitions.

In [1]: import dask.bag as db

In [2]: b = db.range(10, npartitions=3).groupby(lambda x: x % 5)

In [3]: partitions = b.to_delayed()

In [4]: partitions
Out[4]: 
[Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 0)),
 Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 1)),
 Delayed(('groupby-collect-f00b0aed94fd394a3c61602f5c3a4d42', 2))]

In [5]: for part in partitions:
   ...:     print(part.compute())
   ...:     
[(0, [0, 5]), (3, [3, 8])]
[(1, [1, 6]), (4, [4, 9])]
[(2, [2, 7])]
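If you really do want one partition per group, one way (a sketch, and only sensible when the grouped result is small enough to collect) is to compute the groups and rebuild a bag with `db.from_delayed`, wrapping each `(key, values)` pair as its own single-element partition:

```python
import dask
import dask.bag as db

b = db.range(10, npartitions=3).groupby(lambda x: x % 5)

# Collect the grouped (key, values) pairs, then rebuild a bag with
# exactly one partition per group.
groups = b.compute()  # small data only: this materializes everything
parts = [dask.delayed([kv]) for kv in groups]
c = db.from_delayed(parts)

print(c.npartitions)  # one partition per group
```

This round-trips through the scheduler, so it defeats much of the parallelism; it is a workaround rather than an idiomatic pattern.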
MRocklin
  • Indeed, that's what's confusing me. How can I have each group in its own partition? Can I re-shuffle the partitions (and probably change their number)? – ajmazurie Feb 22 '17 at 22:06
  • Trying to isolate each group to a separate partition sounds odd to me. Perhaps there is another way to accomplish what you are trying to do. Can you explain why this is your objective in your original question? – MRocklin Feb 23 '17 at 13:37
  • A typical use case is having partitions represent distinct subsets of the overall dataset, to be handled independently (while obviously sharing the same schema across partitions). Right now partitions are only presented as internal artifacts used to optimize parallelization, versus the Apache Spark approach where they can also represent logical subsets of a bigger dataset. As an example, one can consider timestamped `dask.bag` items, which you group into partitions based on them sharing the same date. Having one partition per day facilitates downstream analyses and storage. – ajmazurie Feb 24 '17 at 04:10
  • If your data fits the Pandas model then you might want to try dask.dataframe, which has a proper index that would sort data accordingly. – MRocklin Feb 24 '17 at 12:35