
I need to convert a dask.Bag of {'images': np.array(img_list), 'labels': np.array(label_list)} into two separate dask.Arrays. Why did I create a Bag instead of going directly to an Array? Because I'm processing that Bag multiple times through map(); I didn't manage to do the same with an Array.
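For concreteness, a minimal sketch of such a Bag (shapes as in the comments below; the data is just placeholder zeros):

import dask.bag as db
import numpy as np

# Each item holds one batch of images and its labels (placeholder data).
records = [{'images': np.zeros((160, 200, 200, 3), dtype=np.uint8),
            'labels': np.zeros(160, dtype=np.uint8)}
           for _ in range(4)]
imgs_labels_bag = db.from_sequence(records, npartitions=2)

# The Bag is re-processed several times through map(), e.g.:
flipped = imgs_labels_bag.map(lambda d: {'images': d['images'][:, :, ::-1],
                                         'labels': d['labels']})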

Now, the following code works for small datasets but clearly fails for bigger ones, since it first materializes the whole Bag as an in-memory NumPy array.

images_array = da.from_array(np.array([item['images'] for item in imgs_labels_bag]), chunks=chunksize)
labels_array = da.from_array(np.array([item['labels'] for item in imgs_labels_bag]), chunks=chunksize)

How can I do this without converting the objects into NumPy arrays first?

Ideas:

  1. I've tried Bag -> Delayed -> Array, but it didn't work because of a problem with the array shape.

  2. An option might be to dump the Bag onto disk as text files and then read it back as a dask.DataFrame/Array. Example: b_dict.map(json.dumps).to_textfiles("/path/to/data/*.json") (a rough sketch of this round trip follows the list).

  3. Instead of having a Bag of dicts I could have 2 Bags of np.array each and then try Bag -> Delayed -> Array.
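
A rough sketch of the round trip in idea 2, assuming `b_dict` is the Bag of dicts from above. Note that numpy arrays are not JSON-serializable directly, so they are converted to nested lists first, which is slow for large arrays:

import json
import numpy as np
import dask.bag as db

def to_json(d):
    # numpy arrays are not JSON-serializable; convert them to nested lists.
    return json.dumps({k: v.tolist() for k, v in d.items()})

b_dict.map(to_json).to_textfiles("/path/to/data/*.json")

# Read it back as a Bag of dicts of numpy arrays:
restored = (db.read_text("/path/to/data/*.json")
              .map(json.loads)
              .map(lambda d: {k: np.array(v) for k, v in d.items()}))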

Any other ideas?

w00dy

2 Answers


If each item['images'] is a 1D numpy array, and you want to stack them in the following way:

+---------------+
|item0['images']|
+---------------+
|item1['images']|
+---------------+
|item2['images']|
+---------------+    

Then this can work (see the dask.array.stack documentation):

import dask.bag as db
import numpy as np
import dask.array as da

# Each bag item holds a length-10 array; stacking gives a (4, 10) dask array.
# Note: iterating the bag in the list comprehension pulls the items eagerly.
b = db.from_sequence([{'img': np.arange(10)}] * 4)
s = da.stack([item['img'] for item in b], axis=0)
print(s.compute())

Result:

[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]
R zu
  • For each item, is item['images'] a dask array or a numpy array? – R zu Nov 29 '17 at 04:51
  • Each Bag item is a dict with two numpy arrays. `item['images'].shape` is (160, 200, 200, 3) and `item['labels'].shape` is (160, ). – w00dy Nov 29 '17 at 08:38
  • Your solution works but I'm doubtful regarding its performance. If `b` has 1 million items, the list comprehension will loop through 1m items in sequential order, right? – w00dy Nov 29 '17 at 11:08
  • I doubt the performance will be good either. The list comprehension will definitely loop through the dictionaries, though hopefully not through the arrays inside them. I don't know enough about bags to speed this up. – R zu Nov 29 '17 at 15:24
  • If you have multiple arrays to calculate and they all depend on the same intermediate, you can calculate all of the arrays at the same time by da.compute(x, y, z,...) http://dask.pydata.org/en/latest/scheduler-overview.html That would avoid calculating the same intermediate multiple times. Not sure if the same can be done for bags. – R zu Nov 29 '17 at 16:04
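
For reference, a minimal illustration of the shared-intermediate pattern from the last comment above (shapes are arbitrary):

import dask.array as da

x = da.ones((1000, 1000), chunks=(500, 500))
intermediate = x + 1               # shared by both results below
total = intermediate.sum()
mean = intermediate.mean()
# A single compute() call evaluates the shared intermediate only once.
total_val, mean_val = da.compute(total, mean)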

I recommend the following steps:

  1. Making two bags of numpy arrays (you'll have to use map or pluck to pull out your images and labels values)
  2. Using .map_partitions(np.stack) or .map_partitions(np.concatenate) (depending on the shapes you care about) to turn each of your partitions into a single numpy array
  3. Turning your partitions into dask.delayed objects with .to_delayed
  4. Turning each of these delayed objects into dask.arrays by calling dask.array.from_delayed on each one
  5. Stacking or concatenating these dask arrays into a single dask.array using da.stack or da.concatenate
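
A minimal sketch of these five steps, reusing the shapes from the question's comments (the helper names are illustrative, not part of the answer):

import numpy as np
import dask.array as da
import dask.bag as db

records = [{'images': np.zeros((160, 200, 200, 3), dtype=np.uint8),
            'labels': np.zeros(160, dtype=np.uint8)}
           for _ in range(4)]
bag = db.from_sequence(records, npartitions=2)

# Step 1: a bag holding only the image arrays.
imgs_bag = bag.pluck('images')

# Step 2: fuse each partition (here, 2 items) into a single numpy array.
stacked = imgs_bag.map_partitions(lambda part: np.concatenate(list(part), axis=0))

# Step 3: one dask.delayed object per partition.
parts = stacked.to_delayed()

# Step 4: wrap each delayed partition as a dask array
# (2 items x 160 frames per partition -> 320 rows each).
arrays = [da.from_delayed(p, shape=(320, 200, 200, 3), dtype=np.uint8)
          for p in parts]

# Step 5: concatenate into one dask array.
images_array = da.concatenate(arrays, axis=0)
print(images_array.shape)  # (640, 200, 200, 3)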
MRocklin
  • step (4) fails with `AttributeError: 'list' object has no attribute 'key'` when doing: `da.from_delayed(splitted_imgs_bag.to_delayed(), dataset_shape, dtype=np.uint8)`. By running `print(list(splitted_imgs_bag.to_delayed()))` it prints: `[Delayed(('pluck-5d54b90928938c55156c742068524a2d', 0))]`. It seems that `to_delayed` does not produce an object with a key attribute, as expected by `da.from_delayed`. – w00dy Dec 05 '17 at 12:38
  • da.from_delayed takes a single delayed object. You'll have to create many small dask arrays and then stack them together in step five. – MRocklin Dec 05 '17 at 13:17
  • Ok I see. Here's how I managed to do it: `images_array = da.concatenate([da.from_delayed(delayed(item[0]), shape=(nr_boxes_per_img, 200, 200, 3), dtype=np.uint8) for item in splitted_imgs_bag], axis=0)`. As I find it way too complex and still based on a list-comprehension loop, I'll accept @r-zu's answer as the best one. Thanks a lot for your help. – w00dy Dec 06 '17 at 14:10
  • @MRocklin I'm trying to understand the RAM implications of this in a distributed setting. Suppose you have a bag of np arrays, each in their own partition, each residing on the resources of a different worker: `bag = db.from_sequence(inputs, npartitions=len(inputs)); list_of_delays = bag.map(returns_nparray).to_delayed(); list_of_darrays = [da.from_delayed(dd, my_shape, my_dtype) for dd in list_of_delays]; final_darray = da.stack(list_of_darrays, axis=0);` It should be the case that the array data never reduces to the primary process, is that correct? – NLi10Me Jul 14 '20 at 20:16