
I need to convert a dask.Bag of {'images': np.array(img_list), 'labels': np.array(label_list)} into two separate dask.Arrays. Why did I create a Bag instead of going directly to an Array? Because I'm processing that Bag multiple times through map(); I didn't manage to do the same with an Array.
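For concreteness, a minimal sketch of such a Bag (shapes as in the comments below; the data is just placeholder zeros):

import dask.bag as db
import numpy as np

# Each item holds one batch of images and its labels (placeholder data).
records = [{'images': np.zeros((160, 200, 200, 3), dtype=np.uint8),
            'labels': np.zeros(160, dtype=np.uint8)}
           for _ in range(4)]
imgs_labels_bag = db.from_sequence(records, npartitions=2)

# The Bag is re-processed several times through map(), e.g.:
flipped = imgs_labels_bag.map(lambda d: {'images': d['images'][:, :, ::-1],
                                         'labels': d['labels']})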

Now, the following code works for small datasets but clearly fails for bigger ones, since it first materializes the whole Bag as an in-memory NumPy array.

images_array = da.from_array(np.array([item['images'] for item in imgs_labels_bag]), chunks=chunksize)
labels_array = da.from_array(np.array([item['labels'] for item in imgs_labels_bag]), chunks=chunksize)

How can I do this without converting the objects into NumPy arrays first?

Ideas:

  1. I've tried Bag -> Delayed -> Array, but it didn't work because of a problem with the array shape.

  2. An option might be to dump the Bag onto disk as text files and then read it back as a dask.DataFrame/Array. Example: b_dict.map(json.dumps).to_textfiles("/path/to/data/*.json") (a rough sketch of this round trip follows the list).

  3. Instead of having a Bag of dicts I could have 2 Bags of np.array each and then try Bag -> Delayed -> Array.
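
A rough sketch of the round trip in idea 2, assuming `b_dict` is the Bag of dicts from above. Note that numpy arrays are not JSON-serializable directly, so they are converted to nested lists first, which is slow for large arrays:

import json
import numpy as np
import dask.bag as db

def to_json(d):
    # numpy arrays are not JSON-serializable; convert them to nested lists.
    return json.dumps({k: v.tolist() for k, v in d.items()})

b_dict.map(to_json).to_textfiles("/path/to/data/*.json")

# Read it back as a Bag of dicts of numpy arrays:
restored = (db.read_text("/path/to/data/*.json")
              .map(json.loads)
              .map(lambda d: {k: np.array(v) for k, v in d.items()}))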

Any other ideas?

w00dy

2 Answers


If each item['images'] is a 1D numpy array, and you want to stack them in the following way:

+---------------+
|item0['images']|
+---------------+
|item1['images']|
+---------------+
|item2['images']|
+---------------+    

Then this can work (see the dask.array.stack documentation):

import dask.bag as db
import numpy as np
import dask.array as da

# Each bag item holds a length-10 array; stacking gives a (4, 10) dask array.
# Note: iterating the bag in the list comprehension pulls the items eagerly.
b = db.from_sequence([{'img': np.arange(10)}] * 4)
s = da.stack([item['img'] for item in b], axis=0)
print(s.compute())

Result:

[[0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]
 [0 1 2 3 4 5 6 7 8 9]]
R zu
  • For each item, is item['images'] a dask array or a numpy array? – R zu Nov 29 '17 at 04:51
  • Each Bag item is a dict with two numpy arrays. `item['images'].shape` is (160, 200, 200, 3) and `item['labels'].shape` is (160, ). – w00dy Nov 29 '17 at 08:38
  • Your solution works but I'm doubtful regarding its performance. If `b` has 1 million items, the list comprehension will loop through 1m items in sequential order, right? – w00dy Nov 29 '17 at 11:08
  • I doubt the performance will be good either. The list comprehension will definitely loop through the dictionaries, though hopefully not through the arrays inside them. I don't know enough about bags to speed this up. – R zu Nov 29 '17 at 15:24
  • If you have multiple arrays to calculate and they all depend on the same intermediate, you can calculate all of the arrays at the same time by da.compute(x, y, z,...) http://dask.pydata.org/en/latest/scheduler-overview.html That would avoid calculating the same intermediate multiple times. Not sure if the same can be done for bags. – R zu Nov 29 '17 at 16:04
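
For reference, a minimal illustration of the shared-intermediate pattern from the last comment above (shapes are arbitrary):

import dask.array as da

x = da.ones((1000, 1000), chunks=(500, 500))
intermediate = x + 1               # shared by both results below
total = intermediate.sum()
mean = intermediate.mean()
# A single compute() call evaluates the shared intermediate only once.
total_val, mean_val = da.compute(total, mean)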

I recommend the following steps:

  1. Making two bags of numpy arrays (you'll have to use map or pluck to pull out your images and labels values)
  2. Using .map_partitions(np.stack) or .map_partitions(np.concatenate) (depending on the shapes you care about) to turn each of your partitions into a single numpy array
  3. Turning your partitions into dask.delayed objects with .to_delayed
  4. Turning each of these delayed objects into dask.arrays by calling dask.array.from_delayed on each one
  5. Stacking or concatenating these dask arrays into a single dask.array using da.stack or da.concatenate
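
A minimal sketch of these five steps, reusing the shapes from the question's comments (the helper names are illustrative, not part of the answer):

import numpy as np
import dask.array as da
import dask.bag as db

records = [{'images': np.zeros((160, 200, 200, 3), dtype=np.uint8),
            'labels': np.zeros(160, dtype=np.uint8)}
           for _ in range(4)]
bag = db.from_sequence(records, npartitions=2)

# Step 1: a bag holding only the image arrays.
imgs_bag = bag.pluck('images')

# Step 2: fuse each partition (here, 2 items) into a single numpy array.
stacked = imgs_bag.map_partitions(lambda part: np.concatenate(list(part), axis=0))

# Step 3: one dask.delayed object per partition.
parts = stacked.to_delayed()

# Step 4: wrap each delayed partition as a dask array
# (2 items x 160 frames per partition -> 320 rows each).
arrays = [da.from_delayed(p, shape=(320, 200, 200, 3), dtype=np.uint8)
          for p in parts]

# Step 5: concatenate into one dask array.
images_array = da.concatenate(arrays, axis=0)
print(images_array.shape)  # (640, 200, 200, 3)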
MRocklin
  • step (4) fails with `AttributeError: 'list' object has no attribute 'key'` when doing: `da.from_delayed(splitted_imgs_bag.to_delayed(), dataset_shape, dtype=np.uint8)`. By running `print(list(splitted_imgs_bag.to_delayed()))` it prints: `[Delayed(('pluck-5d54b90928938c55156c742068524a2d', 0))]`. It seems that `to_delayed` does not produce an object with a key attribute, as expected by `da.from_delayed`. – w00dy Dec 05 '17 at 12:38
  • da.from_delayed takes a single delayed object. You'll have to create many small dask arrays and then stack them together in step five. – MRocklin Dec 05 '17 at 13:17
  • Ok I see. Here's how I managed to do it: `images_array = da.concatenate([da.from_delayed(delayed(item[0]), shape=(nr_boxes_per_img, 200, 200, 3), dtype=np.uint8) for item in splitted_imgs_bag], axis=0)`. As I find it way too complex and still based on a list-comprehension loop, I'll accept @r-zu's answer as the best one. Thanks a lot for your help. – w00dy Dec 06 '17 at 14:10
  • @MRocklin I'm trying to understand the RAM implications of this in a distributed setting. Suppose you have a bag of np arrays, each in their own partition, each residing on the resources of a different worker: `bag = db.from_sequence(inputs, npartitions=len(inputs)); list_of_delays = bag.map(returns_nparray).to_delayed(); list_of_darrays = [da.from_delayed(dd, my_shape, my_dtype) for dd in list_of_delays]; final_darray = da.stack(list_of_darrays, axis=0);` It should be the case that the array data never reduces to the primary process, is that correct? – NLi10Me Jul 14 '20 at 20:16