I need to convert a dask.Bag of {'images': np.array(img_list), 'labels': np.array(label_list)} dicts into two separate dask.Array-s.
Why did I create a Bag instead of going directly for an Array? Because I'm processing the data multiple times through map(), and I didn't manage to do the same with an Array.
Now, the following code works for small datasets, but it clearly fails for bigger data: iterating the Bag computes it, so the whole dataset is materialized as a single in-memory NumPy array.
import numpy as np
import dask.array as da

images_array = da.from_array(np.array([item['images'] for item in imgs_labels_bag]), chunks=chunksize)
labels_array = da.from_array(np.array([item['labels'] for item in imgs_labels_bag]), chunks=chunksize)
How can I do this without first materializing everything as NumPy arrays?
Ideas:
I've tried Bag -> Delayed -> Array, but it didn't work because of problems with the resulting array shape (a sketch of what I attempted is below).
An option might be to dump the Bag to disk as text files and then read it back as a dask.DataFrame/Array. Example:
imgs_labels_bag.map(json.dumps).to_textfiles("/path/to/data/*.json")
Instead of having one Bag of dicts, I could create two Bags of np.array each and then try Bag -> Delayed -> Array for each of them (sketch below).
Any other ideas?