Python collection of different sized arrays (Jagged arrays), Dask?

Question

I have multiple 1-D NumPy arrays of different sizes representing audio data. Because their sizes differ (e.g. (8200,), (13246,), (61581,)), I cannot stack them into a single NumPy array. The size differences are too large for zero-padding to be practical.

I can keep them in a list or dictionary and iterate over them with for-loops, but I would prefer a NumPy-style approach: calling a NumPy function on a single variable, without having to write a for-loop. Something like:

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
np_mix = irregular_stack(np0, np1)
np.sum(np_mix)
# output: [-0.7, 0.09999999999999998]

Looking at this Dask picture, I was wondering if I can do what I want with Dask.

My attempt so far is this:

import numpy as np
import dask.array as da

np0 = np.array([.2, -.4, -.5])
arr0 = da.from_array(np0, chunks=(3,))
np1 = np.array([-.8, .9])
arr1 = da.from_array(np1, chunks=(2,))

# stack them
data = [[arr0],
        [arr1]]

x = da.block(data)
x.compute()

# output: ValueError: ('Shapes do not align: %s', [(1, 3), (1, 2)])

Questions

Am I misunderstanding how Dask can be used?
If it's possible, how do I do my np.sum() example?
If it's possible, is it actually faster than a for-loop on a high-end single PC?

score 3 · Accepted Answer · answered Nov 20 '19 at 03:04

I found the library awkward-array (https://github.com/scikit-hep/awkward-array), which allows for different length arrays and can do what I asked for:

import numpy as np
import awkward

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])
varlen = awkward.fromiter([np0, np1])
# <JaggedArray [[0.2 -0.4 -0.5] [-0.8 0.9]] at 0x7f01a743e790>

varlen.sum()
# output: array([-0.7,  0.1])

The library describes itself as: "Manipulate arrays of complex data structures as easily as Numpy."

So far, it seems to satisfy everything I need.
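For readers who want to see the general idea without installing anything extra, here is a minimal sketch using only NumPy. It stores all values in one flat buffer plus the start offset of each sub-array, then reduces each segment with `np.add.reduceat`. The helper name `to_flat` is made up for this illustration and is not part of the awkward-array API, and the sketch assumes that no sub-array is empty.

```python
import numpy as np

def to_flat(arrays):
    """Hypothetical helper: concatenate 1-D arrays and record where each starts."""
    lengths = np.array([len(a) for a in arrays])
    # Start index of each sub-array inside the flat buffer.
    # Assumption: every sub-array is non-empty, because reduceat
    # does not return 0 for an empty segment.
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    return np.concatenate(arrays), offsets

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])

flat, offsets = to_flat([np0, np1])

# One vectorized call sums each segment flat[offsets[i]:offsets[i+1]].
sums = np.add.reduceat(flat, offsets)
print(sums)
```

The only Python-level loop left is over the number of arrays (to read their lengths), not over the samples, so the per-element work stays inside NumPy.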

score 0 · Answer 2 · answered Nov 19 '19 at 01:15


Unfortunately, Dask arrays follow Numpy semantics, and assume that all rows are of equal length.

I don't know of a good library in Python that efficiently handles ragged arrays today, so you may be out of luck.
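As a small illustration of the equal-length assumption in NumPy itself (a sketch, not something from the original answer): arrays of different lengths can only be combined into an object array, which is a 1-D container of references rather than a 2-D block, so reductions still go through a Python-level loop.

```python
import numpy as np

np0 = np.array([.2, -.4, -.5])
np1 = np.array([-.8, .9])

# With dtype=object NumPy stores two references to separate arrays,
# not a rectangular (2, n) block of floats.
ragged = np.array([np0, np1], dtype=object)
print(ragged.shape)  # (2,)

# Per-row reductions still need an explicit loop over the rows.
sums = np.array([row.sum() for row in ragged])
print(sums)
```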

MRocklin


It seems that the library in my answer, `awkward-array`, has plans for interoperability with Dask from `v1.0.0`: https://github.com/scikit-hep/awkward-1.0 – NumesSanguis Nov 20 '19 at 03:06
