Is there an alternative, vectorized way to write the to_array function?

Question

Suppose we have a ragged, nested sequence like the following:

import numpy as np
x = np.ones((10, 20))
y = np.zeros((10, 20))
a = [[0, x], [y, 1]]

and want to create a full numpy array that broadcasts the ragged sub-sequences (to match the maximum dimension of any other sub-sequence, in this case (10,20)) where necessary. First, we might try to use np.array(a), which yields the warning:

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray

By changing to np.array(a, dtype=object), we do get an array. However, this is an array of objects rather than floats, and it retains the ragged sub-sequences, which have not been broadcast as desired. To fix this, I created a new function to_array which takes a (possibly ragged, nested) sequence and a shape and returns a full numpy array of that shape:

    def to_array(a, shape):
        a = np.array(a, dtype=object)
        b = np.empty(shape)
        for index in np.ndindex(a.shape):
            b[index] = a[index]
        return b
    
    b = np.array(a, dtype=object)
    c = to_array(a, (2, 2, 10, 20))
    
    print(b.shape, b.dtype) # prints (2, 2) object
    print(c.shape, c.dtype) # prints (2, 2, 10, 20) float64

Note that c, not b, is the desired result. However, to_array relies on a for loop over np.ndindex, and Python for loops are slow for big arrays.

Is there an alternative, vectorized way to write the to_array function?

"that broadcasts the ragged sub-sequences (to match the maximum dimension of any other sub-sequence, in this case (10,20)) where necessary. " Sorry, I can't follow this. Could you show a small example that includes the exact, complete desired output (small enough that you can produce it by hand, but big enough to make it clear what you're looking for)? — Karl Knechtel, Aug 11 '20 at 21:54

hpaulj · Answer 1 · 2020-08-12T04:03:05.653

Given the target shape, a few iterations don't seem overly expensive (with A = np.array(a, dtype=object), the (2, 2) object array from the question):

In [35]: C = np.empty((A.shape+x.shape), x.dtype)                                                    
In [36]: for idx in np.ndindex(A.shape): 
    ...:     C[idx] = A[idx] 
    ...:

Alternatively, you could replace the 0 and 1 with the appropriate (10,20) arrays. Here you've already created those, x and y:

In [37]: D = np.array([[y,x],[y,x]])                                                                 
In [38]: np.allclose(C,D)                                                                            
Out[38]: True
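If the scalars are not known in advance, the same "replace each entry with a full-size array" idea can be sketched with np.broadcast_arrays. This is only an illustration, not something from the question: the helper name to_array_stacked is hypothetical, and it assumes exactly two levels of list nesting with rows of equal length. It still visits each leaf once in Python, so it is not truly vectorized, but the loop runs over the 4 leaves rather than over the array elements.

```python
import numpy as np

x = np.ones((10, 20))
y = np.zeros((10, 20))
a = [[0, x], [y, 1]]

def to_array_stacked(nested):
    """Hypothetical helper: expand every leaf of a 2-level nested list to one common shape."""
    outer = len(nested)
    inner = len(nested[0])  # assumes all rows have the same length
    # Flatten the two list levels into a single list of leaves (scalars or arrays).
    leaves = [leaf for row in nested for leaf in row]
    # Each leaf is broadcast to the common shape of all leaves, here (10, 20).
    expanded = np.broadcast_arrays(*leaves)
    # Stacking copies the broadcast views into one new numeric array.
    stacked = np.stack(expanded)
    # Restore the outer (2, 2) layout in front of the leaf shape.
    return stacked.reshape((outer, inner) + stacked.shape[1:])

D2 = to_array_stacked(a)
print(D2.shape, D2.dtype)  # (2, 2, 10, 20) float64
```

The target shape does not need to be passed in here, because it falls out of broadcasting the leaves against each other.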

In general a few iterations on a complex task are ok. Keep in mind that (many) operations on an object dtype array are actually slower than operations on an equivalent list. It's the whole-array compiled operations on a numeric array that are relatively fast. That's not your case.

But

C[0,0,:,:] = 0

uses broadcasting - all (10,20) values of C[0,0] are filled with the scalar 0 via broadcasting.

C[0,1,:,:] = x

is a different kind of broadcasting, where the shape of the RHS already matches the slice on the left. It's unreasonable to expect numpy to handle both cases with one broadcasting operation.
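A small sketch of the two cases side by side, using a (2, 2, 10, 20) target as in the question. The name zero_block is introduced only for illustration; np.broadcast_to is used to make the scalar case explicit.

```python
import numpy as np

x = np.ones((10, 20))
C = np.empty((2, 2, 10, 20))

# Scalar on the right: the single value is stretched over all (10, 20) slots.
C[0, 0, :, :] = 0
# Array on the right: the shapes already match, so values are copied one to one.
C[0, 1, :, :] = x

# The two sources have different shapes, which is why each entry of the
# ragged input needs its own broadcast to the common (10, 20) shape.
zero_block = np.broadcast_to(0, (10, 20))  # read-only view, no data copied
print(np.shape(0), x.shape)  # () (10, 20)
print(zero_block.shape)      # (10, 20)
```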
