Suppose we have a ragged, nested sequence like the following:

import numpy as np
x = np.ones((10, 20))
y = np.zeros((10, 20))
a = [[0, x], [y, 1]]

and want to create a full numpy array that broadcasts the ragged sub-sequences (to match the maximum dimension of any other sub-sequence, in this case (10,20)) where necessary. First, we might try to use np.array(a), which yields the warning:

VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray

By changing to np.array(a, dtype=object), we do get an array. However, this is an array of objects rather than floats, and it retains the ragged sub-sequences, which have not been broadcast as desired. To fix this, I created a new function to_array which takes a (possibly ragged, nested) sequence and a shape and returns a full numpy array of that shape:

    def to_array(a, shape):
        a = np.array(a, dtype=object)
        b = np.empty(shape)
        for index in np.ndindex(a.shape):
            b[index] = a[index]
        return b
    
    b = np.array(a, dtype=object)
    c = to_array(a, (2, 2, 10, 20))
    
    print(b.shape, b.dtype) # prints (2, 2) object
    print(c.shape, c.dtype) # prints (2, 2, 10, 20) float64

Note that c, not b, is the desired result. However, to_array relies on a for loop over np.ndindex, and Python for loops are slow for big arrays.

Is there an alternative, vectorized way to write the to_array function?

David Medinets
user76284
  • "that broadcasts the ragged sub-sequences (to match the maximum dimension of any other sub-sequence, in this case (10,20)) where necessary. " Sorry, I can't follow this. Could you show a small example that includes the exact, complete desired output (small enough that you can produce it by hand, but big enough to make it clear what you're looking for)? – Karl Knechtel Aug 11 '20 at 21:54
  • @KarlKnechtel `c` is the desired output. – user76284 Aug 11 '20 at 22:00

1 Answer

Given the target shape, a few iterations don't seem overly expensive:

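In [34]: A = np.array(a, dtype=object)   # assumed: the (2, 2) object array built from the question's a
In [35]: C = np.empty((A.shape+x.shape), x.dtype)
In [36]: for idx in np.ndindex(A.shape):
    ...:     C[idx] = A[idx]
    ...: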

Alternatively, you could replace the 0 and 1 with the appropriate (10,20) arrays. Here you've already created those, x and y:

In [37]: D = np.array([[y,x],[y,x]])                                                                 
In [38]: np.allclose(C,D)                                                                            
Out[38]: True
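The replacement idea above can be generalized a little. The following is only a sketch, assuming a two-level nested list whose leaves are all broadcastable to one known inner shape; the helper name to_array_stacked is hypothetical and not part of numpy. It still loops over the leaves in Python, but only once per leaf (4 here), and the copy into the result is done by np.stack:

```python
import numpy as np

def to_array_stacked(a, inner_shape):
    """Hypothetical helper: broadcast each leaf of a 2-level nested list
    to inner_shape, then stack the results into one float array."""
    outer_shape = (len(a), len(a[0]))
    # np.broadcast_to returns a read-only view; np.stack copies the views
    # into a new contiguous array of shape (n_leaves,) + inner_shape.
    leaves = [np.broadcast_to(np.asarray(v, dtype=float), inner_shape)
              for row in a for v in row]
    return np.stack(leaves).reshape(outer_shape + tuple(inner_shape))

x = np.ones((10, 20))
y = np.zeros((10, 20))
a = [[0, x], [y, 1]]

E = to_array_stacked(a, x.shape)
print(E.shape, E.dtype)
```

Whether this beats the np.ndindex loop will depend on the sizes involved; with only a handful of leaves the difference is likely to be small.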

In general a few iterations on a complex task are ok. Keep in mind that (many) operations on an object dtype array are actually slower than operations on an equivalent list. It's the whole-array compiled operations on a numeric array that are relatively fast. That's not your case.

But

C[0,0,:,:] = 0

uses broadcasting - all (10,20) values of C[0,0] are filled with the scalar 0 via broadcasting.

C[0,1,:,:] = x

is a different kind of broadcasting, where the shape of the right-hand side already matches the left. It's unreasonable to expect numpy to handle both cases with one broadcasting operation.
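To make the two cases concrete, here is a small self-contained sketch (the names scalar_view and array_view are just for illustration). Each assignment broadcasts its own right-hand side to the (10,20) slot independently, which is what np.broadcast_to does explicitly:

```python
import numpy as np

x = np.ones((10, 20))
C = np.empty((2, 2, 10, 20))

C[0, 0, :, :] = 0   # scalar: stretched to fill all (10, 20) positions
C[0, 1, :, :] = x   # array: shape already matches, copied element by element

# The same two broadcasts written out explicitly as read-only views.
scalar_view = np.broadcast_to(0, (10, 20))
array_view = np.broadcast_to(x, (10, 20))

print(np.array_equal(C[0, 0], scalar_view))
print(np.array_equal(C[0, 1], array_view))
```

Because the right-hand sides have different shapes, each slot needs its own assignment (or its own broadcast_to call); that per-element step is what the loop in the question is doing.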

hpaulj