How to get field of nested numpy structured array (advanced indexing)

Question

I have a complex nested structured array (often used as a recarray). It's simplified for this example, but in the real case there are multiple levels.

import numpy as np

c = [('x','f8'),('y','f8')]
A = [('data_string','|S20'),('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)
print(zeros["data_val"]["x"])

I am trying to index the "x" field of the nested array's datatype without naming the preceding fields. I was hoping something like print(zeros[:,"x"]) would let me slice all of the top-level data, but it doesn't work.

Is there a way to do fancy indexing on nested structured arrays while accessing their field names?

Each field level has to be indexed separately. You can't combine them into one. — hpaulj, Jan 12 '22 at 15:39
@hpaulj so it's not possible to treat it as multi-dimensional and index the top level as "all" or [:] in order to access the lowest level? Meaning I do need to know what the preceding level field names are? — 001001, Jan 12 '22 at 16:05
If there were, it would be documented on the https://numpy.org/doc/stable/user/basics.rec.html page. Indexing fields is more like `dict` indexing than multidimensional array indexing. You have defined a nested dtype, not a multidimensional dtype. — hpaulj, Jan 12 '22 at 16:19
It looks like you want a dataframe data structure like the one provided by Pandas (but not Numpy). — Jérôme Richard, Jan 12 '22 at 16:32

hpaulj · Accepted Answer · 2022-01-12T18:53:48.767

I don't know if displaying the resulting array helps you visualize the nesting or not.

In [279]: c = [('x','f8'),('y','f8')]
     ...: A = [('data_string','|S20'),('data_val', c, 2)]
     ...: arr = np.zeros(2, dtype=A)
In [280]: arr
Out[280]: 
array([(b'', [(0., 0.), (0., 0.)]), (b'', [(0., 0.), (0., 0.)])],
      dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])

Note how the nesting of () and [] reflects the nesting of the fields.

arr.dtype only has direct access to the top level field names:

In [281]: arr.dtype.names
Out[281]: ('data_string', 'data_val')
In [282]: arr['data_val']
Out[282]: 
array([[(0., 0.), (0., 0.)],
       [(0., 0.), (0., 0.)]], dtype=[('x', '<f8'), ('y', '<f8')])

But having accessed one field, we can then look at its fields:

In [283]: arr['data_val'].dtype.names
Out[283]: ('x', 'y')
In [284]: arr['data_val']['x']
Out[284]: 
array([[0., 0.],
       [0., 0.]])
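
Since each level has to be indexed separately, that level-by-level lookup can be automated by walking dtype.names recursively. What follows is only a rough sketch: find_field is a made-up name, not a NumPy function, and it assumes the field name you want is unique (it returns the first match, depth-first).

```python
import numpy as np

def find_field(arr, name):
    """Return the first nested field called `name`, or None if absent.

    Hypothetical helper (not part of NumPy): it indexes one field level
    at a time, because NumPy does not accept a combined index.
    """
    if arr.dtype.names is None:      # plain (non-structured) dtype: nothing to search
        return None
    for field in arr.dtype.names:
        sub = arr[field]             # single-field indexing returns a view
        if field == name:
            return sub
        found = find_field(sub, name)
        if found is not None:
            return found
    return None

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', '|S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)
arr['data_val']['x'][1] = [1, 2]

x = find_field(arr, 'x')             # same data as arr['data_val']['x']
print(x.shape)
```

Because every step is a single-field lookup, the result should still be a view, so assigning to x writes into arr.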

Record number indexing is separate, and can be multidimensional in the usual sense:

In [285]: arr[1]['data_val']['x'] = [1,2]
In [286]: arr[0]['data_val']['y'] = [3,4]
In [287]: arr
Out[287]: 
array([(b'', [(0., 3.), (0., 4.)]), (b'', [(1., 0.), (2., 0.)])],
      dtype=[('data_string', 'S20'), ('data_val', [('x', '<f8'), ('y', '<f8')], (2,))])

Since the data_val field has a (2,) shape, we can mix/match that index with the (2,) shape of arr:

In [289]: arr['data_val']['x']
Out[289]: 
array([[0., 0.],
       [1., 2.]])
In [290]: arr['data_val']['x'][[0,1],[0,1]]
Out[290]: array([0., 2.])
In [291]: arr['data_val'][[0,1],[0,1]]
Out[291]: array([(0., 3.), (2., 0.)], dtype=[('x', '<f8'), ('y', '<f8')])
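
To illustrate the mix/match point, here is a small self-contained check (the variable names are just for this sketch) that the record index and the field names can be applied in any order, and that the results are views onto the same memory:

```python
import numpy as np

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', '|S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)
arr['data_val']['x'][1] = [1, 2]

# Three orderings of the same record index and field names
first = arr[1]['data_val']['x']
second = arr['data_val'][1]['x']
third = arr['data_val']['x'][1]
same = np.array_equal(first, second) and np.array_equal(second, third)

# Field indexing gives views, so this assignment lands in arr itself
arr['data_val']['x'][0, 1] = 7
print(same, arr[0])
```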

I mentioned that fields indexing is like dict indexing. Note this display of the fields:

In [294]: arr.dtype.fields
Out[294]: 
mappingproxy({'data_string': (dtype('S20'), 0),
              'data_val': (dtype(([('x', '<f8'), ('y', '<f8')], (2,))), 20)})

Each record is stored as a block of 52 bytes:

In [299]: arr.itemsize
Out[299]: 52
In [300]: arr.dtype.str
Out[300]: '|V52'

20 of those bytes are data_string, and 32 are the 2 c records (16 bytes each):

In [303]: arr['data_val'].dtype.str
Out[303]: '|V16'
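
The same byte accounting can be read off the dtype without creating an array. This is a sketch using only dtype attributes (fields, base, shape, itemsize); the dictionary names are mine:

```python
import numpy as np

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', '|S20'), ('data_val', c, 2)]
dt = np.dtype(A)

# Top level: field name -> byte offset inside the 52-byte record
top_offsets = {name: dt.fields[name][1] for name in dt.names}

# 'data_val' is a subarray dtype: a 16-byte base record repeated (2,) times
sub = dt['data_val']
inner = sub.base
inner_offsets = {name: inner.fields[name][1] for name in inner.names}

print(dt.itemsize, top_offsets)
print(sub.shape, sub.itemsize, inner.itemsize, inner_offsets)
```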

You can ask for a list of fields, and get a special kind of view. Its dtype display is a little different:

In [306]: arr[['data_val']]
Out[306]: 
array([([(0., 3.), (0., 4.)],), ([(1., 0.), (2., 0.)],)],
      dtype={'names': ['data_val'], 'formats': [([('x', '<f8'), ('y', '<f8')], (2,))], 'offsets': [20], 'itemsize': 52})

In [311]: arr['data_val'][['y']]
Out[311]: 
array([[(3.,), (4.,)],
       [(0.,), (0.,)]],
      dtype={'names': ['y'], 'formats': ['<f8'], 'offsets': [8], 'itemsize': 16})

Each 'data_val' starts 20 bytes into the 52 byte record. And each 'y' starts 8 bytes into its 16 byte record.
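
If the padding kept by such a multi-field view is unwanted, numpy.lib.recfunctions.repack_fields can produce a packed copy. This is a sketch that assumes NumPy 1.16 or later, where multi-field indexing returns a view with the original offsets:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', '|S20'), ('data_val', c, 2)]
arr = np.zeros(2, dtype=A)
arr[0]['data_val']['y'] = [3, 4]

view = arr['data_val'][['y']]     # keeps offset 8 inside a 16-byte item
packed = rfn.repack_fields(view)  # copy whose item holds only the 'y' field

print(view.dtype.itemsize, packed.dtype.itemsize)
```

The packed result no longer shares memory layout with arr, so it is better suited to saving or passing on than to writing back into the original.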

this is really helpful, although it also raises more questions for me outside the scope of the original question. — 001001, Jan 12 '22 at 20:08

score 1 · Answer 2 · answered Jan 12 '22 at 16:43

The statement zeros['data_val'] creates a view into the array, which may already be non-contiguous at that point. You can extract multiple values of x because c is an array type, meaning that x has clearly defined strides and shape. The semantics of the statement zeros[:, 'x'] are very unclear. For example, what happens to data_string, which has no x? I would expect an error; you might expect something else.
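
For what it's worth, a quick check suggests NumPy does reject the combined index. The sketch below assumes the failure surfaces as an IndexError, which is what I'd expect from current NumPy since a string is not a valid positional index:

```python
import numpy as np

c = [('x', 'f8'), ('y', 'f8')]
A = [('data_string', '|S20'), ('data_val', c, 2)]
zeros = np.zeros(1, dtype=A)

try:
    zeros[:, 'x']                 # field name mixed into a positional index
    raised = False
except IndexError as exc:
    raised = True
    print(type(exc).__name__, exc)

# Indexing one field level at a time works
x = zeros['data_val']['x']
print(x.shape)
```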

The only way I can see the index being simplified is if you expand c into A directly, sort of like an anonymous structure in C, except you can't do that easily with an array.
