
I have an object of type numpy.core.records.recarray. I want to use it effectively as a pandas DataFrame. More precisely, I want to use a subset of its columns in order to obtain a new recarray, the same way you would do pandas_dataframe[[selected_columns]].

What's the easiest way to achieve this?

Baron Yugovich

2 Answers


Without using pandas you can select a subset of the fields of a structured array (recarray). For example:

In [338]: dt=np.dtype('i,f,i,f')
In [340]: A=np.ones((3,),dtype=dt)
In [341]: A[:]=(1,2,3,4)

In [342]: A
Out[342]: 
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])

Now take a subset of the fields:

In [343]: B=A[['f1','f3']].copy()

In [344]: B
Out[344]: 
array([(2.0, 4.0), (2.0, 4.0), (2.0, 4.0)], 
      dtype=[('f1', '<f4'), ('f3', '<f4')])

that can be modified independently of A:

In [346]: B['f3']=[.1,.2,.3]

In [347]: B
Out[347]: 
array([(2.0, 0.10000000149011612), (2.0, 0.20000000298023224),
       (2.0, 0.30000001192092896)], 
      dtype=[('f1', '<f4'), ('f3', '<f4')])

In [348]: A
Out[348]: 
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])

Multi-field selection on structured arrays is not highly developed. A[['f0','f1']] is fine for viewing, but it will warn or raise an error if you try to modify that subset. That's why I used copy when making B.

There's a set of functions that facilitate adding and removing fields from recarrays. I'll have to look up the access pattern. But mostly they construct a new dtype and an empty array, and then copy fields by name.

import numpy.lib.recfunctions as rf
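For example, drop_fields and append_fields from that module both follow the construct-and-copy pattern. A minimal sketch, reusing the A from above (usemask=False is passed so plain ndarrays come back rather than masked arrays):

```python
import numpy as np
import numpy.lib.recfunctions as rf

dt = np.dtype('i,f,i,f')
A = np.ones((3,), dtype=dt)
A[:] = (1, 2, 3, 4)

# drop_fields builds a new array without the named fields
B = rf.drop_fields(A, ['f0', 'f2'], usemask=False)
print(B.dtype.names)   # ('f1', 'f3')

# append_fields copies everything into a new array with one more field
C = rf.append_fields(A, 'f4', np.arange(3), usemask=False)
print(C.dtype.names)   # ('f0', 'f1', 'f2', 'f3', 'f4')
```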

Update

With newer numpy versions (1.16+), multi-field indexing has changed:

In [17]: B=A[['f1','f3']]

In [18]: B
Out[18]: 
array([(2., 4.), (2., 4.), (2., 4.)],
      dtype={'names':['f1','f3'], 'formats':['<f4','<f4'], 'offsets':[4,12], 'itemsize':16})

This B is a true view, referencing the same data buffer as A. The offsets let it skip over the missing fields. That padding can be removed with repack_fields, as documented in the numpy structured-arrays guide.
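A quick sketch of repack_fields (this assumes numpy 1.16 or later, where the multi-field index is a view):

```python
import numpy as np
from numpy.lib import recfunctions as rf

dt = np.dtype('i,f,i,f')
A = np.ones((3,), dtype=dt)
A[:] = (1, 2, 3, 4)

B = A[['f1', 'f3']]        # a view: itemsize is still 16 bytes, with offsets
C = rf.repack_fields(B)    # a packed copy: itemsize shrinks to 8 bytes
print(B.dtype.itemsize, C.dtype.itemsize)   # 16 8
```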

But when putting this into a dataframe, it doesn't look like we need to do that.

In [19]: df = pd.DataFrame(A)
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   f0      3 non-null      int32  
 1   f1      3 non-null      float32
 2   f2      3 non-null      int32  
 3   f3      3 non-null      float32
dtypes: float32(2), int32(2)
memory usage: 176.0 bytes

In [22]: df = pd.DataFrame(B)
In [24]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   f1      3 non-null      float32
 1   f3      3 non-null      float32
dtypes: float32(2)
memory usage: 152.0 bytes

The frame created from B is smaller.

Sometimes when making a dataframe from an array, the array itself is used as the frame's memory. Changing values in the source array will change the values in the frame. But with structured arrays, pandas makes a copy of the data, with a different memory layout.
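A small check of that copy behavior, using the same A as above (the mutated value 99.0 is arbitrary):

```python
import numpy as np
import pandas as pd

dt = np.dtype('i,f,i,f')
A = np.ones((3,), dtype=dt)
A[:] = (1, 2, 3, 4)

df = pd.DataFrame(A)
A['f1'] = 99.0                # mutate the source structured array
print(df['f1'].tolist())      # the frame still holds the original values
```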

Columns of matching dtype are grouped into a common NumericBlock:

In [42]: pd.DataFrame(A)._data
Out[42]: 
BlockManager
Items: Index(['f0', 'f1', 'f2', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(1, 5, 2), 2 x 3, dtype: float32
NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int32

In [43]: pd.DataFrame(B)._data
Out[43]: 
BlockManager
Items: Index(['f1', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: float32
hpaulj
  • While this does indeed produce a copy of a recarray with a subset of fields, the copy has the same memory footprint. In my case I'm "copying" two 200kb fields from a 9 GB recarray, and the copy also uses 9GB, even though it's only really using 400kb to store the copied objects – Jthorpe May 28 '22 at 15:31
  • @Jthorpe, https://numpy.org/doc/stable/user/basics.rec.html#accessing-multiple-fields – hpaulj May 28 '22 at 15:45
  • Are you pointing out that accessing multiple fields returns a view? What I observed is that even `A[['f1','f3']].copy()` has the same memory footprint as A – Jthorpe May 28 '22 at 15:51
  • @Jthorpe, since the time I wrote this answer the handling of multifield indexing has changed. It's a true view. You have to use a documented `repack` if you want a copy with reduced memory. – hpaulj May 28 '22 at 17:14

In addition to @hpaulj's answer: you'll want to repack the copy, otherwise the copied subset will have the same memory footprint as the original.

import numpy as np
# note that you have to import this library explicitly
import numpy.lib.recfunctions

# B has a subset of "columns" but uses the same amount of memory as A
B = A[['f1','f3']].copy()

# C has a smaller memory footprint
C = np.lib.recfunctions.repack_fields(B)
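To see the footprint difference directly, compare nbytes on a larger array (1000 rows here is an arbitrary size for illustration):

```python
import numpy as np
from numpy.lib import recfunctions as rf

dt = np.dtype('i,f,i,f')           # 16 bytes per row
A = np.ones((1000,), dtype=dt)

B = A[['f1', 'f3']].copy()          # copy, but padded dtype: same bytes per row
C = rf.repack_fields(B)             # packed copy: 8 bytes per row

print(A.nbytes, B.nbytes, C.nbytes)
```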
Jthorpe