Copy a sub-recarray in stable NumPy

Question

Suppose I have data in a numpy.recarray and want to extract some of its columns. The result should be an actual copy of only those columns: data may be huge, so I don't want to copy everything, but I will likely modify these features without wanting to change data, so I don't want a view either.

Today, I would do the following:

import numpy as np

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
            dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
data = data.view(np.recarray)

features = data[['feature_1', 'feature_2']]

However, it raises the following FutureWarning from NumPy:

/path/to/numpy/core/records.py:513: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array.

This code may break in numpy 1.15 because this will return a view instead of a copy -- see release notes for details.

return obj.view(dtype=(self.dtype.type, obj.dtype))

This warning is very welcome, because I don't want a breaking change when I update NumPy. However, even after going through the release notes, it is not clear how best to write code that extracts columns as a copy today and that will stay stable through the upcoming releases.

In my particular case, near-optimal efficiency is required, and Pandas is unavailable. In these conditions, what would be the best workaround for this situation?

It is in flux and being reverted, according to the numpy discussion list. In the interim, try data2 = np.rec.fromrecords(data.data, dtype=data.dtype), which should yield data2: rec.array([( 1., 2., 0), ( 3., 4., 1)], dtype=[('feature_1', ' — NaN, Apr 30 '18 at 15:09
This is for copying the whole data. I explicitly said I wanted to avoid this. — Thrastylon, Apr 30 '18 at 15:16

hpaulj · Accepted Answer · 2018-04-30T19:59:38.853

As noted, multifield selection is in a state of flux. I recently updated to 1.14.2, and the behavior is back to what it was before 1.14.0.

In [114]: data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
     ...:             dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
In [115]: data
Out[115]: 
array([(1., 2., 0), (3., 4., 1)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')])
In [116]: features = data[['feature_1', 'feature_2']]
In [117]: features
Out[117]: 
array([(1., 2.), (3., 4.)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])

(I'm omitting the extra layer of recarray conversion.)

In 1.14.0 this dtype would include an offset value, indicating that features was a view, not a copy.
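
One way to see this for yourself is to inspect the field offsets of the selection. This is only a sketch: the names subset and looks_like_view are mine, and I select 'feature_1' and 'result' (rather than the two features) so that the skipped field shows up as a gap in the offsets.

```python
import numpy as np

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
subset = data[['feature_1', 'result']]

# dtype.fields maps each field name to (field dtype, byte offset).
offsets = {name: subset.dtype.fields[name][1] for name in subset.dtype.names}

# Size a record would have if the selected fields were stored back to back.
packed_itemsize = sum(subset.dtype[name].itemsize for name in subset.dtype.names)

# A view keeps the original record size, so its itemsize exceeds the packed size.
looks_like_view = subset.dtype.itemsize != packed_itemsize

print(offsets, subset.dtype.itemsize, packed_itemsize, looks_like_view)
```

If looks_like_view is True, the selection still carries the layout of data; if it is False, the selection is a packed copy.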

I can change values of features without changing data:

In [124]: features['feature_1']
Out[124]: array([1., 3.])
In [125]: features['feature_1'] = [4,5]
In [126]: features
Out[126]: 
array([(4., 2.), (5., 4.)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
In [127]: data
Out[127]: 
array([(1., 2., 0), (3., 4., 1)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')])

But without delving into the development discussion, I can't say what the long-term solution will be. Ideally it should offer both a way to fetch a view (which maintains a link to the original data buffer) and a way to fetch a copy (an array that is independent and freely modifiable).

I suspect the copy version will follow a recfunctions practice of constructing a new array with the new dtype, and then copying data field by field.

In [132]: data.dtype.descr
Out[132]: [('feature_1', '<f8'), ('feature_2', '<f8'), ('result', '<i8')]
In [133]: dt = data.dtype.descr[:-1]
In [134]: dt
Out[134]: [('feature_1', '<f8'), ('feature_2', '<f8')]
In [135]: arr = np.zeros(data.shape, dtype=dt)
In [136]: arr
Out[136]: 
array([(0., 0.), (0., 0.)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])
In [137]: for name in arr.dtype.fields:
     ...:     arr[name] = data[name]
     ...:     
In [138]: arr
Out[138]: 
array([(1., 2.), (3., 4.)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])

or use another recfunctions function, drop_fields (rf is numpy.lib.recfunctions, imported further down):

In [159]: rf.drop_fields(data, 'result')
Out[159]: 
array([(1., 2.), (3., 4.)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])

recfunctions has code that can copy complex dtypes, such as ones with nested dtypes. But for a simple one-layered dtype like this one, plain iteration over the field names is enough.

In general, structured arrays (and recarray) have many records, and a limited number of fields. So copying fields by name is relatively efficient.
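
The field-by-field copy can be wrapped in a small helper. This is a sketch under my own assumptions: copy_fields is a hypothetical name, not a NumPy function, and it is only meant for simple, non-nested dtypes like the one above.

```python
import numpy as np

def copy_fields(arr, names):
    """Hypothetical helper: packed, independent copy of the selected fields."""
    # Build a dtype holding only the requested fields, in the requested order.
    new_dtype = np.dtype([(name, arr.dtype[name]) for name in names])
    out = np.empty(arr.shape, dtype=new_dtype)
    # One assignment per field; each assignment copies the whole column.
    for name in names:
        out[name] = arr[name]
    return out

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])
data = data.view(np.recarray)

features = copy_fields(data, ['feature_1', 'feature_2']).view(np.recarray)
features['feature_1'] = [4, 5]  # modifies the copy only
print(features)
print(data)
```

Only the selected columns are copied, so the extra memory is that of the two feature columns rather than of the whole array.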

In [150]: import numpy.lib.recfunctions as rf
In [154]: arr = np.zeros(data.shape, dtype=dt)
In [155]: rf.recursive_fill_fields(data, arr)
Out[155]: 
array([(1., 2.), (3., 4.)],
      dtype=[('feature_1', '<f8'), ('feature_2', '<f8')])

but note that the code of drop_fields itself ends with the same pattern:

output = np.empty(base.shape, dtype=newdtype)
output = recursive_fill_fields(base, output)

Development notes at some point alluded to a recfunctions.compress_fields function, but that apparently was never actually added.
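
For readers on a newer NumPy: recfunctions there provides a repack_fields function that seems to fill this role. The sketch below assumes a release that ships it (it was not available in the 1.14.2 install used above), and the variable names are mine.

```python
import numpy as np
import numpy.lib.recfunctions as rf

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])

# On releases where multifield selection returns a view, this still has gaps.
selection = data[['feature_1', 'feature_2']]

# repack_fields converts to a dtype without the gaps, which copies the data
# when the selection is a padded view.
features = rf.repack_fields(selection)

features['feature_1'] = [4, 5]
print(features.dtype.itemsize)
print(data['feature_1'])
```

As far as I can tell, if the selection is already packed (as on releases where it is a copy), repack_fields hands it back unchanged, so the result is independent of data either way.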

Eric · Answer 2 · 2018-05-01T07:06:04.497

You could check if the result is a view with

features = data[['feature_1', 'feature_2']]
if np.may_share_memory(features, data):
    features = features.copy()

More fragile would be to check the version number:

features = data[['feature_1', 'feature_2']]
# per the warning, 1.15 is where the selection would start returning a view
if np.lib.NumpyVersion(np.__version__) >= np.lib.NumpyVersion('1.15.0'):
    features = features.copy()

Note that calling copy on a view like this does use up unnecessary memory: the copy keeps the view's padded layout, so it takes as much memory as the full array.
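
A quick way to check the memory cost on your own install is to compare nbytes. This is only a sketch; packed_nbytes is my own name for the size the two feature columns would need if stored back to back.

```python
import numpy as np

data = np.array([(1.0, 2.0, 0), (3.0, 4.0, 1)],
                dtype=[('feature_1', float), ('feature_2', float), ('result', int)])

features = data[['feature_1', 'feature_2']]
copied = features.copy()

# Bytes needed for just the selected columns, with no padding.
packed_nbytes = data.size * sum(data.dtype[name].itemsize
                                for name in features.dtype.names)

# If copied.nbytes equals data.nbytes, the copy kept the full record layout.
print(copied.nbytes, packed_nbytes, data.nbytes)
```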
