Without using pandas you can select a subset of the fields of a structured array (recarray). For example:
In [338]: dt=np.dtype('i,f,i,f')
In [340]: A=np.ones((3,),dtype=dt)
In [341]: A[:]=(1,2,3,4)
In [342]: A
Out[342]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
Select a subset of the fields and copy it:
In [343]: B=A[['f1','f3']].copy()
In [344]: B
Out[344]:
array([(2.0, 4.0), (2.0, 4.0), (2.0, 4.0)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
This copy can be modified independently of A:
In [346]: B['f3']=[.1,.2,.3]
In [347]: B
Out[347]:
array([(2.0, 0.10000000149011612), (2.0, 0.20000000298023224),
(2.0, 0.30000001192092896)],
dtype=[('f1', '<f4'), ('f3', '<f4')])
In [348]: A
Out[348]:
array([(1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0), (1, 2.0, 3, 4.0)],
dtype=[('f0', '<i4'), ('f1', '<f4'), ('f2', '<i4'), ('f3', '<f4')])
Multi-field indexing of structured arrays is not highly developed. A[['f0','f1']] is enough for viewing, but it will warn or give an error if you try to modify that subset. That's why I used copy when creating B.
There's a set of functions that facilitate adding and removing fields from recarrays. Mostly they construct a new dtype and an empty array, and then copy fields by name. They live in a submodule that has to be imported explicitly:
import numpy.lib.recfunctions as rf
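As a rough sketch of that "new dtype, empty array, copy by name" pattern, something like the following should work. The helper name select_fields is my own invention for illustration and is not part of numpy; rf.drop_fields is the library function that does a similar job, and I pass usemask=False so that it returns a plain ndarray rather than a masked array.

```python
import numpy as np
import numpy.lib.recfunctions as rf


def select_fields(arr, names):
    """Hypothetical helper: copy the named fields into a new, packed array."""
    # Build a dtype that contains only the requested fields
    new_dt = np.dtype([(name, arr.dtype[name]) for name in names])
    # Allocate an uninitialized array with that dtype
    out = np.empty(arr.shape, dtype=new_dt)
    # Copy the data one field at a time, matching by name
    for name in names:
        out[name] = arr[name]
    return out


A = np.ones((3,), dtype='i4,f4,i4,f4')
A[:] = (1, 2, 3, 4)

B = select_fields(A, ['f1', 'f3'])

# The library route: drop the unwanted fields instead of naming the wanted ones
C = rf.drop_fields(A, ['f0', 'f2'], usemask=False)

print(B.dtype)
print(C.dtype)
```

Both B and C own their data, so changing them leaves A alone.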
Update
With newer numpy versions (1.16 and later), multi-field indexing has changed:
In [17]: B=A[['f1','f3']]
In [18]: B
Out[18]:
array([(2., 4.), (2., 4.), (2., 4.)],
dtype={'names':['f1','f3'], 'formats':['<f4','<f4'], 'offsets':[4,12], 'itemsize':16})
This B is a true view, referencing the same data buffer as A. The offsets let it ignore the missing fields. Those fields can be removed with repack_fields, as just documented.
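A small check of the view behaviour and of repack_fields, assuming a numpy recent enough that multi-field indexing returns a view. The array mirrors the A used above, with the field sizes spelled out explicitly; the name packed is just for this sketch.

```python
import numpy as np
import numpy.lib.recfunctions as rf

A = np.ones((3,), dtype='i4,f4,i4,f4')
A[:] = (1, 2, 3, 4)

# Multi-field index: a view that keeps the original itemsize and offsets
B = A[['f1', 'f3']]
print(np.shares_memory(A, B))
print(B.dtype.itemsize)

# repack_fields copies the data into a dtype without the gaps
packed = rf.repack_fields(B)
print(packed.dtype.itemsize)
print(np.shares_memory(A, packed))

# Writing through the view also changes A; packed is unaffected
B['f1'] = 20
print(A['f1'])
print(packed['f1'])
```

So the view is cheap but tied to A, while the repacked array is an independent, smaller copy.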
But when putting this into a dataframe, it doesn't look like we need to do that.
In [19]: df = pd.DataFrame(A)
In [21]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   f0      3 non-null      int32
 1   f1      3 non-null      float32
 2   f2      3 non-null      int32
 3   f3      3 non-null      float32
dtypes: float32(2), int32(2)
memory usage: 176.0 bytes
In [22]: df = pd.DataFrame(B)
In [24]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   f1      3 non-null      float32
 1   f3      3 non-null      float32
dtypes: float32(2)
memory usage: 152.0 bytes
The frame created from B is smaller.
Sometimes when making a dataframe from an array, the array itself is used as the frame's memory. Changing values in the source array will change the values in the frame. But with structured arrays, pandas makes a copy of the data, with a different memory layout.
Columns of matching dtype are grouped into a common NumericBlock:
In [42]: pd.DataFrame(A)._data
Out[42]:
BlockManager
Items: Index(['f0', 'f1', 'f2', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(1, 5, 2), 2 x 3, dtype: float32
NumericBlock: slice(0, 4, 2), 2 x 3, dtype: int32
In [43]: pd.DataFrame(B)._data
Out[43]:
BlockManager
Items: Index(['f1', 'f3'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: float32
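The claim that pandas copies the data from a structured array can be checked with a quick experiment. This is only a sketch of what I'd expect from the behaviour described above, not a guarantee for every pandas version: change the source array after building the frame and see whether the frame follows.

```python
import numpy as np
import pandas as pd

A = np.ones((3,), dtype='i4,f4,i4,f4')
A[:] = (1, 2, 3, 4)

# Build the frame from the structured array
df = pd.DataFrame(A)

# Modify the source array afterwards
A['f0'] = 99

# If pandas copied the data, the frame still holds the old values
print(df['f0'].tolist())
print(A['f0'].tolist())
```

If the frame kept its original values, then it does not share memory with A, and trimming A with repack_fields beforehand is not needed just to keep the frame small.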