Split numpy recarray based on value in one column

Question

My real data has 10,000+ items. I have a complicated NumPy record array with a format roughly like:

a = (((1., 2., 3.), 4., 'metadata1'), 
     ((1., 3., 5.), 5., 'metadata1'), 
     ((1., 2., 4.), 5., 'metadata2'),
     ((1., 2., 5.), 5., 'metadata2'),  
     ((1., 3., 8.), 5., 'metadata3'))

My columns are defined by dtype = [('coords', '3f4'), ('values', 'f4'), ('meta', 'S10')]. I get a list of all my possible meta values by doing set(a['meta']).

And I'd like to split it into smaller lists based on the 'meta' column. Ideally, I'd like results like:

a['metadata1'] == (((1., 2., 3.), 4.), ((1., 3., 5.), 5.))
a['metadata2'] == (((1., 2., 4.), 5.), ((1., 2., 5.), 5.))
a['metadata3'] == (((1., 3., 8.), 5.))

or

a[0] = (((1., 2., 3.), 4., 'metadata1'), ((1., 3., 5.), 5., 'metadata1'))
a[1] = (((1., 2., 4.), 5., 'metadata2'), ((1., 2., 5.), 5., 'metadata2'))
a[2] = (((1., 3., 8.), 5., 'metadata3'))

or any other conveniently split format.

For a large dataset, though, the former is easier on memory. Any ideas on how to do this split? I've seen some other questions here, but they all test for numerical values.

ebarr · Accepted Answer · 2014-05-30T23:46:56.937

2

You can always access those rows easily using boolean (fancy) indexing:

In [34]: a[a['meta']=='metadata2']
Out[34]: 
rec.array([(array([ 1.,  2.,  4.], dtype=float32), 5.0, 'metadata2'),
           (array([ 1.,  2.,  5.], dtype=float32), 5.0, 'metadata2')], 
          dtype=[('coords', '<f4', (3,)), ('values', '<f4'), ('meta', 'S10')])

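You can use this approach to create a lookup dictionary for the different meta types:

import numpy as np

meta_dict = {}
# One boolean mask (and one copy of the matching rows) per distinct meta value.
for meta_type in np.unique(a['meta']):
    meta_dict[meta_type] = a[a['meta'] == meta_type]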

This makes one full pass over the array for every meta type, so it becomes very inefficient when there are a large number of meta types.
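One way to avoid rescanning the array for every meta type is to sort by the 'meta' column once and cut the sorted array at the group boundaries. The following is only a sketch under that assumption, not part of the original answer; the names `order`, `sorted_a`, `starts` and `groups` are illustrative, and the sample array is rebuilt from the question (with bytes literals, since an 'S10' field holds bytes on Python 3).

```python
import numpy as np

dtype = [('coords', '3f4'), ('values', 'f4'), ('meta', 'S10')]
a = np.array([((1., 2., 3.), 4., b'metadata1'),
              ((1., 3., 5.), 5., b'metadata1'),
              ((1., 2., 4.), 5., b'metadata2'),
              ((1., 2., 5.), 5., b'metadata2'),
              ((1., 3., 8.), 5., b'metadata3')], dtype=dtype)

# Stable sort so rows keep their original relative order within each group.
order = np.argsort(a['meta'], kind='stable')
sorted_a = a[order]

# return_index gives the position of the first row of each group in sorted_a.
keys, starts = np.unique(sorted_a['meta'], return_index=True)

# Cut at every group boundary except the first one (index 0).
groups = np.split(sorted_a, starts[1:])

# Keys are bytes objects here, e.g. b'metadata2'.
meta_dict = dict(zip(keys.tolist(), groups))
```

The pieces returned by `np.split` are slices of `sorted_a`, so beyond the one sorted copy no per-group copies are made.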

A more efficient solution might be to use a pandas DataFrame. DataFrames have groupby functionality that performs exactly the task you describe.
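A rough sketch of the pandas route, assuming pandas is installed. The 3-element 'coords' field is spread over three columns here because a DataFrame column holds one scalar per row; the column names `x`, `y`, `z` are made up for the illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

dtype = [('coords', '3f4'), ('values', 'f4'), ('meta', 'S10')]
a = np.array([((1., 2., 3.), 4., b'metadata1'),
              ((1., 3., 5.), 5., b'metadata1'),
              ((1., 2., 4.), 5., b'metadata2'),
              ((1., 2., 5.), 5., b'metadata2'),
              ((1., 3., 8.), 5., b'metadata3')], dtype=dtype)

df = pd.DataFrame({
    'x': a['coords'][:, 0],          # hypothetical column names for the coords
    'y': a['coords'][:, 1],
    'z': a['coords'][:, 2],
    'values': a['values'],
    'meta': a['meta'].astype(str),   # bytes -> str so the keys are plain strings
})

# One sub-frame per meta value, with the now-redundant 'meta' column dropped.
groups = {key: grp.drop(columns='meta') for key, grp in df.groupby('meta')}
```

`groups['metadata2']` then holds the two 'metadata2' rows without the meta column, which is close to the first output format asked for in the question.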

edited May 30 '14 at 23:46

answered May 30 '14 at 23:41


A minor difference from what the OP asks: `'meta'` appears in both the key and the value. But removing it probably isn't worth the effort, unless space is particularly precious. And if this reorganization isn't done frequently, trying to find something faster ('more efficient') might not be worth the extra programming time. – hpaulj May 31 '14 at 01:50
Could use `a[a['meta']=='metadata2'][['coords', 'values']]` to strip the meta column (a record array is 1-D, so a `[:,:-1]` slice raises an IndexError). – troy.unrau Jun 01 '14 at 17:50
