You could use np.unique to assign an integer "label" to each object in data['A']. Then you can apply np.bincount to the labels:
In [18]: uniq, label = np.unique(data['A'], return_inverse=True)
In [19]: np.column_stack([uniq, np.bincount(label)])
Out[19]:
array([['a1', 2],
       ['c1', 1]], dtype=object)
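Here data is the recarray from your question. For reference, here is a self-contained sketch of the same steps; the array a below is a hypothetical stand-in for data['A']:

import numpy as np

# Hypothetical stand-in for the 'A' column of the recarray
a = np.array(['a1', 'a1', 'c1'], dtype=object)

# Label each element with the index of its unique value,
# then count how many times each label occurs
uniq, label = np.unique(a, return_inverse=True)
counts = np.bincount(label)

print(np.column_stack([uniq, counts]))
# [['a1' 2]
#  ['c1' 1]]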
Note that operations on NumPy arrays of dtype object are no faster (and are often slower) than equivalent operations on lists. (You need to use arrays with native NumPy (i.e. non-object) dtypes to enjoy any speed advantage over pure Python.) For example, your computation may be faster if you use a dict of lists for data, and count the frequency with collections.Counter:
In [21]: data = {'A':['a1','a1','c1']}
In [22]: import collections
In [23]: collections.Counter(data['A'])
Out[23]: Counter({'a1': 2, 'c1': 1})
As hpaulj points out, you could use collections.Counter(data['A']) when data is a recarray too. It is faster than the np.unique/np.bincount method shown above, so that might be your best option if you must use a recarray of objects.
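For instance, a minimal sketch (the recarray here is a hypothetical three-field example, built the same way as the benchmark data below):

import collections
import numpy as np

# Hypothetical recarray with object-dtype fields
data = np.array([('a1', 'x', 'y'), ('a1', 'x', 'y'), ('c1', 'x', 'y')],
                dtype=[('A', object), ('B', object), ('C', object)]).view(np.recarray)

print(collections.Counter(data['A']))   # Counter({'a1': 2, 'c1': 1})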
Here is a benchmark showing the relative speeds:
import numpy as np

data = np.random.choice(['a','b','c'], size=(300,)).astype(
    [('A', object), ('B', object), ('C', object)]).view(np.recarray)
data2 = {key: data[key].tolist() for key in ['A', 'B', 'C']}
Using Counter on a dict of lists is fastest:
In [92]: %timeit collections.Counter(data2['A'])
100000 loops, best of 3: 13.7 µs per loop
Using Counter on an array of dtype object is next fastest:
In [91]: %timeit collections.Counter(data['A'])
10000 loops, best of 3: 29.1 µs per loop
My original suggestion is downright slow (though this is an apples-to-oranges comparison, since it returns an array, not a dict):
In [93]: %%timeit
....: uniq, label = np.unique(data['A'], return_inverse=True)
....: np.column_stack([uniq, np.bincount(label)])
....:
10000 loops, best of 3: 118 µs per loop
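If you want the unique/bincount result as a dict, to match what Counter returns, a minimal sketch, again using the hypothetical sample array from above:

import numpy as np

a = np.array(['a1', 'a1', 'c1'], dtype=object)

# Pair each unique value with its count in a plain dict
uniq, label = np.unique(a, return_inverse=True)
freq = dict(zip(uniq, np.bincount(label).tolist()))
print(freq)   # {'a1': 2, 'c1': 1}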