
Here's my input:

data = np.array([('a1', np.nan, 'a2'),
                 ('a1', 'b2', 'b1'),
                 ('c1', 'c1', np.nan)],
                dtype=[('A', object),
                       ('B', object),
                       ('C', object)]).view(np.recarray)

I want to count the frequency of each value a variable takes, and I want the output to look something like this (say, for freq('A')):

array([('a1', 2), ('c1', 1)])

I've tried np.bincount(), but apparently it doesn't work for object dtypes. Is there a way to achieve this using NumPy?

MSeifert
geedee

1 Answer


You could use np.unique to assign an integer "label" to each object in data['A']. Then you can apply np.bincount to the labels:

In [18]: uniq, label = np.unique(data['A'], return_inverse=True)

In [19]: np.column_stack([uniq, np.bincount(label)])
Out[19]: 
array([['a1', 2],
       ['c1', 1]], dtype=object)
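In NumPy 1.9 and later, `np.unique` can also return the counts directly via `return_counts=True`, which folds the `np.bincount` step into a single call; a minimal sketch producing the same result:

```python
import numpy as np

a = np.array(['a1', 'a1', 'c1'], dtype=object)

# return_counts=True gives the count of each unique value directly,
# so no separate labeling/bincount pass is needed:
uniq, counts = np.unique(a, return_counts=True)
result = np.column_stack([uniq, counts])
print(result)
# [['a1' 2]
#  ['c1' 1]]
```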

Note that operations on NumPy arrays of dtype object are no faster (and often are slower) than equivalent operations on lists. (You need to use arrays with native NumPy (i.e. non-object) dtypes to enjoy any speed advantage over pure Python.) For example, your computation may be faster if you use a dict of lists for data, and count the frequency with collections.Counter:

In [21]: data = {'A':['a1','a1','c1']}

In [22]: import collections

In [23]: collections.Counter(data['A'])
Out[23]: Counter({'a1': 2, 'c1': 1})
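If you want output closer to the array-of-tuples shape the question asked for, `Counter.most_common()` returns `(value, count)` pairs sorted by frequency; a small sketch:

```python
import collections

data = {'A': ['a1', 'a1', 'c1']}

# most_common() with no argument lists every (value, count) pair,
# ordered from most to least frequent:
pairs = collections.Counter(data['A']).most_common()
print(pairs)  # [('a1', 2), ('c1', 1)]
```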

As hpaulj points out, you could use collections.Counter(data['A']) when data is a recarray too. It is faster than the np.unique/np.bincount method shown above, so that might be your best option if you must use a recarray of objects.


Here is a benchmark showing the relative speeds:

# astype broadcasts each random value into all three fields:
data = np.random.choice(['a', 'b', 'c'], size=(300,)).astype(
    [('A', object), ('B', object), ('C', object)]).view(np.recarray)
data2 = {key: data[key].tolist() for key in ['A', 'B', 'C']}

Using Counter on a dict of lists is fastest:

In [92]: %timeit collections.Counter(data2['A'])
100000 loops, best of 3: 13.7 µs per loop

Using Counter on an array of dtype object is next fastest:

In [91]: %timeit collections.Counter(data['A'])
10000 loops, best of 3: 29.1 µs per loop

My original suggestion is downright slow (though this is an apples-to-oranges comparison since this returns an array, not a dict):

In [93]: %%timeit 
   ....: uniq, label = np.unique(data['A'], return_inverse=True)
   ....: np.column_stack([uniq, np.bincount(label)])
   ....: 
10000 loops, best of 3: 118 µs per loop
unutbu
  • `Counter(data['A'])` works if `data` is the recarray. In other words, it works on a 1d array of objects. `Counter(np.array(data.tolist(),object).ravel())` is needed to count the whole array (i.e. convert to 2d array of objects and flatten). – hpaulj Jan 28 '17 at 20:32
  • Thanks, @hpaulj. That's certainly better than the `unique/bincount` method I suggested. – unutbu Jan 28 '17 at 21:59