Frequency count using itertools.groupby() with recarray

Question

The code goes something like this:

>>> import pandas as pd
>>> data = pd.DataFrame({'P': ['p1', 'p1', 'p2'],
...                      'Q': ['q1', 'q2', 'q1'],
...                      'R': ['r1', 'r1', 'r2']})

>>> data

  P  Q  R
0 p1 q1 r1
1 p1 q2 r1
2 p2 q1 r2

>>> data.groupby(['R'] + ['P', 'Q']).size().unstack(['P', 'Q'])

After reindexing and fillna(0) it gives the following result:

P  p1      p2
Q  q1  q2  q1  q2
R
r1  1   1   0   0
r2  0   0   1   0
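The reindexing and fillna(0) steps are mentioned but not shown. A minimal sketch of how they might look, assuming the missing (P, Q) pairs are added as columns; the names all_columns and result are illustrative and not from the original post:

```python
import pandas as pd

data = pd.DataFrame({'P': ['p1', 'p1', 'p2'],
                     'Q': ['q1', 'q2', 'q1'],
                     'R': ['r1', 'r1', 'r2']})

# Count each (R, P, Q) row, then pivot P and Q into the columns
counts = data.groupby(['R', 'P', 'Q']).size().unstack(['P', 'Q'])

# Every possible (P, Q) pair, including ones that never occur in the data
all_columns = pd.MultiIndex.from_product(
    [sorted(data['P'].unique()), sorted(data['Q'].unique())], names=['P', 'Q'])

# Add the missing columns, then turn the resulting NaN values into 0
result = counts.reindex(columns=all_columns).fillna(0).astype(int)
print(result)
```

The reindex step is what introduces the (p2, q2) column, which never occurs in the data and so would otherwise be absent from the table.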

I wanted to do the same with recarray so I imported itertools and tried the following:

>>> import numpy as np
>>> from itertools import groupby
>>> data = np.array([('p1', 'p1', 'p2'), ('q1', 'q2', 'q1'), ('r1', 'r1', 'r2')],
...                 dtype=[('P', object), ('Q', object), ('R', object)]).view(np.recarray)

>>> groupby(data, key=(['R'] + ['P', 'Q'])).size().unstack(['P', 'Q'])

It doesn't work. How do I achieve a similar result without using pandas?

hpaulj · Accepted Answer · 2017-01-31T07:23:55.273

Let's back away from the fancy recarray and object type. It doesn't buy us anything.

The data can be a simple 2d array of strings:

In [711]: data = np.array([('p1', 'p1', 'p2'), ('q1', 'q2', 'q1'), ('r1', 'r1', 'r2')])
In [712]: data
Out[712]: 
array([['p1', 'p1', 'p2'],
       ['q1', 'q2', 'q1'],
       ['r1', 'r1', 'r2']], 
      dtype='<U2')

Better yet, make it a list of lists:

In [713]: data.tolist()
Out[713]: [['p1', 'p1', 'p2'], ['q1', 'q2', 'q1'], ['r1', 'r1', 'r2']]

itertools.groupby is designed to work with iterables such as lists. It can operate on arrays simply because it can iterate over them.

Explain how you want to group these strings.

The pandas groupby expression is not self-explanatory.

If I simply flatten the data array, I can group sequential values and count them:

In [726]: data.ravel()
Out[726]: 
array(['p1', 'p1', 'p2', 'q1', 'q2', 'q1', 'r1', 'r1', 'r2'], 
      dtype='<U2')
In [727]: g=itertools.groupby(data.ravel())
In [728]: [(k,list(v)) for k,v in g]
Out[728]: 
[('p1', ['p1', 'p1']),
 ('p2', ['p2']),
 ('q1', ['q1']),
 ('q2', ['q2']),
 ('q1', ['q1']),
 ('r1', ['r1', 'r1']),
 ('r2', ['r2'])]
In [729]: g=itertools.groupby(data.ravel())
In [730]: [(k,len(list(v))) for k,v in g]
Out[730]: [('p1', 2), ('p2', 1), ('q1', 1), ('q2', 1), ('q1', 1), ('r1', 2), ('r2', 1)]
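Note that 'q1' appears twice in the output above, because itertools.groupby only merges consecutive equal items. A minimal sketch, assuming total counts per string are what is wanted: either sort before grouping, or use collections.Counter (the list flat below just repeats the values of data.ravel()):

```python
import itertools
from collections import Counter

# Same values as data.ravel() in the session above
flat = ['p1', 'p1', 'p2', 'q1', 'q2', 'q1', 'r1', 'r1', 'r2']

# groupby merges only adjacent equal items, so sort first to get one group per value
sorted_counts = [(k, len(list(v))) for k, v in itertools.groupby(sorted(flat))]

# Counter produces the same totals without any sorting
counter_counts = Counter(flat)

print(sorted_counts)
print(counter_counts['q1'])
```

Counter is usually the simpler choice when only the totals matter; groupby is the better fit when runs of consecutive values are themselves of interest.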

=============

Extending my answer to work row-wise

In [738]: grps = [itertools.groupby(row) for row in data]
In [739]: [[(k, len(list(v))) for k,v in r] for r in grps]
Out[739]: 
[[('p1', 2), ('p2', 1)],
 [('q1', 1), ('q2', 1), ('q1', 1)],
 [('r1', 2), ('r2', 1)]]

This works for the object recarray version of data as well.

Oops - I misunderstood your 'row-wise' description. Even rereading your last comment I don't understand what you want. It doesn't sound like an itertools.groupby problem at all. I thought you were counting strings like 'r1' and 'q2'. Apparently that's not the case.

====================

OK, a more focused attempt to recreate the pandas table

Use itertools.product to generate 8 combinations of these 6 strings:

In [847]: pos = list(itertools.product(['r1','r2'],['p1','p2'],['q1','q2']))
In [848]: pos
Out[848]: 
[('r1', 'p1', 'q1'),
 ('r1', 'p1', 'q2'),
 ('r1', 'p2', 'q1'),
 ('r1', 'p2', 'q2'),
 ('r2', 'p1', 'q1'),
 ('r2', 'p1', 'q2'),
 ('r2', 'p2', 'q1'),
 ('r2', 'p2', 'q2')]

Convert the dataframe to a list of lists, reordering the columns to R, P, Q:

In [849]: val=data.values[:,[2,0,1]].tolist()
In [850]: val
Out[850]: [['r1', 'p1', 'q1'], ['r1', 'p1', 'q2'], ['r2', 'p2', 'q1']]

Find which of the possible combinations are found in val:

In [852]: [[i, list(i) in val] for i in pos]
Out[852]: 
[[('r1', 'p1', 'q1'), True],
 [('r1', 'p1', 'q2'), True],
 [('r1', 'p2', 'q1'), False],
 [('r1', 'p2', 'q2'), False],
 [('r2', 'p1', 'q1'), False],
 [('r2', 'p1', 'q2'), False],
 [('r2', 'p2', 'q1'), True],
 [('r2', 'p2', 'q2'), False]]

Rework the 'counts' as a 2x4 0/1 array:

In [853]: np.array([[list(i) in val] for i in pos]).reshape(2,-1).astype(int)
Out[853]: 
array([[1, 1, 0, 0],
       [0, 0, 1, 0]])
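The membership test above records only presence (0 or 1), so a repeated row would still show as 1. A sketch of a variant that counts real frequencies, assuming the input is a structured array of object dtype with one record per dataframe row (fields P, Q, R); the names records, levels, columns and table are illustrative, not part of the original answer:

```python
import itertools
from collections import Counter

import numpy as np

# One record per row of the original dataframe (an assumption about the intended layout)
records = np.array(
    [('p1', 'q1', 'r1'), ('p1', 'q2', 'r1'), ('p2', 'q1', 'r2')],
    dtype=[('P', object), ('Q', object), ('R', object)],
).view(np.recarray)

# Count each (R, P, Q) combination that actually occurs
counts = Counter(zip(records['R'], records['P'], records['Q']))

# Sorted unique values per field define the row and column labels
levels = {name: sorted(set(records[name])) for name in ('R', 'P', 'Q')}
columns = list(itertools.product(levels['P'], levels['Q']))

# Rows are R values, columns are (P, Q) pairs; a Counter returns 0 for unseen keys
table = np.array([[counts[(r, p, q)] for p, q in columns] for r in levels['R']])
print(table)
```

Because a Counter returns 0 for keys it has never seen, combinations absent from the data need no special handling.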

Sorry there was a typo. I've replaced 'R': ['q1', 'q2', 'q1'] with 'R': ['r1', 'r1', 'r2']. Grouping is done row wise. For instance, there is a row 'p1 q1 r1' and it occurs once so there is a one corresponding to it in the output. The output shows all the possible combinations of rows that can occur. It displays the frequency of these combinations if they exist and zero if they dont. Here, 'p1 q1 r2' does not exist so there is a zero corresponding to it. I'm trying for a similar output without using pandas. Also, the input needs to be a structured array of object datatype. @hpaulj — geedee, Jan 30 '17 at 09:04
It's still not clear what you want. This isn't a `recarray` or `itertools.groupby` problem. — hpaulj, Jan 30 '17 at 19:58
By rows I meant the rows in the dataframe, the ones in the output. So these are the rows I was talking about- (p1 q1 r1), (p1 q2 r1) and (p2 q1 r2). All I wanted was an output similar to the result above without using pandas. itertools.groupby() was just something I tried that didn't workout. My apologies if it wasn't clear. Still a great answer though, I got to learn something new. Thanks- @hpaulj — geedee, Jan 31 '17 at 05:53
OK, i've recreated your 2x4 table by searching a set of 8 possible combinations of these strings. — hpaulj, Jan 31 '17 at 07:23
