For each row of a DataFrame I would like to compute the number of unique values in columns A and B across that row and a reference row within the group identified by another column, ID. Here is a toy dataset:

import pandas as pd

d = {'ID': pd.Series([1, 1, 1, 2, 2, 2, 2, 3, 3]),
     'A': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9]),
     'B': pd.Series([1, 2, 3, 4, 11, 12, 13, 14, 15]),
     'REFERENCE': pd.Series([1, 0, 0, 0, 0, 1, 0, 1, 0])}
data = pd.DataFrame(d)

The data looks like this:

In [3]: data
Out[3]: 
   A   B  ID  REFERENCE
0  1   1   1          1
1  2   2   1          0
2  3   3   1          0
3  4   4   2          0
4  5  11   2          0
5  6  12   2          1
6  7  13   2          0
7  8  14   3          1
8  9  15   3          0

Now, within each group defined by ID, I want to compare each record with the group's reference record (the row where REFERENCE == 1) and compute the number of unique A and B values across the two. For instance, the value for record 3 is len(set([4, 4, 6, 12])), which gives 3, since record 5 is the reference for ID 2. The result should look like this:

   A   B  ID  REFERENCE  CARDINALITY
0  1   1   1          1            1
1  2   2   1          0            2
2  3   3   1          0            2
3  4   4   2          0            3
4  5  11   2          0            4
5  6  12   2          1            2
6  7  13   2          0            4
7  8  14   3          1            2
8  9  15   3          0            3

The only way I can think of implementing this is with for loops that iterate over each group and then over each record within the group, comparing it against the reference record (see the sketch below). This is unpythonic and very slow. Can anyone suggest a vectorized approach that achieves the same result?
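For concreteness, here is a rough sketch of that loop-based version, assuming the toy data above; it is just meant to illustrate the logic I would like to vectorize:

# Loop over each ID group, find its reference row, and compare every row
# against it by counting distinct values among the four A/B entries.
cardinality = {}
for _, group in data.groupby('ID'):
    ref = group.loc[group['REFERENCE'] == 1].iloc[0]  # the group's reference row
    for idx, row in group.iterrows():
        cardinality[idx] = len({row['A'], row['B'], ref['A'], ref['B']})

data['CARDINALITY'] = pd.Series(cardinality)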

sriramn
  • It looks like your data belongs in groups. Have you thought of pre-grouping the data prior to loading into Pandas? – ericmjl Feb 19 '15 at 18:51
  • @ericmjl Do you mean have as many DataFrames as there are groups? Is it not better to just do something like `data.groupby('ID')`? – sriramn Feb 19 '15 at 19:14
  • I was going to suggest exactly what you mentioned. – ericmjl Feb 20 '15 at 20:08

1 Answer

I would create a new column that combines A and B into a tuple, then group by ID, build groups = dict(list(groupby)), and get the length of each frame using len().
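A rough sketch of that idea, assuming the toy data from the question (the AB column and groups names are just illustrative):

# Combine A and B into a single tuple column.
data['AB'] = list(zip(data['A'], data['B']))

# Group by ID and materialize the groups as a dict of sub-frames.
groups = dict(list(data.groupby('ID')))

# Length of each group's frame.
sizes = {key: len(frame) for key, frame in groups.items()}
print(sizes)  # {1: 3, 2: 4, 3: 2}
# Note: this yields group sizes; the comparison against each group's
# reference row from the question is not handled here.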

Skorpeo