Pandas dataframe to count matrix

Question

This must be obvious, but I couldn't find an easy solution.

I have pandas DataFrame like this:

actual | predicted
------ + ---------
Apple  | Apple
Apple  | Apple
Apple  | Banana
Banana | Orange
Orange | Apple

I want this:

       |  Apple  | Banana  | Orange
------ + ------- + ------- + -------
Apple  |  2      | 1       | 0
Banana |  0      | 0       | 1
Orange |  1      | 0       | 0

score 25 · Accepted Answer · answered Nov 28 '16 at 08:31

25

You can use groupby with aggregating size and unstack MultiIndex:

df = df.groupby(['actual','predicted']).size().unstack(fill_value=0)
print (df)
predicted  Apple  Banana  Orange
actual                          
Apple          2       1       0
Banana         0       0       1
Orange         1       0       0

Another solution with crosstab:

df = pd.crosstab(df.actual, df.predicted)
print (df)
predicted  Apple  Banana  Orange
actual                          
Apple          2       1       0
Banana         0       0       1
Orange         1       0       0

answered Nov 28 '16 at 08:31

jezrael

822,522
95
1,334
1,252

3

Accepted for `crosstab`. Unstacking size() seems a bit like a hack to me. – Gregor Sturm Nov 28 '16 at 08:54
Thank you for accepting! First solution is faster, but I agree it is a bit hack. – jezrael Nov 28 '16 at 08:55
1

`In [142]: %timeit pd.crosstab(df.actual, df.predicted) 100 loops, best of 3: 5.04 ms per loop` and `In [143]: %timeit df.groupby(['actual','predicted']).size().unstack(fill_value=0) 1000 loops, best of 3: 1.28 ms per loop` – jezrael Nov 28 '16 at 09:14
2

@GregorSturm not a hack at all. You are counting the number of times you see each combination. To get the matrix, you unstack and fill nulls with zero. I'm sure if you looked under the hood of `crosstab` your see much the same thing. – piRSquared Nov 28 '16 at 10:05
@piRSquared - Thank you. I am still surprised `crosstab` is slowier. But Ithink there is difference in implementation of functions. – jezrael Nov 28 '16 at 10:07
1

+1 for the first solution involving `groupby()`, because it allows me to do `sum()` instead of `size()`, whereas the `crosstab()` solution does not. Thanks! – tommy.carstensen Aug 30 '18 at 23:49

score 3 · Answer 2 · answered Nov 28 '16 at 08:32

You can use pandas.pivot_table

>>> df.pivot_table(index='actual', columns='predicted', aggfunc=len).fillna(0).astype('int')
predicted Apple Banana Orange
actual                       
Apple         2      1      0
Banana        0      0      1
Orange        1      0      0

score 2 · Answer 3 · answered Nov 28 '16 at 08:32

2

Bit of a shot in the dark, but I think you are looking for a confusion matrix

from sklearn.metrics import confusion_matrix
print confusion_matrix(df['actual'], df['predicted'])

answered Nov 28 '16 at 08:32

Vivek Kalyanarangan

8,951
1
23
42

Pandas dataframe to count matrix

3 Answers3