2

I have a Pandas (0.23.4) DataFrame with several categorical columns.

df = pd.DataFrame(np.random.choice([True, False, np.nan], (6,4)), columns = ['a','b','c','d'])

     a    b    c    d
0  NaN  1.0  NaN  NaN
1  NaN  1.0  NaN  0.0
2  1.0  NaN  1.0  NaN
3  0.0  NaN  0.0  1.0
4  NaN  1.0  NaN  NaN
5  NaN  1.0  0.0  1.0

I have two sets of columns of interest:

cross_cols = ['a', 'b']
type_cols = ['c', 'd']

I would like to get a cross tab of counts of each cross_col variable with each type_col variable (a with c and d, and b with c and d), excluding NaN, all displayed side-by-side. The desired result is:

         c             d        
       0.0  1.0  All 0.0 1.0 All
a 0.0  0    0    0     1   1   2
  1.0  2    1    3     1   0   1
  All  2    1    3     2   1   3
b 0.0  0    0    0     0   1   1
  1.0  2    1    3     2   0   2
  All  2    1    3     2   1   3

Notice that I am not interested in counts for different combinations of a and b or of c and d, which is what I'm getting by changing the index and columns parameters of pd.crosstab.

Currently I'm using the following code:

cross_rows = []
for col in cross_cols:
    cross_rows.append(pd.concat([pd.crosstab(df[col], df[type_var],margins=True) for type_var in type_cols],axis=1,keys = type_cols,sort=True))
results = pd.concat(cross_rows, keys = cross_cols,sort=True)

It gives the following result:

         c             d        
c      0.0  1.0  All 0.0 1.0 All
a 1.0  2.0  1.0  3.0   1   0   1
  All  2.0  1.0  3.0   2   1   3
  0.0  NaN  NaN  NaN   1   1   2
b 1.0  2.0  1.0  3.0   2   0   2
  All  2.0  1.0  3.0   2   1   3
  0.0  NaN  NaN  NaN   0   1   1

The result is fine, but the code is slow and a bit ugly. I suspect that there's a faster and more Pythonic approach. Is there a single function call that would get the job done, or another faster solution?

perigon
  • 2,160
  • 11
  • 16

0 Answers0