Clean alternative to pandas' broken crosstab for NA values

Question

I'm trying to get a function that works similarly to R's table function, for which the parameter useNA allows me to include NA values in the cross table.

Here's a small example:

df = pd.DataFrame({"a": [0, 1, pd.NA, pd.NA], "b":[2, pd.NA, 3, pd.NA]})
print(pd.crosstab(df["a"], df["b"], dropna=False)

What I get from that is

b  2  3
a      
0  1  0

But I'd want it to be something like

b   2  3  NA
a      
0   1  0  0
1   0  0  1
NA  0  1  1

Not only does this ignore 3 quarters of the lines in the dataframe, the result is also dependant on the order the two series were inserted in, here's pd.crosstab(df["b"], df["a"], dropna=False):

a  0  1
b      
2  1  0

A workaround I can think of would be to look at the unique values in the two series and create a new value that is in neither of them and use it to temporarily replace NA values with it using fillna, but this feels very botchy and I'd be surprised if there weren't already something that cleanly does exactly what I'm looking for.

Plus, that solution wouldn't work as intended in situations where one of the two series had no missing data.

Edit: Adding an example to illustrate the last part.

df = pd.DataFrame({"a": [0, 1, 2, 3], "b":[2, pd.NA, 3, pd.NA]})
print(pd.crosstab(df["a"].fillna("NA"), df["b"].fillna("NA"), dropna=False)

Outputs:

b  2  3  NA
a          
0  1  0   0
1  0  0   1
2  0  1   0
3  0  0   1

Expected:

b   2  3  NA
a          
0   1  0   0
1   0  0   1
2   0  1   0
3   0  0   1
NA  0  0   0

jezrael · Accepted Answer · 2021-12-02T11:11:57.280

You can replace missing values to NA:

print(pd.crosstab(t["a"].fillna('NA'), t["b"].fillna('NA')))
b   2  3  NA
a           
0   1  0   0
1   0  0   1
NA  0  1   1

EDIT: Add new row filled by NA and then subtract 1 from intersection NA, NA:

t = pd.DataFrame({"a": [0, 1, 2, 3], "b":[2, pd.NA, 3, pd.NA]})

df = t.append(pd.DataFrame('NA', index=[-1], columns=t.columns)).fillna('NA')
df = pd.crosstab(df["a"], df["b"])
df.loc['NA','NA'] -= 1
print(df)
b   2  3  NA
a           
0   1  0   0
1   0  0   1
2   0  1   0
3   0  0   1
NA  0  0   0

With groupby.size + stack is possible use:

t = pd.DataFrame({"a": [0, 1, 2, 3], "b":[2, pd.NA, 3, pd.NA]})

df = t.append(pd.DataFrame(np.nan, index=[-1], columns=t.columns))
df = df.groupby(['a', 'b'], dropna = False).size().unstack(fill_value=0)
df.loc[np.nan,np.nan] -= 1
print(df)
b    2.0  3.0  NaN
a                 
0.0    1    0    0
1.0    0    0    1
2.0    0    1    0
3.0    0    0    1
NaN    0    0    0

This is what I was thinking of and mentioned in the last part of the post, but if one column has no missing data, NA will not show up in the results for that column. I can edit the original post to reflect that with another example. — Uretki, Dec 02 '21 at 10:41

score 2 · Answer 2 · answered Dec 02 '21 at 10:44

2

crosstab is a convenience option, wrapped around pd.pivot_table; you could go straight to groupby (pd.pivot_table is a wrapper around groupby) and replicate your output:

df.groupby(['a', 'b'], dropna = False).size().unstack(fill_value=0)

b    2.0  3.0  NaN
a                 
0.0    1    0    0
1.0    0    0    1
NaN    0    1    1

It would also be helpful, if you could maybe make a PR on pandas to improve the crosstab functionality

answered Dec 02 '21 at 10:44

sammywemmy

27,093
4
17
31

This looks cleaner than with fillna, but it's still missing the case where one series has no missing data and as such has no NA value. Both answers still answer the most important part of the question I guess so I'll accept it if nothing comes up / I can't think of a solution. – Uretki Dec 02 '21 at 11:00
Forgot to add, regarding the PR, I couldn't pretend I'd be able to come up with a clean optimized solution and submit it. If you meant report the issue it's already been done. – Uretki Dec 02 '21 at 11:03
you are right, since one series does not have NA, it is not included in there; not sure of a way to handle it; for the PR, it doesnt matter for optimized ... i believe in baby steps ... we are all learners – sammywemmy Dec 02 '21 at 11:04
You could probably reindex again; but at this point, it starts getting gnarly in my opinion – sammywemmy Dec 02 '21 at 11:05

Clean alternative to pandas' broken crosstab for NA values

2 Answers2