1

I am using pd.crosstab to count presence/absence data. In the first column, I have several presence counts (represented by 1's), in the second column I have just one 'presence'. Howwever, when I run crosstab on this data that single presence in the second column isn't counted. Could anyone shed some light on why this happening and what I'm doing wrong?

Python v. 3.8.5 Pandas v. 1.2.3 System: MacOS Monterey v. 12.5.1

Column1:

>>> mbx_final['Cmpd1640']
OV745_1A    0
OV745_1B    0
OV745_1C    1
OV745_1D    1
OV745_1E    0
OV745_4A    1
OV745_4B    1
OV745_4C    0
OV22_12A    1
OV22_12B    1
OV22_12C    1
OV22_12D    0
OV22_12E    0
OV22_12F    0
OV22_13A    0
OV22_13B    0
OV22_13C    0
OV86_6A     1
OV86_6D     1
OV86_6E     1
OV86_6F     1
OV86_6G     1
OV86_6H     1
OV86_6I     1
OV86_6J     1
OV86_6K     0
OV86_6L     1
OV86_8A     1
OV86_8B     1
OV86_8C     1
OB1B        1
OB1C        1
SK3A        0
SK3B        0
SK3C        0
SK7A        1
SK7B        0 

Column2:

>>> mgx_final['Otu2409']
OV745_1A    0
OV745_1B    0
OV745_1C    0
OV745_1D    0
OV745_1E    0
OV745_4A    0
OV745_4B    0
OV745_4C    0
OV22_12A    0
OV22_12B    0
OV22_12C    0
OV22_12D    0
OV22_12E    0
OV22_12F    0
OV22_13A    0
OV22_13B    0
OV22_13C    0
OV86_6A     0
OV86_6D     0
OV86_6E     0
OV86_6F     0
OV86_6G     0
OV86_6H     0
OV86_6I     0
OV86_6J     0
OV86_6K     0
OV86_6L     0
OV86_8A     0
OV86_8B     0
OV86_8C     0
OB1A        1
OB1C        0
SK3A        0
SK3B        0
SK3C        0
SK7A        0
SK7B        0

Crosstab command:

contingency_tab = pd.crosstab(mbx_final['Cmpd1640'],mgx_final['Otu2409'],margins=True)

Results:

>>> contingency_tab
Otu2409    0  All
Cmpd1640         
0         15   15
1         21   21
All       36   36

I would expect to see a result like this:

>>> contingency_tab
Otu2409    0   1   All
Cmpd1640         
0         15   0   15
1         21   1   22
All       36   1   37

What am I doing wrong?

1 Answers1

2

You can use the dropna parameter, which is by default set to True. Setting it to False will include columns whose entries are all NaN.

contingency_tab = pd.crosstab(mbx_final['Cmpd1640'],mgx_final['Otu2409'],margins=True, dropna=False)

You can read more on the official documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html


Edit 1:

I've replicated your dataset and code and run the following:

df_in = pd.read_excel("Book1.xlsx", index_col="index")
mbx_final = df_in[["Cmpd1640"]]
mgx_final = df_in[["Otu2409"]]
contingency_tab  = pd.crosstab(mbx_final['Cmpd1640'], mgx_final['Otu2409'], margins=True)
display(contingency_tab)

And I get your expected output:

enter image description here

There might be something wrong with how you're displaying the crosstab function output.

Adam Jaamour
  • 1,326
  • 1
  • 15
  • 31
  • Just tried that and no change to the result? I still don't understand why its missing the "1" in column2? Surely that shouldn't be counted as a Nan? – Sam Nmr Gogga Rainman Nov 21 '22 at 14:28
  • What's the datatype of the column? (You can check with `mbx_final.dtypes`) – Adam Jaamour Nov 21 '22 at 14:30
  • All int64 for both columns. >>> mgx_final.dtypes.unique() array([dtype('int64')], dtype=object) >>> mbx_final.dtypes.unique() array([dtype('int64')], dtype=object) – Sam Nmr Gogga Rainman Nov 21 '22 at 14:35
  • @SamNmrGoggaRainman I've replicated your data and code, and I get your expected output (see my answer update). There might be something wrong with how you output your crosstab output? Are you running it in Jupyter notebook? – Adam Jaamour Nov 21 '22 at 15:31
  • Hi! Thanks for being such a great help with this! I do see that you have handled the data slightly differently - I'm just pointing to different columns in two dataframes for the crosstab, whereas you're slicing the columns out and then pointing to that. I would guess my method is failing somehow. Let me try slicing and see what happens! – Sam Nmr Gogga Rainman Nov 21 '22 at 15:46
  • Update: Your route worked but also helped me identify the root problem! I was using row names as my indices and the indices didn't match, leading to chaos! Thanks again for your help!! :D – Sam Nmr Gogga Rainman Nov 21 '22 at 16:00
  • @SamNmrGoggaRainman I would just have both columns in one single dataframe, that'd be better! – Adam Jaamour Nov 21 '22 at 16:04