0

I am very aware of this link: Is there a pythonic way to do a contingency table in Pandas? Where two columns are converted to a 2x2 contingency and then fischer exact test can be done easily. What i am struggling with is how can i take the above mentioned solution and apply it to multiple columns. My data looks like this:

Samples A   B   C   D   E   F   G   H   I   J   K   L   M   N   0   P
AA1 1   1   0   0   1   0   0   0   1   1   0   0   1   0   0   0
AA2 1   0   1   0   0   0   0   0   1   0   1   0   0   0   0   0
AA3 1   0   1   0   0   1   0   0   1   0   1   0   0   1   0   0
AA4 0   0   0   0   1   0   0   0   0   0   0   0   1   0   0   0
AA5 0   0   0   1   0   1   0   0   0   0   0   1   0   1   0   0
AA6 1   0   0   0   1   0   0   0   1   0   0   0   1   0   0   0
AA7 1   0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
AA8 1   0   0   0   1   1   0   0   1   0   0   0   1   1   0   0
AA9 0   0   0   0   1   0   0   1   0   0   0   0   1   0   0   1
AA10    1   1   1   0   1   0   0   1   1   1   1   0   1   0   0   1
AA11    1   0   1   1   0   1   0   1   1   0   1   1   0   1   0   1
AA12    1   0   1   0   0   0   1   1   1   0   1   0   0   0   1   1
AA13    0   0   0   0   1   0   1   1   0   0   0   0   1   0   1   1
AA14    1   1   0   0   0   1   1   1   1   1   0   0   0   1   1   1
AA15    1   0   0   1   0   0   1   1   1   0   0   1   0   0   1   1
AA16    0   0   0   0   1   0   1   1   0   0   0   0   1   0   1   1
AA17    1   0   0   0   0   0   1   1   1   0   0   0   0   0   1   1
AA18    1   1   1   0   0   0   0   1   1   1   1   0   0   0   0   1
AA19    1   0   0   1   1   1   0   1   1   0   0   1   1   1   0   1
AA20    0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   1
AA21    1   1   1   0   0   1   1   1   1   1   1   0   0   1   1   1
AA22    0   0   0   1   0   0   1   1   0   0   0   1   0   0   1   1
AA23    0   0   0   0   0   0   0   1   0   0   0   0   0   0   0   1

and i would like an output like the following (note the pvalues below are fake and donot represent the above data):

    A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P
A                                                               
B   0.05                                                            
C   0.9 0.9                                                     
D   0.4 0.8 0.8                                                 
E   0.002   0.001   0.8 0.02                                                
F   0.12    0.67    0.001   0.8 0.6                                         
G   0.9 0.9 0.8 0.01    0.8 0.8                                     
H   0.1 0.22    0.8 0.8 0.8 0.33    0.01                                    
I   0.05    0.05    0.05    0.05    0.05    0.05    0.05    0.05                                
J   0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9                         
K   0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.8 0.9 0.9                     
L   0.02    0.02    0.02    0.02    0.02    0.02    0.02    0.02    0.8 0.02    0.02                    
M   0.12    0.12    0.12    0.12    0.12    0.12    0.12    0.67    0.001   0.8 0.6 0.8             
N   0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.9 0.8 0.01    0.8 0.8 0.8         
O   0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.22    0.8 0.8 0.8 0.33    0.01    0.01        
P   0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.22    0.8 0.8 0.8 0.33    0.01    0.01    0.01    

The code so far is:

import pandas as pd
from scipy.stats import fisher_exact
df = pd.read_table("......")
df.set_index("Samples", inplace=True)
print(df.head())
print(pd.crosstab(df.A, df.B))
tab = pd.crosstab(df.A, df.B)
print(fisher_exact(tab))

This gives me a 2x2 and the p-val but i dont know how to iterate through the columns. Like A vs B, A vs C, A vs D...etc...etc.. and put the p-vals in the output format.

Any guidance is highly appreciated!

RnD
  • 1,172
  • 4
  • 15
  • 25
  • 1
    What have you tried so far, and what didn't work? Have you tried passing more columns into the `crosstab` function from the second answer? What was the result? – G. Anderson Mar 26 '19 at 22:16
  • please look above. – RnD Mar 26 '19 at 23:28
  • @G.Anderson . i also tried to pass more parameters into the crosstab function. It only accepts, 2 to 10 parameters and i have 35. Above is just a dummy dataset shown – RnD Mar 27 '19 at 00:28
  • any comments ?? – RnD Mar 27 '19 at 23:05

1 Answers1

0

Following is the code that does the job.

import pandas as pd
import os
from scipy.stats import fisher_exact

dirpath="...."
df = pd.read_table(".....")
df.set_index("Sample ID", inplace=True)

my_df = pd.DataFrame(index=df.columns, columns=df.columns)
for colout in df.columns:
    for colinner in df.columns:
        if(colout==colinner):
            my_df.at[colout,colinner]=0
        else:
            tab = pd.crosstab(df[colout],df[colinner])
            fish_vals = fisher_exact(tab)
            my_df.at[colout,colinner]=fish_vals[1]
my_df.to_csv(os.path.join(dirpath,'myfile.txt'), sep='\t', encoding='utf-8',quoting=0, index=True)
print(my_df)
RnD
  • 1,172
  • 4
  • 15
  • 25