
Here's my pandas dataframe with my data:

       c0  c1  c10  c11  c12  c13  c14  c15          c16   c3     c4  c5  c6   c7  c8   c9
index
0       1  49  2.0    0    2    2    0    1  6797.761892  130  269.0   0   1  163   0  0.0
1       0  61  0.0    1    2    2    1    3  4307.686943  138  166.0   0   0  125   1  3.6
2       0  46  0.0    2    3    2    0    1  4118.077502  140  311.0   0   1  120   1  1.8
3       0  69  1.0    3    3    2    1    0  7170.849469  140  254.0   0   0  146   0  2.0
4       0  51  1.0    0    2    2    1    0  5579.040145  100  222.0   0   1  143   1  1.2
...    ..  ..  ...  ...  ...  ...  ...  ...          ...  ...    ...  ..  ..  ...  ..  ...
283     0  54  0.0    1    2    2    2    0  6293.123474  125  273.0   0   0  152   0  0.5
284     0  42  0.0    0    3    2    0    1  3303.841931  120  240.0   1   1  194   0  0.8
285     1  67  0.0    2    2    2    1    0  3383.029119  106  223.0   0   1  142   0  0.3
286     0  67  1.0    2    3    2    0    2   768.900795  125  254.0   1   1  163   0  0.2
287     0  60  0.0    1    3    2    0    0  1508.832825  130  253.0   0   1  144   1  1.4

288 rows × 16 columns

I've used statsmodels to obtain the p-values:

import pandas as pd
import statsmodels.api as sm

log = sm.Logit(df['c0'], df.loc[:, df.columns != 'c0']).fit()
d1 = pd.DataFrame(index=log.pvalues.index, data=log.pvalues, columns=['statsmodels_pvalue'])
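
Side note (an aside, not part of the original snippet): sm.Logit does not add an intercept by default, so the model above is fit without a constant term. If an intercept is wanted, the design matrix can be passed through sm.add_constant first, for example:

# Variant of the fit above with an explicit intercept;
# sm.add_constant prepends a column named 'const'.
X = sm.add_constant(df.loc[:, df.columns != 'c0'])
log = sm.Logit(df['c0'], X).fit()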

Then I've also used scipy. The pearsonr function returns the correlation coefficient and the p-value, so I'm appending element [1] of the return value, as you can see:

from scipy.stats import pearsonr

index = []
output = []
for i in df.columns[1:]:                            # every column except 'c0'
    index.append(i)
    output.append(pearsonr(df['c0'], df[i])[1])     # [1] is the p-value
d2 = pd.DataFrame(index=index, data=output, columns=['pearson_pvalue'])

pd.concat([d1,d2], axis=1)

Results:

     statsmodels_pvalue  pearson_pvalue
c1             0.155704        0.105977
c10            0.449688        0.697069
c11            0.041694        0.038457
c12            0.000269        0.000510
c13            0.012123        0.046765
c14            0.000114        0.000087
c15            0.587200        0.843444
c16            0.301656        0.025142
c3             0.434319        0.330075
c4             0.000163        0.000014
c5             0.792058        0.613432
c6             0.340877        0.454607
c7             0.843758        0.562002
c8             0.365109        0.030531
c9             0.238975        0.070500
Al777

1 Answer

A few points on the statistics (you can also check out posts like this one). The Pearson correlation test and linear regression give equivalent p-values only when you fit with an intercept and use a single predictor. You are doing a multiple regression, so that equivalence does not hold. Lastly, the equivalence is with ordinary least squares, not logistic regression.

The code below reproduces the p-values for both:

from scipy.stats import pearsonr
import pandas as pd
import numpy as np
import statsmodels.api as sm

np.random.seed(111)

df = pd.DataFrame({'c0': np.random.uniform(0, 1, 50),
                   'c1': np.random.uniform(0, 1, 50),
                   'c2': np.random.uniform(0, 1, 50)})

variables = df.columns[1:]
output = []
for i in variables:
    # simple OLS: intercept plus a single predictor at a time
    lm = sm.OLS(df['c0'], sm.add_constant(df.loc[:, i])).fit()
    lm_p = lm.pvalues.iloc[1]                   # p-value of the slope
    pearson_p = pearsonr(df['c0'], df[i])[1]    # p-value of the correlation
    output.append([lm_p, pearson_p])

pd.DataFrame(output, index=variables, columns=['lm_p', 'pearson_p'])

    lm_p    pearson_p
c1  0.062513    0.062513
c2  0.781529    0.781529
StupidWolf
  • In my case, my Y variable is binary (two classes). How do I apply your line of thinking to my problem? Should I still treat it as a number (OLS) instead of a two-class problem (Logit)? – Al777 Nov 11 '20 at 15:08
  • Hey, your question is why the p-values differ, and the answer is that you are using different regressions. If you want to find out which predictors are associated with your Y variable, then you should use logistic regression. Hope this is clear. – StupidWolf Nov 11 '20 at 15:10
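
For completeness, here is a minimal sketch of the per-predictor logistic approach suggested in the last comment (illustrative, not from the original post: it assumes the df from the question, and disp=0 simply silences the fit messages):

# One logistic regression per predictor, each with an intercept,
# collecting the p-value of the predictor term.
output = []
for i in df.columns[1:]:
    X = sm.add_constant(df[i])                 # intercept + single predictor
    logit = sm.Logit(df['c0'], X).fit(disp=0)  # disp=0 silences convergence output
    output.append(logit.pvalues.iloc[1])       # p-value of the predictor term

pd.DataFrame(output, index=df.columns[1:], columns=['logit_pvalue'])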