
1) Most of the features are NOT normally distributed

2) Features contain missing data (less than 50% per feature), and I'd like to fill it in later. I can't find how those modules handle None values. (I only see that, compared with the same dataset filled with medians, pearsonr > 0.7 finds 71 correlated features instead of 200+.)

Everything is stored in a pandas DataFrame, so technically I pass list(df.column_name) to pearsonr.
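For reference, a minimal sketch of why the two treatments of missing values can disagree. It does not show what pearsonr itself does with None. It computes Pearson's r with plain NumPy, once on only the rows where both values are present and once after median filling. The helper name `pairwise_pearson` and the toy columns `a` and `b` are made up for illustration.

```python
import numpy as np
import pandas as pd


def pairwise_pearson(x, y):
    """Pearson r over the positions where both x and y are present."""
    x = pd.to_numeric(pd.Series(x), errors="coerce").to_numpy(dtype=float)
    y = pd.to_numeric(pd.Series(y), errors="coerce").to_numpy(dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))  # drop a pair if either side is missing
    return float(np.corrcoef(x[keep], y[keep])[0, 1])


# Hypothetical toy data: b is exactly 2 * a wherever both are observed.
df = pd.DataFrame({
    "a": [1.0, 2.0, None, 4.0, 5.0, 6.0],
    "b": [2.0, None, 6.0, 8.0, 10.0, 12.0],
})

# Treatment 1: ignore incomplete pairs.
r_dropped = pairwise_pearson(list(df["a"]), list(df["b"]))

# Treatment 2: fill each column with its own median first.
filled = df.fillna(df.median())
r_filled = pairwise_pearson(list(filled["a"]), list(filled["b"]))

print(r_dropped, r_filled)
```

On this toy data the filled version gives a weaker coefficient than the pairwise one, so a fixed cutoff such as 0.7 can select a different set of features depending on which treatment is used.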


UPD: Well, found pandas.DataFrame.corr:

  • offers pearson, kendall, and spearman
  • clearly says it's excluding NA/null values
  • says nothing about requirements or other details
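A rough sketch of how that call might look on a frame with gaps, assuming toy columns `a`, `b`, `c` (made up here) and the 0.7 cutoff from above. The `min_periods=3` value and the choice to compare absolute values are my own assumptions, not something the docs prescribe.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [1.0, 4.0, 9.0, np.nan, 25.0, 36.0],
    "c": [6.0, 5.0, 4.0, 3.0, np.nan, 1.0],
})

# Rank correlation; each pair of columns uses only the rows where both are present.
# min_periods sets the minimum number of such rows needed to report a value.
corr = df.corr(method="spearman", min_periods=3)

# Keep each pair once: upper triangle, diagonal excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the cutoff.
strong = pairs[pairs.abs() > 0.7]
print(strong)
```

Since every pair is computed on its own set of complete rows, different cells of the matrix can be based on different numbers of observations.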

It's too tempting to pass up, so I'll go with it (and with Spearman, as my friend who knows math stats recommends). But nevertheless, lazy pandas aren't always there for you.

Acia Delilah
