1) Most of the features are NOT normally distributed
- I just realized that the SciPy pearsonr I used requires normally distributed data (which is indeed weird).
- The numpy.corrcoef description says nothing about such requirements. Should I use it instead? Any other recommendations?
2) Features contain missing data (less than 50% per feature), and I'd like to fill it in later. I can't find how those modules handle None values. (I only see that, compared with the same dataset filled with medians, the pearsonr > 0.7 threshold finds 71 correlated features instead of 200+.)
Everything is stored in a pandas DataFrame, so technically I pass list(df.column_name) to pearsonr.
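To make point 2 concrete, here is a minimal sketch, assuming the missing entries end up as NaN (pandas converts None to NaN in numeric columns). numpy.corrcoef does not skip them, so the coefficient comes out as NaN unless incomplete pairs are dropped first. The column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; None becomes NaN in a float column.
df = pd.DataFrame({
    "a": [1.0, 2.0, None, 4.0, 5.0],
    "b": [2.0, 4.1, 6.0, 8.2, None],
})

# NaN propagates through numpy.corrcoef: the off-diagonal entry is NaN.
r_raw = np.corrcoef(df["a"], df["b"])[0, 1]

# Keep only the rows where both columns are present, then correlate.
pair = df[["a", "b"]].dropna()
r_clean = np.corrcoef(pair["a"], pair["b"])[0, 1]

print(r_raw, r_clean)
```

Dropping rows per pair of columns (rather than once for the whole frame) keeps as many observations as possible for each coefficient, which matters when up to half of a feature is missing.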
UPD: Well, found pandas.DataFrame.corr:
- offers pearson, kendall, and spearman
- clearly says it excludes NA/null values
- says nothing about requirements or other details
It's too tempting to resist, so I'll go with it (and with Spearman, as a friend of mine who knows mathematical statistics recommends). But nevertheless - lazy pandas aren't always there for you.
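For reference, a rough sketch of what that choice could look like, not a definitive recipe. The toy columns are invented, and min_periods=4 is an arbitrary assumption added here so that coefficients based on too few complete pairs come out as NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; real features would come from the actual DataFrame.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [1.0, 4.0, 9.0, np.nan, 25.0, 36.0],  # monotonic in x, but not linear
    "z": [3.0, 1.0, np.nan, 2.0, 6.0, 4.0],
})

# Rank-based (Spearman) correlation; NaN is excluded pair by pair.
# min_periods requires at least 4 complete pairs per coefficient.
corr = df.corr(method="spearman", min_periods=4)

# Keep each pair once (upper triangle, no diagonal) and apply the 0.7 threshold.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
strong = corr.where(upper).stack()
strong = strong[strong.abs() > 0.7]

print(strong)
```

Spearman works on ranks, so it only assumes a monotonic relationship and makes no normality assumption about the features themselves.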