python - Proper way to find correlations between features containing missing data

Asked Oct 04 '17 at 21:29

Active Oct 05 '17 at 01:20

Viewed 336 times

1) Most of the features are NOT normally distributed

I just realized that the SciPy pearsonr I used assumes normally distributed data, according to its documentation (which is weird indeed).
The numpy.corrcoef documentation says nothing about such a requirement. Should I use it instead? Any other recommendations?
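
A quick check suggests the two functions return the same Pearson coefficient. As far as I can tell, the normality assumption affects the p-value that pearsonr reports, not r itself. The skewed toy data below is made up for illustration and is not the dataset from this question.

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up, deliberately non-normal (exponential) data
rng = np.random.default_rng(42)
x = rng.exponential(size=500)
y = x + rng.exponential(size=500)

r_scipy, p_value = pearsonr(x, y)      # coefficient plus a p-value
r_numpy = np.corrcoef(x, y)[0, 1]      # off-diagonal entry of the 2x2 matrix

print(r_scipy, r_numpy)                # the two coefficients should agree
```

So switching to numpy.corrcoef would not change the coefficient, it would only drop the p-value.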

2) Features contain missing data (less than 50% per feature), and I'd like to fill it in later. I can't find out how those modules handle None values. (All I can see is that, compared with the same dataset filled with medians, a pearsonr > 0.7 threshold finds 71 correlated features instead of 200+.)

Everything is stored in a pandas DataFrame, so technically I pass list(df.column_name) to pearsonr.
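
One way to take the guesswork out of the missing values is to drop incomplete rows per pair of columns before calling pearsonr. This is only a sketch: the helper name pairwise_pearson and the toy columns "a" and "b" are hypothetical, not part of SciPy or of the real dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def pairwise_pearson(df, col_a, col_b):
    """Pearson r computed only on rows where both columns are present."""
    pair = df[[col_a, col_b]].dropna()
    r, _p_value = pearsonr(pair[col_a], pair[col_b])
    return r, len(pair)

# Toy data; None in a numeric column is stored as NaN by pandas
df = pd.DataFrame({
    "a": [1.0, 2.0, None, 4.0, 5.0, 6.0],
    "b": [2.1, 3.9, 6.2, None, 10.1, 11.8],
})

r, n_used = pairwise_pearson(df, "a", "b")
print(r, n_used)

# For comparison: numpy.corrcoef does not skip NaN, the result is NaN
print(np.corrcoef(df["a"], df["b"])[0, 1])
```

Returning the number of rows actually used makes it easier to spot pairs whose coefficient rests on very few observations.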

Update: Well, I found pandas.DataFrame.corr. It:

offers pearson, kendall, and spearman
clearly says it excludes NA/null values
says nothing about requirements or other details
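
A minimal sketch of that approach, assuming made-up data: the column names, the roughly 20% missing rate, and min_periods=30 are illustrative choices, while the 0.7 threshold is the one mentioned above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x = rng.exponential(size=n)                 # deliberately non-normal
df = pd.DataFrame({
    "x": x,
    "x_squared": x ** 2,                    # monotonic in x
    "noise": rng.normal(size=n),
})
# Knock out roughly 20% of the values in each column
df = df.mask(rng.random(df.shape) < 0.2)

# Pairwise-complete Spearman correlation; pairs with fewer than
# min_periods shared observations come back as NaN
corr = df.corr(method="spearman", min_periods=30)

# Keep each pair once (upper triangle, no diagonal), then threshold
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
strong = pairs[pairs.abs() > 0.7]
print(strong)
```

Because Spearman works on ranks, the nonlinear but monotonic x / x_squared pair is picked up even though neither column is normally distributed.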

It's too tempting to pass up, so I'll go with it (and with Spearman, as my friend who knows mathematical statistics recommends). Nevertheless, lazy pandas aren't always there for you.

edited Oct 05 '17 at 01:20

asked Oct 04 '17 at 21:29

Acia Delilah

0 Answers