I'm using ydata-profiling (the successor to pandas-profiling) to compute correlations among the columns of large datasets (e.g. 400411 rows and 27 columns). These are the correlation settings in config.yaml:
correlations:
    pearson:
        calculate: false
        warn_high_correlations: false
        threshold: 0.9
    spearman:
        calculate: true
        warn_high_correlations: false
        threshold: 0.9
    kendall:
        calculate: false
        warn_high_correlations: false
        threshold: 0.9
    phi_k:
        calculate: false
        warn_high_correlations: false
        threshold: 0.9
    cramers:
        calculate: true
        warn_high_correlations: false
        threshold: 0.9
    auto:
        calculate: false
        warn_high_correlations: false
        threshold: 0.9
I only need Spearman for numerical data and Cramér's V for categorical data. When I run
tmp_profiler = ydata_profiling.ProfileReport(df, config_file='config.yaml')
it computes Spearman correctly, but skips many of the categorical columns in Cramér's V (with other datasets of the same size it skips all of them).
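To rule out a problem in the data itself, Cramér's V for a single pair of columns can be checked by hand, outside the profiler. The following is only a minimal sketch using the standard library: the function name `cramers_v` is hypothetical, and it computes the plain (uncorrected) statistic, which may differ from whatever variant the profiler uses internally.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramér's V between two equal-length sequences of labels.

    Pairs where either value is None are dropped (an assumption about how
    missing values are encoded; adapt it to your data).
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    row_totals = Counter(x for x, _ in pairs)
    col_totals = Counter(y for _, y in pairs)
    observed = Counter(pairs)

    # V is undefined when there is no data or one column has a single category.
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k < 1:
        return float("nan")

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for x, row_count in row_totals.items():
        for y, col_count in col_totals.items():
            expected = row_count * col_count / n
            chi2 += (observed.get((x, y), 0) - expected) ** 2 / expected

    return math.sqrt(chi2 / (n * k))
```

With a DataFrame, this could be called as `cramers_v(df["col_a"].tolist(), df["col_b"].tolist())` after dropping rows with missing values in either column (`col_a` and `col_b` are placeholder names). If it returns a sensible value for a pair that the report skips, the data is probably not the cause.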
I thought this was caused by the large amount of missing data, so I tried filling NaN with an empty string in those columns, but it didn't help. I don't think it is due to the configuration either, since I tried raising all the relevant limits:
vars:
    cat:
        length: false
        characters: false
        words: false
        cardinality_threshold: 5000000
        n_obs: 5
        # Set to zero to disable
        chi_squared_threshold: 0.0
        coerce_str_to_date: false
        redact: false
        histogram_largest: 10
        stop_words: []
...
# For categorical
categorical_maximum_correlation_distinct: 10000000
report:
    precision: 1000
UPDATE: even if I use
tmp_profiler = ydata_profiling.ProfileReport(df.sample(100000), config_file='config.yaml')
the same issue occurs.
Does anyone have an explanation or a solution for this behaviour?