missing columns for ydata-profiling correlation report

Question

I'm using ydata-profiling (the successor to pandas-profiling) to compute correlations among the columns of large datasets (e.g. 400411 rows and 27 columns). These are the relevant settings in config.yaml:

correlations:
    pearson:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9
    spearman:
      calculate: true
      warn_high_correlations: false
      threshold: 0.9
    kendall:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9
    phi_k:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9
    cramers:
      calculate: true
      warn_high_correlations: false
      threshold: 0.9
    auto:
      calculate: false
      warn_high_correlations: false
      threshold: 0.9

I need only Spearman for numerical data and Cramér's V for categorical data. When I do

tmp_profiler = ydata_profiling.ProfileReport(df, config_file='config.yaml')

it computes Spearman correctly, but skips many categorical columns in Cramér's V (with other datasets of the same size it skips all of them).

I thought this was due to the large amount of missing data, so I tried filling NaN with an empty string in those columns. It didn't work. I don't think it is due to the configuration, since I tried raising all the relevant values:

vars:
    cat:
        length: false
        characters: false
        words: false
        cardinality_threshold: 5000000
        n_obs: 5
        # Set to zero to disable
        chi_squared_threshold: 0.0
        coerce_str_to_date: false
        redact: false
        histogram_largest: 10
        stop_words: []
...
# For categorical
categorical_maximum_correlation_distinct: 10000000

report:
  precision: 1000

UPDATE: even if I use

tmp_profiler = ydata_profiling.ProfileReport(df.sample(100000), config_file='config.yaml')

there is the same issue.

Does anyone have an explanation or a solution for this behaviour?

score 0 · Accepted Answer · answered May 31 '23 at 07:47

Are the columns with missing values correctly identified as categorical? One cause of the problem could be type inference. If your categorical columns have high cardinality, they may be inferred as text or another type.
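A quick way to check this, sketched with plain pandas: list each column's dtype, non-null count, and number of distinct values. The helper name `cardinality_summary` and the toy columns are made up for illustration, and the sketch does not reproduce ydata-profiling's own inference rules. String columns whose distinct count is close to the row count are the likely candidates for being inferred as text rather than categorical.

```python
import pandas as pd

def cardinality_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, non-null count, distinct count, and distinct/non-null ratio."""
    non_null = df.notna().sum()
    distinct = df.nunique(dropna=True)
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": non_null,
        "distinct": distinct,
        # Avoid dividing by zero for columns that are entirely missing
        "distinct_ratio": (distinct / non_null.where(non_null > 0)).round(3),
    }).sort_values("distinct", ascending=False)

# Toy frame standing in for the real data (hypothetical columns)
toy = pd.DataFrame({
    "city": ["Rome", "Milan", "Rome", None, "Milan", "Rome"],
    "note": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "amount": [1.0, 2.5, 3.0, 4.5, 5.0, 6.5],
})
summary = cardinality_summary(toy)
print(summary)
```

Here `note` has one distinct value per row, which is the pattern that tends to look like free text, while `city` repeats a small set of values.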

One way to override the inferred types:

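from ydata_profiling import ProfileReport

# Force these columns to be treated as categorical instead of relying on type inference
prof = ProfileReport(
    df,
    config_file="config.yaml",
    type_schema={
        "column_1": "categorical",
        "column_2": "categorical",
    },
)
prof.to_file("profile.html")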

Once the features are treated as categorical, they should appear in the correlations. It is also possible that your columns have so much missing data that your sample does not contain any valid data.
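To test the missing-data hypothesis outside the report, a rough sketch like the following could help. It measures the share of missing values per column, counts the rows where both columns of a pair are present, and computes a plain, uncorrected Cramér's V by hand. The function name `cramers_v` and the toy columns are hypothetical, and ydata-profiling may apply a bias correction, so the values need not match the report exactly.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Uncorrected Cramér's V, using only rows where both values are present."""
    # Cast to str so unobserved levels of a categorical dtype add no empty rows
    pair = pd.concat([x, y], axis=1).dropna().astype(str)
    if pair.empty:
        return float("nan")
    table = pd.crosstab(pair.iloc[:, 0], pair.iloc[:, 1]).to_numpy(dtype=float)
    if min(table.shape) < 2:
        return float("nan")  # a constant column carries no association
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(table.shape) - 1))))

# Toy frame standing in for two categorical columns with gaps (hypothetical data)
toy = pd.DataFrame({
    "x": ["a", "a", "b", "b", None, "a"],
    "y": ["u", "u", "v", "v", "u", None],
})
missing_share = toy.isna().mean()  # fraction of missing values per column
usable_rows = int(toy[["x", "y"]].notna().all(axis=1).sum())  # rows usable for the pair
v = cramers_v(toy["x"], toy["y"])
print(missing_share, usable_rows, v)
```

If `usable_rows` is zero or very small for a pair of columns in the real data, or in the sample passed to the profiler, there is little for a categorical correlation to work with.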

SeaEngineering

Thanks for the answer @SeaEngineering. Let me ask a few more things: 1. How can I manage the cardinality of categorical data? (I suppose I have to use "cardinality_threshold" in the config file above.) 2. Isn't it enough to pass string columns for the profiler to treat them as categorical? 3. The problem may be related to missing data; I can check that once I have answers to the previous questions. – Simocrep Jun 01 '23 at 12:34
It seems that since ydata-profiling 4.2.0, the categorical threshold for strings sits under `text` in the config file. For now we can set the config as: `vars: {text: {categorical_threshold: 5000000, percentage_cat_threshold: 1.0}}` – SeaEngineering Jun 06 '23 at 11:14
Regarding 2.: from 4.2 onwards it is not enough. A new variable type was introduced to handle cases where free text was being treated as categorical. In earlier versions, I believe it was enough. – SeaEngineering Jun 06 '23 at 12:16
