
I need to profile data coming from Snowflake in Databricks. The data is just a sample of 100 rows but with 3k+ columns, and will eventually have more rows. When I reduce the number of columns, the profiling finishes very quickly, but the more columns there are, the longer it takes. I tried profiling the full sample and had to cancel the job after more than 10 hours.

Here is the code I use

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# note the trailing .load(); without it spark.read only returns a DataFrameReader
df_sf = spark.read.format('snowflake').options(**sfOptions).option('query', f'select * from {db_name}').load()

df_ge = ge.dataset.SparkDFDataset(df_sf)

BasicDatasetProfiler.profile(df_ge)

You can test this with any data that has a lot of columns. Is this normal or am I doing something wrong?

1 Answer


Basically, GE computes metrics for each column individually, so it triggers a Spark action (probably a collect) for each column and for each metric it computes. Collects are among the most expensive operations you can run in Spark, so it is more or less expected that the more columns you have, the longer it takes.
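If the per-column actions are the bottleneck, one mitigation is to persist the 100-row sample once, so every action hits cached data instead of re-reading from Snowflake, and to profile the columns in smaller batches so each job stays manageable. Below is only a rough sketch of that idea using the legacy SparkDFDataset API from the question; the profile_in_batches helper and the batch size are made up for illustration and are not part of GE.

from great_expectations.dataset import SparkDFDataset
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Materialise the sample once so the repeated per-column actions
# read from the Spark cache instead of going back to Snowflake.
df_cached = df_sf.cache()
df_cached.count()  # force the cache to be populated

def profile_in_batches(df, batch_size=200):
    # Hypothetical helper: profile the columns in chunks so each
    # profiling run only touches a manageable number of columns.
    results = []
    cols = df.columns
    for i in range(0, len(cols), batch_size):
        subset = df.select(cols[i:i + batch_size])
        results.append(BasicDatasetProfiler.profile(SparkDFDataset(subset)))
    return results

profiles = profile_in_batches(df_cached)

Batching does not remove the per-column cost itself, but the cache avoids a Snowflake round trip for every action, and the smaller jobs let you see progress instead of waiting on a single 10-hour run.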

Steven
  • With more testing, I saw that even with a Pandas DataFrame, execution time can get very slow with many columns: it is not linearly proportional to the number of columns and looks quadratic instead. Any idea why? – Francis Gosselin Jul 19 '21 at 17:37