
I need to profile data coming from Snowflake in Databricks. The data is just a sample of 100 rows but with 3k+ columns, and will eventually have more rows. When I reduce the number of columns, the profiling finishes very quickly, but the more columns there are, the longer it takes. I tried profiling the full sample and had to cancel the job after more than 10 hours.

Here is the code I use

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# note the trailing .load(); without it spark.read only returns a DataFrameReader
df_sf = spark.read.format('snowflake').options(**sfOptions).option('query', f'select * from {db_name}').load()

df_ge = ge.dataset.SparkDFDataset(df_sf)

BasicDatasetProfiler.profile(df_ge)

You can test this with any data that has a lot of columns. Is this normal or am I doing something wrong?

1 Answer


Basically, GE computes metrics for each column individually, so it triggers a Spark action (probably a collect) for each column and for each metric it computes. Collects are among the most expensive operations you can run in Spark, so it is more or less expected that the more columns you have, the longer it takes.
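If the per-column actions are the bottleneck, one mitigation is to persist the 100-row sample once, so every action hits cached data instead of re-reading from Snowflake, and to profile the columns in smaller batches so each job stays manageable. Below is only a rough sketch of that idea using the legacy SparkDFDataset API from the question; the profile_in_batches helper and the batch size are made up for illustration and are not part of GE.

from great_expectations.dataset import SparkDFDataset
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# Materialise the sample once so the repeated per-column actions
# read from the Spark cache instead of going back to Snowflake.
df_cached = df_sf.cache()
df_cached.count()  # force the cache to be populated

def profile_in_batches(df, batch_size=200):
    # Hypothetical helper: profile the columns in chunks so each
    # profiling run only touches a manageable number of columns.
    results = []
    cols = df.columns
    for i in range(0, len(cols), batch_size):
        subset = df.select(cols[i:i + batch_size])
        results.append(BasicDatasetProfiler.profile(SparkDFDataset(subset)))
    return results

profiles = profile_in_batches(df_cached)

Batching does not remove the per-column cost itself, but the cache avoids a Snowflake round trip for every action, and the smaller jobs let you see progress instead of waiting on a single 10-hour run.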

Steven
  • With more testing, I saw that even with a Pandas DataFrame, execution time can get very slow with many columns: it is not linearly proportional to the number of columns and looks quadratic instead. Any idea why? – Francis Gosselin Jul 19 '21 at 17:37