I need to profile data coming from Snowflake in Databricks. The data is just a sample of 100 rows but contains 3k+ columns, and it will eventually have more rows. When I reduce the number of columns, the profiling finishes very quickly, but the more columns there are, the longer it takes. I tried profiling the full sample, and after more than 10 hours I had to cancel the job.
Here is the code I use:

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# read the 100-row sample from Snowflake and wrap it for profiling
df_sf = spark.read.format('snowflake').options(**sfOptions).option('query', f'select * from {db_name}').load()
df_ge = ge.dataset.SparkDFDataset(df_sf)
BasicDatasetProfiler.profile(df_ge)
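
When I say that reducing the number of columns makes it fast, I mean something like the following (the cut to the first 100 columns and the df_small / df_ge_small names are just illustrative; the rest is the same code as above):

df_small = df_sf.select(df_sf.columns[:100])      # keep only the first 100 of the 3k+ columns
df_ge_small = ge.dataset.SparkDFDataset(df_small)
BasicDatasetProfiler.profile(df_ge_small)         # this version finishes quickly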
You can test this with any data that has a lot of columns. Is this normal, or am I doing something wrong?