I need to analyze a huge table with approximately 7 million rows and 20 columns. I can read the data into a dataframe without using Spark, but I don't have enough memory for the computation.

Does anyone know whether the package can work in a distributed Spark environment?

I read the docs at https://ydata-profiling.ydata.ai/docs/master/pages/integrations/pyspark.html, but I can't tell whether the package can only read data from a Spark DataFrame or whether it runs entirely on Spark. In the first case, I don't think it solves my memory issue, and since I need to compute correlations I can't use the "minimal" option.

Simocrep

1 Answer

ydata-profiling does work with Spark.

You only need to provide a PySpark DataFrame as input. Have a look at their Databricks example: https://github.com/ydataai/ydata-profiling/tree/master/examples/integrations/databricks
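
For reference, a minimal sketch of what that looks like (the file path, app name, and report title are placeholders; this assumes ydata-profiling is installed with its PySpark extra, e.g. `pip install "ydata-profiling[pyspark]"`):

```python
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

spark = SparkSession.builder.appName("profiling-example").getOrCreate()

# Read the data as a Spark DataFrame instead of pandas, so the ~7M rows
# stay distributed across the cluster rather than sitting in driver memory.
# (Path is a placeholder.)
df = spark.read.csv("path/to/your_table.csv", header=True, inferSchema=True)

# Passing a Spark DataFrame is what makes ydata-profiling use its Spark backend.
report = ProfileReport(df, title="Profiling Report")
report.to_file("profiling_report.html")
```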

FabC