I need to analyze a huge table with approximately 7 million rows and 20 columns. I can read the data into a dataframe without using Spark, but I don't have enough memory for the computation.

Does anyone know whether the package can work in a distributed Spark environment?

I read the docs at https://ydata-profiling.ydata.ai/docs/master/pages/integrations/pyspark.html, but I can't tell whether the package can only read data from a Spark DataFrame or whether it runs entirely on Spark. In the first case, I don't think it solves my memory issue, and since I need to compute correlations I can't use the "minimal" option.

Simocrep

1 Answer

ydata-profiling does work with Spark.

You only need to provide a PySpark DataFrame as input. Have a look at their Databricks example: https://github.com/ydataai/ydata-profiling/tree/master/examples/integrations/databricks
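
For reference, a minimal sketch of what that looks like (the file path, app name, and report title are placeholders; this assumes ydata-profiling is installed with its PySpark extra, e.g. `pip install "ydata-profiling[pyspark]"`):

```python
from pyspark.sql import SparkSession
from ydata_profiling import ProfileReport

spark = SparkSession.builder.appName("profiling-example").getOrCreate()

# Read the data as a Spark DataFrame instead of pandas, so the ~7M rows
# stay distributed across the cluster rather than sitting in driver memory.
# (Path is a placeholder.)
df = spark.read.csv("path/to/your_table.csv", header=True, inferSchema=True)

# Passing a Spark DataFrame is what makes ydata-profiling use its Spark backend.
report = ProfileReport(df, title="Profiling Report")
report.to_file("profiling_report.html")
```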

FabC