I have a Spark DataFrame with around 1 million rows. I am using PySpark and need to apply the Box-Cox transformation from the SciPy library to each column of the DataFrame. However, the boxcox function only accepts a 1-D NumPy array as input. How can I do this efficiently?
Is a NumPy array distributed across the Spark cluster, or does it collect all the elements onto the single node where the driver program is running?
Suppose df is my DataFrame, with a column named C1.
I want to perform an operation similar to this:
stats.boxcox(df.select("C1"))
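
To make the intended operation concrete, here is a minimal sketch of what Box-Cox does to one column once lambda is known. It uses plain Python rather than Spark or SciPy, and the function name boxcox_fixed and the sample values are hypothetical, chosen only for illustration. As far as I understand, stats.boxcox additionally estimates lambda from the whole column when no lambda is passed, which is presumably why it wants the full 1-D array.

```python
import math


def boxcox_fixed(values, lmbda):
    """Box-Cox transform of strictly positive values for a given lambda.

    y = (x**lmbda - 1) / lmbda   if lmbda != 0
    y = log(x)                   if lmbda == 0
    """
    out = []
    for x in values:
        if x <= 0:
            raise ValueError("Box-Cox requires strictly positive values")
        if lmbda == 0:
            out.append(math.log(x))
        else:
            out.append((x ** lmbda - 1.0) / lmbda)
    return out


# Hypothetical sample values standing in for column C1.
c1_values = [1.0, 2.0, 3.0]
transformed = boxcox_fixed(c1_values, 2.0)
```

With a fixed lambda the transform is elementwise, so each row can be handled independently; estimating lambda is the step that seems to need the whole column in one place.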