I have a dataframe with 100+ numerical columns. I want to discretize some of them and then apply groupBy and crosstab on the discretized columns.
Currently I am looping over all the selected numerical columns, but it is very time-consuming. Is there a better and cleaner solution? My code looks like this:
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql.functions import mean, count
from pyspark.sql.types import StructType

# empty accumulator; the result for each column is unioned into it
df_num = spark.createDataFrame(data=[], schema=StructType([]))
for name in number_columns:
    # bin the current column into 10 quantile buckets
    steps = QuantileDiscretizer(numBuckets=10, inputCol=name, outputCol=name + 'Bin')
    Selected_data = steps.fit(Selected_data).transform(Selected_data)
    # per-bucket means and counts
    tmp = Selected_data.groupBy(name + 'Bin').agg(mean("ABC"), mean("XYZ"), count("ABC"), count("XYZ")).withColumnRenamed(name + 'Bin', 'Category')
    # crosstab names its key column '<col1>_<col2>'
    temp = Selected_data.crosstab(name + 'Bin', 'code').withColumnRenamed(name + 'Bin_code', 'Category')
    temp = temp.join(tmp, 'Category', 'inner')
    df_num = df_num.unionByName(temp, allowMissingColumns=True)
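
One option I have looked at: since Spark 3.0, QuantileDiscretizer also accepts inputCols/outputCols, so all the bin columns can be fitted and added in a single pass instead of refitting inside the loop. A minimal sketch (assuming Spark 3.0+, with number_columns and Selected_data as above):

from pyspark.ml.feature import QuantileDiscretizer

# fit every bin column in one pass (multi-column support needs Spark 3.0+;
# numBuckets applies to all of inputCols)
discretizer = QuantileDiscretizer(
    numBuckets=10,
    inputCols=number_columns,
    outputCols=[c + 'Bin' for c in number_columns],
)
binned = discretizer.fit(Selected_data).transform(Selected_data)
binned.cache()  # the per-column groupBy/crosstab passes all reuse this dataframe

This removes the repeated fit/transform, but the groupBy and crosstab calls still run once per column, so I am not sure it helps enough on its own.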