I have a dataframe with 100+ numerical columns. I want to discretize some of them and then apply groupBy and crosstab on the discretized columns.
Currently I am looping over all the selected numerical columns, but it is very time-consuming. Is there a better and cleaner solution? My code looks like this:
from pyspark.ml.feature import QuantileDiscretizer
from pyspark.sql.functions import mean, count
from pyspark.sql.types import StructType

# empty accumulator; the result for each column is unioned into it
df_num = spark.createDataFrame(data=[], schema=StructType([]))
for name in number_columns:
    # bin the current column into 10 quantile buckets
    steps = QuantileDiscretizer(numBuckets=10, inputCol=name, outputCol=name + 'Bin')
    Selected_data = steps.fit(Selected_data).transform(Selected_data)
    # per-bucket means and counts
    tmp = Selected_data.groupBy(name + 'Bin').agg(mean("ABC"), mean("XYZ"), count("ABC"), count("XYZ")).withColumnRenamed(name + 'Bin', 'Category')
    # crosstab names its key column '<col1>_<col2>'
    temp = Selected_data.crosstab(name + 'Bin', 'code').withColumnRenamed(name + 'Bin_code', 'Category')
    temp = temp.join(tmp, 'Category', 'inner')
    df_num = df_num.unionByName(temp, allowMissingColumns=True)
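
One option I have looked at: since Spark 3.0, QuantileDiscretizer also accepts inputCols/outputCols, so all the bin columns can be fitted and added in a single pass instead of refitting inside the loop. A minimal sketch (assuming Spark 3.0+, with number_columns and Selected_data as above):

from pyspark.ml.feature import QuantileDiscretizer

# fit every bin column in one pass (multi-column support needs Spark 3.0+;
# numBuckets applies to all of inputCols)
discretizer = QuantileDiscretizer(
    numBuckets=10,
    inputCols=number_columns,
    outputCols=[c + 'Bin' for c in number_columns],
)
binned = discretizer.fit(Selected_data).transform(Selected_data)
binned.cache()  # the per-column groupBy/crosstab passes all reuse this dataframe

This removes the repeated fit/transform, but the groupBy and crosstab calls still run once per column, so I am not sure it helps enough on its own.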