
I have a PySpark DataFrame for which I want to calculate summary statistics (the count of each unique category in a column) and a cross-tabulation of every string column against one fixed column. For example, my df looks like this:

col1 col2 col3
Cat1 XYZ A
Cat1 XYZ C
Cat1 ABC B
Cat2 ABC A
Cat2 XYZ B
Cat2 MNO A

I want something like this

VarNAME Category Count A B C
col1 Cat1 3 1 1 1
col1 Cat2 3 2 0 1
col2 XYZ 3 1 1 1
col2 ABC 2 1 1 0
col2 MNO 1 1 0 0
col3 A 3 3 0 0
col3 B 2 0 2 0
col3 C 1 0 0 1

So, basically, I want a cross-tabulation of each individual column against col3, plus the total count. I can do this in Python using a loop, but looping works somewhat differently in PySpark.

ASD
  • Does this answer your question? [Pyspark Dataframe pivot and groupby count](https://stackoverflow.com/questions/58557498/pyspark-dataframe-pivot-and-groupby-count) – flaviut Sep 19 '22 at 20:00
  • yes, but partially. I have many columns so I was wondering whether it is possible to do it without a loop – ASD Sep 21 '22 at 07:05

1 Answer


Here are my 2 cents.

  1. Created a sample dataframe

     df = spark.createDataFrame(
         [("Cat1","XYZ","A"),
         ("Cat1","XYZ","C"),
         ("Cat1","ABC","B"),
         ("Cat2","ABC","A"),
         ("Cat2","XYZ","B"),
         ("Cat2","MNO","A")
     ],schema = ['col1','col2','col3'])
    
  2. Used the crosstab function, which computes a count for each value of col3; then added the total row count, added a constant column holding the source column's name, and renamed the crosstab key column. Finally, performed a union of these dataframes.

     import pyspark.sql.functions as fx

     df_union = (
         df.crosstab('col1', 'col3')
           .withColumn('count', fx.expr('A+B+C'))
           .withColumn('VarName', fx.lit('col1'))
           .withColumnRenamed('col1_col3', 'Category')
         .union(df.crosstab('col2', 'col3')
                  .withColumn('count', fx.expr('A+B+C'))
                  .withColumn('VarName', fx.lit('col2'))
                  .withColumnRenamed('col2_col3', 'Category'))
         .union(df.crosstab('col3', 'col3')
                  .withColumn('count', fx.expr('A+B+C'))
                  .withColumn('VarName', fx.lit('col3'))
                  .withColumnRenamed('col3_col3', 'Category'))
     )
    
  3. Printed the dataframe with the columns in the desired order:

     df_union.select('VarName','Category','count','A','B','C').show()
    

Please check the sample output for reference.

Banu
  • Thanks, but I have more than 100 columns, so it would be difficult to write this out for each one. I have done almost the same thing using a loop to iterate over the columns, but it takes a very long time; I have more than 1 million records. – ASD Oct 10 '22 at 08:25