
I have a PySpark DataFrame for which I want to calculate summary statistics (the count of each unique category in a column) and a cross-tabulation of every string column against one fixed column. For example, my df looks like this:

col1 col2 col3
Cat1 XYZ A
Cat1 XYZ C
Cat1 ABC B
Cat2 ABC A
Cat2 XYZ B
Cat2 MNO A

I want something like this

VarNAME Category Count A B C
col1 Cat1 3 1 1 1
col1 Cat2 3 2 0 1
col2 XYZ 3 1 1 1
col2 ABC 2 1 1 0
col2 MNO 1 1 0 0
col3 A 3 3 0 0
col3 B 2 0 2 0
col3 C 1 0 0 1

So, basically, I want a cross-tabulation of each individual column against col3, plus the total count. I can do this in Python using a loop, but looping works somewhat differently in PySpark.

ASD
  • Does this answer your question? [Pyspark Dataframe pivot and groupby count](https://stackoverflow.com/questions/58557498/pyspark-dataframe-pivot-and-groupby-count) – flaviut Sep 19 '22 at 20:00
  • yes, but partially. I have many columns so I was wondering whether it is possible to do it without a loop – ASD Sep 21 '22 at 07:05

1 Answer


Here are my 2 cents.

  1. Created a sample dataframe

     df = spark.createDataFrame(
         [("Cat1","XYZ","A"),
         ("Cat1","XYZ","C"),
         ("Cat1","ABC","B"),
         ("Cat2","ABC","A"),
         ("Cat2","XYZ","B"),
         ("Cat2","MNO","A")
     ],schema = ['col1','col2','col3'])
    
  2. Used the crosstab function, which computes a count for each value of col3; then added the total row count, added a constant column holding the source column's name, and renamed the crosstab key column. Finally, performed a union of these dataframes.

     import pyspark.sql.functions as fx

     df_union = (
         df.crosstab('col1', 'col3')
           .withColumn('count', fx.expr('A+B+C'))
           .withColumn('VarName', fx.lit('col1'))
           .withColumnRenamed('col1_col3', 'Category')
         .union(df.crosstab('col2', 'col3')
                  .withColumn('count', fx.expr('A+B+C'))
                  .withColumn('VarName', fx.lit('col2'))
                  .withColumnRenamed('col2_col3', 'Category'))
         .union(df.crosstab('col3', 'col3')
                  .withColumn('count', fx.expr('A+B+C'))
                  .withColumn('VarName', fx.lit('col3'))
                  .withColumnRenamed('col3_col3', 'Category'))
     )
    
  3. Printed the dataframe with the columns in the desired order:

     df_union.select('VarName','Category','count','A','B','C').show()
    

Please check the sample output for reference.

Banu
  • Thanks, but I have more than 100 columns, so it would be difficult to write this out for each one. I have done almost the same thing using a loop to iterate over the columns, but it takes a very long time; I have more than 1 million records. – ASD Oct 10 '22 at 08:25