
I have a PySpark df:

Store_ID  Category  ID   Sales
1         A         123  23
2         A         123  45
1         A         234  67
1         B         567  78
2         B         567  34
3         D         789  12
1         A         890  12

Expected:

Store_ID  A_ID  B_ID  C_ID  D_ID  Sales_A  Sales_B  Sales_C  Sales_D
1         3     1     0     0     102      78       0        0
2         1     1     0     0     45       34       0        0
3         0     0     0     1     0        0        0        12

I am able to transform it this way using SQL (after creating a temp view):

SELECT Store_Id,
       SUM(IF(Category='A',Sales,0)) AS Sales_A,
       SUM(IF(Category='B',Sales,0)) AS Sales_B,
       SUM(IF(Category='C',Sales,0)) AS Sales_C,
       SUM(IF(Category='D',Sales,0)) AS Sales_D,
       COUNT(DISTINCT NULLIF(IF(Category='A',ID,0),0)) AS A_ID,
       COUNT(DISTINCT NULLIF(IF(Category='B',ID,0),0)) AS B_ID,
       COUNT(DISTINCT NULLIF(IF(Category='C',ID,0),0)) AS C_ID,
       COUNT(DISTINCT NULLIF(IF(Category='D',ID,0),0)) AS D_ID
FROM df
GROUP BY Store_Id;

How do we achieve the same in PySpark using native functions, as that's much faster?

  • What you need is `pivot()`. See [this](https://stackoverflow.com/q/72787341/8279585) SO question's answer for details – samkart Jun 29 '22 at 06:58
  • possible duplicate of [aggregate pyspark dataframe and create multiple columns](https://stackoverflow.com/q/72787341/8279585) – samkart Jun 29 '22 at 06:58
  • Can you answer the question? That is a different objective and I'm stuck – Scope Jun 29 '22 at 07:00

1 Answer


This operation is called pivoting.

  • a couple of aggregations, since you need both the count of IDs and the sum of Sales
  • aliases for the aggregations, to change the column names
  • providing values in pivot, for cases where you want numbers for category C even though C doesn't exist in the data. Providing explicit values also boosts performance.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'A', 123, 23),
     (2, 'A', 123, 45),
     (1, 'A', 234, 67),
     (1, 'B', 567, 78),
     (2, 'B', 567, 34),
     (3, 'D', 789, 12),
     (1, 'A', 890, 12)],
    ['Store_ID', 'Category', 'ID', 'Sales'])

Script:

df = (df
    .groupBy('Store_ID')
    .pivot('Category', ['A', 'B', 'C', 'D'])  # explicit values keep C even though it's absent from the data
    .agg(
        F.countDistinct('ID').alias('ID'),    # distinct IDs per store and category
        F.sum('Sales').alias('Sales'))
    .fillna(0))  # absent store-category combinations become 0 instead of null
df.show()
# +--------+----+-------+----+-------+----+-------+----+-------+
# |Store_ID|A_ID|A_Sales|B_ID|B_Sales|C_ID|C_Sales|D_ID|D_Sales|
# +--------+----+-------+----+-------+----+-------+----+-------+
# |       1|   3|    102|   1|     78|   0|      0|   0|      0|
# |       3|   0|      0|   0|      0|   0|      0|   1|     12|
# |       2|   1|     45|   1|     34|   0|      0|   0|      0|
# +--------+----+-------+----+-------+----+-------+----+-------+
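Note that pivot names the result columns <value>_<alias>, e.g. A_Sales, while the expected output uses Sales_A. If you need that exact naming, a minimal rename sketch (assuming the four categories above):

# Hypothetical rename step to match the Sales_<category> naming:
for c in ['A', 'B', 'C', 'D']:
    df = df.withColumnRenamed(f'{c}_Sales', f'Sales_{c}')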
  • Thank you, I really appreciate your response. A small question here: will the `count` used give the distinct count of the ID column, or will it count everything? In case I want to count distinct IDs only, what should I do? – Scope Jun 29 '22 at 08:29
  • I've updated the answer while you were testing. I've replaced `count` with `countDistinct`. – ZygD Jun 29 '22 at 08:31
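For reference, a minimal sketch of the difference between the two, run against the input DataFrame from the question (the script above reassigns df to the pivoted result): count counts every non-null row, while countDistinct counts unique values.

# Assuming df is the original input, before the pivot:
(df.where(F.col('Category') == 'B')
   .agg(F.count('ID').alias('cnt'),               # 2: counts both category-B rows
        F.countDistinct('ID').alias('cnt_dist'))  # 1: ID 567 appears in both rows
   .show())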