
I have a PySpark df:

Store_ID  Category  ID   Sales
1         A         123  23
2         A         123  45
1         A         234  67
1         B         567  78
2         B         567  34
3         D         789  12
1         A         890  12

Expected:

Store_ID  A_ID  B_ID  C_ID  D_ID  Sales_A  Sales_B  Sales_C  Sales_D
1         3     1     0     0     102      78       0        0
2         1     1     0     0     45       34       0        0
3         0     0     0     1     0        0        0        12

I am able to transform it this way using SQL (after creating a temp view):

SELECT Store_Id,
       SUM(IF(Category='A',Sales,0)) AS Sales_A,
       SUM(IF(Category='B',Sales,0)) AS Sales_B,
       SUM(IF(Category='C',Sales,0)) AS Sales_C,
       SUM(IF(Category='D',Sales,0)) AS Sales_D,
       COUNT(DISTINCT NULLIF(IF(Category='A',ID,0),0)) AS A_ID,
       COUNT(DISTINCT NULLIF(IF(Category='B',ID,0),0)) AS B_ID,
       COUNT(DISTINCT NULLIF(IF(Category='C',ID,0),0)) AS C_ID,
       COUNT(DISTINCT NULLIF(IF(Category='D',ID,0),0)) AS D_ID
FROM df
GROUP BY Store_Id;

How do we achieve the same in PySpark using native functions, as that's much faster?

  • What you need is `pivot()`. See [this](https://stackoverflow.com/q/72787341/8279585) SO question's answer for details – samkart Jun 29 '22 at 06:58
  • possible duplicate of [aggregate pyspark dataframe and create multiple columns](https://stackoverflow.com/q/72787341/8279585) – samkart Jun 29 '22 at 06:58
  • Can you answer the question? That is a different objective and I'm stuck – Scope Jun 29 '22 at 07:00

1 Answer


This operation is called pivoting.

  • a couple of aggregations, since you need both the count of IDs and the sum of Sales
  • aliases for the aggregations, to change the column names
  • providing values in pivot, for cases where you want numbers for category C even though C doesn't exist in the data. Providing explicit values also boosts performance.

Input:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(1, 'A', 123, 23),
     (2, 'A', 123, 45),
     (1, 'A', 234, 67),
     (1, 'B', 567, 78),
     (2, 'B', 567, 34),
     (3, 'D', 789, 12),
     (1, 'A', 890, 12)],
    ['Store_ID', 'Category', 'ID', 'Sales'])

Script:

df = (df
    .groupBy('Store_ID')
    .pivot('Category', ['A', 'B', 'C', 'D'])  # explicit values keep C even though it's absent from the data
    .agg(
        F.countDistinct('ID').alias('ID'),    # distinct IDs per store and category
        F.sum('Sales').alias('Sales'))
    .fillna(0))  # absent store-category combinations become 0 instead of null
df.show()
# +--------+----+-------+----+-------+----+-------+----+-------+
# |Store_ID|A_ID|A_Sales|B_ID|B_Sales|C_ID|C_Sales|D_ID|D_Sales|
# +--------+----+-------+----+-------+----+-------+----+-------+
# |       1|   3|    102|   1|     78|   0|      0|   0|      0|
# |       3|   0|      0|   0|      0|   0|      0|   1|     12|
# |       2|   1|     45|   1|     34|   0|      0|   0|      0|
# +--------+----+-------+----+-------+----+-------+----+-------+
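Note that pivot names the result columns <value>_<alias>, e.g. A_Sales, while the expected output uses Sales_A. If you need that exact naming, a minimal rename sketch (assuming the four categories above):

# Hypothetical rename step to match the Sales_<category> naming:
for c in ['A', 'B', 'C', 'D']:
    df = df.withColumnRenamed(f'{c}_Sales', f'Sales_{c}')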
  • Thank you, I really appreciate your response. A small question here: will the `count` used give the distinct count of the ID column, or will it count everything? In case I want to count distinct IDs only, what should I do? – Scope Jun 29 '22 at 08:29
  • I've updated the answer while you were testing. I've replaced `count` with `countDistinct`. – ZygD Jun 29 '22 at 08:31
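For reference, a minimal sketch of the difference between the two, run against the input DataFrame from the question (the script above reassigns df to the pivoted result): count counts every non-null row, while countDistinct counts unique values.

# Assuming df is the original input, before the pivot:
(df.where(F.col('Category') == 'B')
   .agg(F.count('ID').alias('cnt'),               # 2: counts both category-B rows
        F.countDistinct('ID').alias('cnt_dist'))  # 1: ID 567 appears in both rows
   .show())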