I have a PySpark DataFrame:
Store_ID | Category | ID | Sales |
---|---|---|---|
1 | A | 123 | 23 |
2 | A | 123 | 45 |
1 | A | 234 | 67 |
1 | B | 567 | 78 |
2 | B | 567 | 34 |
3 | D | 789 | 12 |
1 | A | 890 | 12 |
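
For reference, a minimal snippet to reproduce this DataFrame (assuming an active `SparkSession` named `spark`):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "A", 123, 23),
    (2, "A", 123, 45),
    (1, "A", 234, 67),
    (1, "B", 567, 78),
    (2, "B", 567, 34),
    (3, "D", 789, 12),
    (1, "A", 890, 12),
]
df = spark.createDataFrame(data, ["Store_ID", "Category", "ID", "Sales"])
```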
Expected output:
Store_ID | A_ID | B_ID | C_ID | D_ID | Sales_A | Sales_B | Sales_C | Sales_D |
---|---|---|---|---|---|---|---|---|
1 | 3 | 1 | 0 | 0 | 102 | 78 | 0 | 0 |
2 | 1 | 1 | 0 | 0 | 45 | 34 | 0 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 12 |
I am able to do this transformation with Spark SQL after registering the DataFrame as a temp view:
```sql
SELECT Store_ID,
       SUM(IF(Category = 'A', Sales, 0)) AS Sales_A,
       SUM(IF(Category = 'B', Sales, 0)) AS Sales_B,
       SUM(IF(Category = 'C', Sales, 0)) AS Sales_C,
       SUM(IF(Category = 'D', Sales, 0)) AS Sales_D,
       COUNT(DISTINCT NULLIF(IF(Category = 'A', ID, 0), 0)) AS A_ID,
       COUNT(DISTINCT NULLIF(IF(Category = 'B', ID, 0), 0)) AS B_ID,
       COUNT(DISTINCT NULLIF(IF(Category = 'C', ID, 0), 0)) AS C_ID,
       COUNT(DISTINCT NULLIF(IF(Category = 'D', ID, 0), 0)) AS D_ID
FROM df
GROUP BY Store_ID;
```
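
This is roughly how the view is registered and the query executed (sketch; only the A and B columns are repeated here, C and D follow the same pattern):

```python
# Register the DataFrame as a temp view so the SQL can reference it as "df"
df.createOrReplaceTempView("df")

sql_result = spark.sql("""
    SELECT Store_ID,
           SUM(IF(Category = 'A', Sales, 0)) AS Sales_A,
           SUM(IF(Category = 'B', Sales, 0)) AS Sales_B,
           COUNT(DISTINCT NULLIF(IF(Category = 'A', ID, 0), 0)) AS A_ID,
           COUNT(DISTINCT NULLIF(IF(Category = 'B', ID, 0), 0)) AS B_ID
    FROM df
    GROUP BY Store_ID
""")
sql_result.show()
```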
How can I achieve the same result using native PySpark DataFrame functions, since I expect that to be faster?
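
A rough sketch of the kind of DataFrame-API code I have in mind, using conditional aggregation with `when`, `sum`, and `countDistinct` over a `groupBy` (column names taken from the tables above; I have not verified it against the expected output):

```python
from pyspark.sql import functions as F

categories = ["A", "B", "C", "D"]

agg_exprs = []
for c in categories:
    # when() without otherwise() yields NULL for other categories,
    # so countDistinct ignores those rows
    agg_exprs.append(
        F.countDistinct(F.when(F.col("Category") == c, F.col("ID"))).alias(f"{c}_ID")
    )
    # 0 for rows of other categories, mirroring SUM(IF(...)) in the SQL
    agg_exprs.append(
        F.sum(F.when(F.col("Category") == c, F.col("Sales")).otherwise(0)).alias(f"Sales_{c}")
    )

result = df.groupBy("Store_ID").agg(*agg_exprs)
result.show()
```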