Suppose a given dataframe:
Model | Color |
---|---|
Car | Red |
Car | Red |
Car | Blue |
Truck | Red |
Truck | Blue |
Truck | Yellow |
SUV | Blue |
SUV | Blue |
Car | Blue |
Car | Yellow |
I want to add color columns that keep a count of each color across each model to give the following dataframe:
Model | Color | Red | Blue | Yellow |
---|---|---|---|---|
Car | Red | 2 | 2 | 1 |
Car | Red | 2 | 2 | 1 |
Car | Blue | 2 | 2 | 1 |
Truck | Red | 1 | 1 | 1 |
Truck | Blue | 1 | 1 | 1 |
Truck | Yellow | 1 | 1 | 1 |
SUV | Blue | 0 | 2 | 0 |
SUV | Blue | 0 | 2 | 0 |
Car | Blue | 2 | 2 | 1 |
Car | Yellow | 2 | 2 | 1 |
This dataset has billions of records so I'm trying to stay away from UDF's and prefer to use built in methods if possible.
I normally use a window function with .size() and .collect_set() to count this type of data but adding multiple different new df columns based of different column categories is causing me issues because I'm not sure if I need to isolate the individual categories by adding additional window partitions or a .where() or isin() method with one window partition. Any feedback or recommendations are appreciated. Thank you.