Say I have following table/dataframe:
Id | Col1 | Col2 | Col3 |
---|---|---|---|
1 | 100 | aaa | xxx |
2 | 200 | aaa | yyy |
3 | 300 | ccc | zzz |
I need to calculate an extra column CalculatedValue which could have one or multiple values based on other columns' values.
I have tried with a regular CASE WHEN statement like:
df_out = (df_source
.withColumn('CalculatedValue',
expr("CASE WHEN Col1 = 100 THEN 'AAA111'
WHEN Col2 = 'aaa' then 'BBB222'
WHEN Col3 = 'zzz' then 'CCC333'
END")
)
Note I'm doing it with expr() because the actual CASE WHEN statement is a very long string built dynamically.
This results in a table/dataframe like this:
Id | Col1 | Col2 | Col3 | CalculatedValue |
---|---|---|---|---|
1 | 100 | aaa | xxx | AAA111 |
2 | 200 | aaa | yyy | BBB222 |
3 | 300 | ccc | zzz | CCC333 |
However what I need looks more like this, where the CASE WHEN statement didn't stop evaluating after the first match, and instead evaluated all conditions and accumulated all matches into, say, an array
Id | Col1 | Col2 | Col3 | CalculatedValue |
---|---|---|---|---|
1 | 100 | aaa | xxx | [AAA111, BBB222] |
2 | 200 | aaa | yyy | BBB222 |
3 | 300 | ccc | zzz | CCC333 |
Any ideas? Thanks