I'm trying to do something that seems fairly straightforward, but I cannot figure out how to do it with pyspark.
I have a df with two columns (to simplify), 'id' and 'strCol', where ids may be duplicated.
I want to do a df.groupBy('id') that returns, for each id, an array of its strCol values.
Simple example:
|--id--|--strCol--|
| a | {'a':1} |
| a | {'a':2} |
| b | {'b':3} |
| b | {'b':4} |
|------|----------|
would become
|--id--|-------aggsStr------|
| a | [{'a':1},{'a':2}] |
| b | [{'b':3},{'b':4}] |
|------|--------------------|
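To pin down the result I'm after, here is the same grouping sketched in plain Python rather than pyspark. The names `rows` and `group_values` are made up for this illustration, and I'm assuming the strCol values are plain strings:

```python
from collections import defaultdict

# Rows mirroring the example table: (id, strCol) pairs.
# Assumption: strCol holds plain strings.
rows = [
    ("a", "{'a':1}"),
    ("a", "{'a':2}"),
    ("b", "{'b':3}"),
    ("b", "{'b':4}"),
]


def group_values(pairs):
    """Collect every value seen for each id, keeping input order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


result = group_values(rows)
for key, values in result.items():
    print(key, values)
```

What I'm looking for is the pyspark equivalent of this, one output row per id with all of its values gathered into a single array column.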
I tried to use apply with a pandas UDF, but it seems to refuse to return arrays (or maybe I didn't use it correctly).