I'm currently working on a large key-name-value dataset. I want to group by the name, pivot on the key, and select the first value of each group to generate new columns.
The operation looks like this (in Spark SQL):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.first

val df: DataFrame = ??? // the key-name-value input
val pivoted = df.groupBy("key").pivot("name").agg(first("value"))
// all executors go out of memory for an input file of 600MB
pivoted.write.parquet("...")
The problem is that about 5000 columns need to be generated, along with a lot of null values for every key. Since Spark SQL appears to build an if-else (CASE WHEN) expression for each new column, I'm wondering what the time and space complexity of this operation is, given that all the executors run out of memory.
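For reference, one thing I've seen suggested is passing the pivot values explicitly instead of letting Spark discover them, which at least avoids the extra job that collects the distinct names. A minimal sketch, assuming the list of names is known up front (knownNames below is a placeholder):

import org.apache.spark.sql.functions.first

// Hypothetical list of the distinct values of "name"; in practice this
// would be the ~5000 names, collected once or known from the schema.
val knownNames: Seq[String] = Seq("name1", "name2")

// pivot(column, values) skips the distinct-values scan, but it still
// produces one output column (and one aggregate expression) per name.
val pivoted = df.groupBy("key").pivot("name", knownNames).agg(first("value"))

I don't expect this to change the per-column cost itself, which is what my question is really about.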
Thanks in advance