
Recently, I have been working on a large key-name-value dataset. I want to group by the key, pivot on the name, and take the first value for each group to generate new columns.

The operation is the following (in Spark SQL, via the Scala DataFrame API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.first

val df: DataFrame = ??? // the ~600 MB key-name-value input, loaded elsewhere
val pivoted = df.groupBy("key").pivot("name").agg(first("value"))
// all executors go out of memory for an input file of 600 MB
pivoted.write.parquet("...")
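For reference, here is a minimal sketch of what this transformation does on toy data (the column names match the snippet above; the toy rows are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val toy = Seq(
  ("k1", "a", "v1"),
  ("k1", "b", "v2"),
  ("k2", "a", "v3")
).toDF("key", "name", "value")

// one column per distinct "name"; missing (key, name) pairs become null
toy.groupBy("key").pivot("name").agg(first("value")).show()
// +---+----+----+
// |key|   a|   b|
// +---+----+----+
// | k1|  v1|  v2|
// | k2|  v3|null|
// +---+----+----+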

The problem is that about 5,000 columns need to be generated, along with a lot of null values for every key. Since Spark seems to build a conditional (if/else) expression for each new column, I wonder what the time and space complexity of this operation is, given that all the executors run out of memory.
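For what it is worth, here is a sketch of a variant I have looked at (an assumption on my part, not a confirmed fix for the out-of-memory errors): the two-argument form of pivot takes the distinct values explicitly, which avoids the extra job Spark otherwise runs to collect them and makes the output width explicit:

import org.apache.spark.sql.functions.first

// collect the ~5000 distinct names once (or supply them from elsewhere)
val names: Seq[Any] =
  df.select("name").distinct().collect().map(_.getString(0)).toSeq

// pivot(column, values) skips the internal distinct-collection job;
// without explicit values, Spark caps the distinct count at
// spark.sql.pivotMaxValues (default 10000)
val pivotedExplicit = df.groupBy("key").pivot("name", names).agg(first("value"))
pivotedExplicit.explain() // shows one aggregate expression per pivot value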

Thanks in advance

username_HI

0 Answers