
Recently, I have been working on a large key-name-value dataset. I want to group by the key, pivot on the name, and take the first value for each group to generate new columns.

The operation is the following (in Spark SQL, via the Scala DataFrame API):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.first

val df: DataFrame = ??? // the ~600 MB key-name-value input, loaded elsewhere
val pivoted = df.groupBy("key").pivot("name").agg(first("value"))
// all executors go out of memory for an input file of 600 MB
pivoted.write.parquet("...")
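For reference, here is a minimal sketch of what this transformation does on toy data (the column names match the snippet above; the toy rows are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val toy = Seq(
  ("k1", "a", "v1"),
  ("k1", "b", "v2"),
  ("k2", "a", "v3")
).toDF("key", "name", "value")

// one column per distinct "name"; missing (key, name) pairs become null
toy.groupBy("key").pivot("name").agg(first("value")).show()
// +---+----+----+
// |key|   a|   b|
// +---+----+----+
// | k1|  v1|  v2|
// | k2|  v3|null|
// +---+----+----+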

The problem is that about 5,000 columns need to be generated, along with a lot of null values for every key. Since Spark seems to build a conditional (if/else) expression for each new column, I wonder what the time and space complexity of this operation is, given that all the executors run out of memory.
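For what it is worth, here is a sketch of a variant I have looked at (an assumption on my part, not a confirmed fix for the out-of-memory errors): the two-argument form of pivot takes the distinct values explicitly, which avoids the extra job Spark otherwise runs to collect them and makes the output width explicit:

import org.apache.spark.sql.functions.first

// collect the ~5000 distinct names once (or supply them from elsewhere)
val names: Seq[Any] =
  df.select("name").distinct().collect().map(_.getString(0)).toSeq

// pivot(column, values) skips the internal distinct-collection job;
// without explicit values, Spark caps the distinct count at
// spark.sql.pivotMaxValues (default 10000)
val pivotedExplicit = df.groupBy("key").pivot("name", names).agg(first("value"))
pivotedExplicit.explain() // shows one aggregate expression per pivot value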

Thanks in advance

username_HI

0 Answers