I have the following (simplified) schema:
root
|-- event: struct (nullable = true)
| |-- spent: struct (nullable = true)
| | |-- amount: decimal(34,3) (nullable = true)
| | |-- currency: string (nullable = true)
| |
| |-- ... (~20 other struct fields at the "event" level)
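For context, the setup looks roughly like this (a minimal sketch; the Parquet path is hypothetical, but the table is registered as a temp view named "event", matching the queries below):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-sum").getOrCreate()

// Hypothetical input path; the real location isn't relevant to the question.
val events = spark.read.parquet("/data/events.parquet")

// Register the DataFrame so it can be queried as "event" below.
events.createOrReplaceTempView("event")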
I'm trying to sum over a nested field:
spark.sql("select sum(event.spent.amount) from event")
According to Spark metrics, this reads 18 GB from disk and takes 2.5 minutes.
However, when I select a top-level field (present in the full schema but omitted from the simplified one above):
spark.sql("select sum(amount) from event")
only 2 GB are read, in 4 seconds.
From the physical plan I can see that in the nested case the whole event struct, with all of its fields, is read from Parquet, which is wasteful.
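This is how I inspected the plan; the FileScan parquet node in the output has a ReadSchema entry showing exactly which columns Spark will read:

// ReadSchema in the FileScan node lists the columns read from Parquet.
spark.sql("select sum(event.spent.amount) from event").explain(true)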
The Parquet format should be able to serve the desired column from a nested structure without reading the whole struct (that is the point of a columnar store). Is there some way to do this efficiently in Spark?
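For what it's worth, two directions I can imagine, though I don't know whether either is the intended solution: the nested-schema-pruning optimizer flag (spark.sql.optimizer.nestedSchemaPruning.enabled, which I believe exists as an experimental option from Spark 2.4 on), or forcing the issue by reading with an explicitly pruned schema. A sketch of both, under those assumptions:

// Option 1 (assuming Spark 2.4+): ask the optimizer to prune nested columns.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.sql("select sum(event.spent.amount) from event").explain(true)

// Option 2: supply a user schema containing only the needed nested field,
// so the Parquet reader never materializes the other ~20 struct fields.
import org.apache.spark.sql.functions.sum
val pruned = spark.read
  .schema("event struct<spent: struct<amount: decimal(34,3)>>")
  .parquet("/data/events.parquet") // hypothetical path from above
pruned.select(sum("event.spent.amount")).show()

Does the flag in option 1 actually make the scan read only spent.amount, or is the explicit-schema workaround the only reliable way?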