I have a DataFrame named "household" with the following schema (printSchema output):
root
 |-- household_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- region_code: string (nullable = true)
 |-- individuals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- individual_id: string (nullable = true)
 |    |    |-- ids: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- id_last_seen: date (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- year_released: integer (nullable = true)
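For reference, here is a minimal sketch of how I build and register such a table (assuming spark-shell, so spark and its implicits are in scope; the sample values and the /tmp/household path are made up). I write it out as Parquet and read it back, since scan pruning only matters for a file-based source:

import java.sql.Date
import spark.implicits._

// Case classes mirroring the schema above; `type` needs backticks in Scala,
// and Option[Int] gives a nullable integer column.
case class DeviceId(id_last_seen: Date, `type`: String, value: String,
                    year_released: Option[Int])
case class Individual(individual_id: String, ids: Seq[DeviceId])
case class Household(household_id: String, country_code: String,
                     region_code: String, individuals: Seq[Individual])

// Two sample rows: one household with a post-2018 device, one without.
Seq(
  Household("h1", "US", "CA", Seq(Individual("i1",
    Seq(DeviceId(Date.valueOf("2020-01-01"), "phone", "p-1", Some(2019)))))),
  Household("h2", "US", "NY", Seq(Individual("i2",
    Seq(DeviceId(Date.valueOf("2017-06-01"), "tablet", "t-1", Some(2016))))))
).toDF().write.mode("overwrite").parquet("/tmp/household")

spark.read.parquet("/tmp/household").createOrReplaceTempView("household")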
I can use the following code to find the households that contain at least one device released after 2018:
val sql = """
select household_id
from household
where exists(individuals, ind -> exists(ind.ids, dev -> dev.year_released > 2018))
"""
val v = spark.sql(sql)
It works well; however, I found that the Spark query planner is not able to prune the unneeded columns. The physical plan shows that Spark reads every field of the nested structs, even though the query only references year_released.
I tested this with Spark 2.4.5 and 3.0.0 and got the same result.
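For reference, this is roughly how I checked, using the nested-schema-pruning flag (off by default in 2.4.x and, as far as I can tell, on by default in 3.0.0); turning it on explicitly did not shrink the scan for this query in my tests:

// Look at the Parquet scan's ReadSchema in the physical plan; with pruning
// I would expect it to list only individuals.ids.year_released.
v.explain(true)

// Enable nested schema pruning explicitly and re-check the plan.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.sql(sql).explain(true)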
Just wondering: does Spark support, or are there plans to add support for, nested column pruning over an array of structs?
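In case it helps frame the question, the only workaround I can think of is to project the year_released leaf out of the nested arrays before filtering, in the hope that the scan then only needs that one field. This is a sketch of the idea, not a confirmed fix:

// Pull year_released out via transform/flatten, then filter on the flat array.
val workaround = """
select household_id
from (
  select household_id,
         flatten(transform(individuals,
                           ind -> transform(ind.ids, dev -> dev.year_released))) as years
  from household
)
where exists(years, y -> y > 2018)
"""
spark.sql(workaround).explain(true)

Whether the optimizer can actually push this projection down into the scan is exactly what I am unsure about.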