
I have a data frame named "household" with the following schema:

root
 |-- country_code: string (nullable = true)
 |-- region_code: string (nullable = true)
 |-- individuals: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- individual_id: string (nullable = true)
 |    |    |-- ids: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- id_last_seen: date (nullable = true)
 |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |-- value: string (nullable = true)
 |    |    |    |    |-- year_released: integer (nullable = true)

I can use the following code to find the households that contain at least one device released after the year 2018:

val sql = """
select household_id
from household
where exists(individuals, id -> exists(id.ids, dev -> dev.year_released > 2018))
"""
val v = spark.sql(sql)

It works well; however, I found that the Spark query planner was not able to prune the unneeded columns. The plan shows that Spark has to read all columns of the nested structs.
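
For reference, this is how the read columns can be inspected from the physical plan; a minimal sketch, assuming the table is backed by a file-based source such as Parquet:

// Print the parsed, analyzed, optimized and physical plans.
// In the physical plan, the FileScan node's ReadSchema attribute
// lists the columns Spark will actually read from disk.
spark.sql(sql).explain(true)

// The same plan is also available programmatically:
println(spark.sql(sql).queryExecution.executedPlan)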

I tested this with Spark 2.4.5 and 3.0.0 and got the same result.

I just wonder whether Spark supports, or will add support for, column scan pruning over an array of structs.

  • Did your query run slowly? As far as I know, pruning only applies to partitions in Spark. – Lars Skaug Jul 20 '20 at 15:21
  • It was slow, as Spark had to read all the columns in the nested structs even though only one was actually used. – seiya Jul 20 '20 at 16:12
  • You would probably need to normalize the ids array into a new dataframe (a rough sketch follows after these comments). That would only be worth it if the select statement will be run frequently, however. – Lars Skaug Jul 20 '20 at 16:45
  • Recent Spark versions do support pruning nested columns, but still not 100%. You can try out https://github.com/taboola/ScORe, which might work. You can also provide the schema manually. – Lior Chaga Jun 03 '21 at 17:13
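
A rough sketch of the normalization idea from the comments, using explode (the view name household_ids is hypothetical; the column names come from the schema above):

import org.apache.spark.sql.functions.{col, explode}

// One row per (household, individual, id), keeping only the columns
// the query actually needs, so the wide nested structs are read once.
val flattened = household
  .select(col("country_code"), col("region_code"),
          explode(col("individuals")).as("ind"))
  .select(col("country_code"), col("region_code"),
          col("ind.individual_id"), explode(col("ind.ids")).as("id"))
  .select(col("country_code"), col("region_code"),
          col("individual_id"), col("id.year_released"))

flattened.createOrReplaceTempView("household_ids")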

1 Answer


Yes.

To activate nested schema pruning, you have to set the following option on the Spark session:

spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

See this answer: Efficient reading nested parquet column in Spark
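
For completeness, a minimal sketch of verifying that the option takes effect (the /tmp/household path and the Parquet round trip are assumptions for illustration):

// The option must be set before the query is planned.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// Hypothetical round trip through Parquet, a columnar source the
// pruning rule is implemented for.
household.write.mode("overwrite").parquet("/tmp/household")
spark.read.parquet("/tmp/household").createOrReplaceTempView("household")

// With pruning enabled, the FileScan's ReadSchema should mention only
// individuals.ids.year_released instead of every nested field.
spark.sql(sql).explain(true)

As a side note, this option is off by default in Spark 2.4.x and, as far as I know, on by default since 3.0.0.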
