
I have a question about the possibility of pruning nested fields.

I'm developing a data source for a High Energy Physics data format (ROOT).

Below is the schema of a file read with the DataSource I'm developing:

 root
 |-- EventAuxiliary: struct (nullable = true)
 |    |-- processHistoryID_: struct (nullable = true)
 |    |    |-- hash_: string (nullable = true)
 |    |-- id_: struct (nullable = true)
 |    |    |-- run_: integer (nullable = true)
 |    |    |-- luminosityBlock_: integer (nullable = true)
 |    |    |-- event_: long (nullable = true)
 |    |-- processGUID_: string (nullable = true)
 |    |-- time_: struct (nullable = true)
 |    |    |-- timeLow_: integer (nullable = true)
 |    |    |-- timeHigh_: integer (nullable = true)
 |    |-- luminosityBlock_: integer (nullable = true)
 |    |-- isRealData_: boolean (nullable = true)
 |    |-- experimentType_: integer (nullable = true)
 |    |-- bunchCrossing_: integer (nullable = true)
 |    |-- orbitNumber_: integer (nullable = true)
 |    |-- storeNumber_: integer (nullable = true)

The DataSource is here https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L62

When building a reader using the buildReaderWithPartitionValues method of the FileFormat:

override def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]

I see that requiredSchema always contains all of the fields/members of the top-level column being looked at. That is, when I select a particular nested field with df.select("EventAuxiliary.id_.run_"), requiredSchema is again the full struct for that top-level column ("EventAuxiliary"). I would expect the schema to be something like this:

root
|-- EventAuxiliary: struct...
|  |-- id_: struct ...
|  |    |-- run_: integer

since this is the only schema that has been required by the select statement.
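
For concreteness, this is how I would write that hoped-for pruned schema by hand with Spark's StructType API (the field names are just the ones from the printed schema above):

import org.apache.spark.sql.types._

// What I would hope requiredSchema looks like after
// df.select("EventAuxiliary.id_.run_"): only the selected leaf survives.
val expectedRequiredSchema = StructType(Seq(
  StructField("EventAuxiliary", StructType(Seq(
    StructField("id_", StructType(Seq(
      StructField("run_", IntegerType, nullable = true)
    )), nullable = true)
  )), nullable = true)
))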

Basically, I want to know how I can prune nested fields at the data source level. I thought requiredSchema would contain only the fields coming from the df.select.
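
One workaround I'm considering, purely a sketch and not something I've verified against the spark-root code linked above, is to let the caller pass a pre-pruned schema to the reader via DataFrameReader.schema(...), so the data source never has to discover the pruning itself (the format name matches the repo above; the file path is just a placeholder):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("root-nested-pruning-sketch").getOrCreate()

// Hand-pruned schema: keep only EventAuxiliary.id_.run_
val prunedSchema = StructType(Seq(
  StructField("EventAuxiliary", StructType(Seq(
    StructField("id_", StructType(Seq(
      StructField("run_", IntegerType)
    )))
  )))
))

// If the FileFormat honours the user-supplied schema, then dataSchema and
// requiredSchema passed to buildReaderWithPartitionValues should already be
// limited to this subtree, and the ROOT reader only needs to deserialize it.
val df = spark.read
  .format("org.dianahep.sparkroot.experimental")  // format name as in the repo linked above
  .schema(prunedSchema)                           // user-supplied schema instead of the inferred one
  .load("/path/to/file.root")                     // placeholder path

df.select("EventAuxiliary.id_.run_").show()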

I'm trying to see what Avro/Parquet are doing and found this: https://github.com/apache/spark/pull/14957/files

Any suggestions/comments would be appreciated!

Thanks!

VK

Viktor Khristenko
  • I believe this is a very common issue; the MongoDB plugin has fixed it, see https://github.com/Stratio/Spark-MongoDB/blob/master/spark-mongodb/src/main/scala/com/stratio/datasource/mongodb/MongodbRelation.scala. The Elasticsearch plugin has not yet :/ – Thomas Decaux Sep 04 '18 at 14:27
  • See https://stackoverflow.com/questions/57331007/efficient-reading-nested-parquet-column-in-spark – Tomas Bartalos Aug 06 '19 at 12:11

0 Answers