
My team is building an ETL process to load raw delimited text files into a Parquet-based "data lake" using Spark. One of the promises of the Parquet column store is that a query will read only the necessary "column stripes".

But we're seeing unexpected columns being read for nested schema structures.

To demonstrate, here is a POC using Scala and the Spark 2.0.1 shell:

// Preliminary setup
sc.setLogLevel("INFO")
import org.apache.spark.sql.types._
import org.apache.spark.sql._

// Create a schema with nested complex structures
val schema = StructType(Seq(
    StructField("F1", IntegerType),
    StructField("F2", IntegerType),
    StructField("Orig", StructType(Seq(
        StructField("F1", StringType),
        StructField("F2", StringType))))))

// Create some sample data
val data = spark.createDataFrame(
    sc.parallelize(Seq(
        Row(1, 2, Row("1", "2")),
        Row(3, null, Row("3", "ABC")))),
    schema)

// Save it
data.write.mode(SaveMode.Overwrite).parquet("data.parquet")

Then we read the file back into a DataFrame and project to a subset of columns:

// Read it back into another DataFrame
val df = spark.read.parquet("data.parquet")

// Select & show a subset of the columns
df.select($"F1", $"Orig.F1").show

When this runs we see the expected output:

+---+-------+
| F1|Orig_F1|
+---+-------+
|  1|      1|
|  3|      3|
+---+-------+

But... the query plan tells a slightly different story:

The "optimized plan" shows:

val projected = df.select($"F1", $"Orig.F1".as("Orig_F1"))
projected.queryExecution.optimizedPlan
// Project [F1#18, Orig#20.F1 AS Orig_F1#116]
// +- Relation[F1#18,F2#19,Orig#20] parquet

And "explain" shows:

projected.explain
// == Physical Plan ==
// *Project [F1#18, Orig#20.F1 AS Orig_F1#116]
// +- *Scan parquet [F1#18,Orig#20] Format: ParquetFormat, InputPaths: hdfs://sandbox.hortonworks.com:8020/user/stephenp/data.parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<F1:int,Orig:struct<F1:string,F2:string>>

And the INFO logs produced during execution also confirm that the Orig.F2 column is unexpectedly read:

16/10/21 15:13:15 INFO parquet.ParquetReadSupport: Going to read the following fields from the Parquet file:

Parquet form:
message spark_schema {
  optional int32 F1;
  optional group Orig {
    optional binary F1 (UTF8);
    optional binary F2 (UTF8);
  }
}

Catalyst form:
StructType(StructField(F1,IntegerType,true), StructField(Orig,StructType(StructField(F1,StringType,true), StructField(F2,StringType,true)),true))

According to the Dremel paper and the Parquet documentation, columns for complex nested structures should be independently stored and independently retrievable.

Questions:

  1. Is this behavior a limitation of the current Spark query engine? In other words, does Parquet support optimally executing this query, but Spark's query planner is naive?
  2. Or, is this a limitation of the current Parquet implementation?
  3. Or, am I not using the Spark APIs correctly?
  4. Or, am I misunderstanding how Dremel/Parquet column storage is supposed to work?
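For what it's worth, the closest thing to a workaround we've found is to hand Spark an explicitly pruned read schema, so that the requested Catalyst schema never mentions Orig.F2 at all. This is only a sketch, under the assumption that Spark clips the Parquet file schema to the user-supplied schema; I haven't verified that it avoids the extra column I/O on every version:

// Hypothetical workaround: read with an explicitly pruned schema
val prunedSchema = StructType(Seq(
    StructField("F1", IntegerType),
    StructField("Orig", StructType(Seq(
        StructField("F1", StringType))))))

// Only F1 and Orig.F1 appear in the schema requested from Parquet
val prunedDf = spark.read.schema(prunedSchema).parquet("data.parquet")
prunedDf.select($"F1", $"Orig.F1".as("Orig_F1")).show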

Possibly related: Why does the query performance differ with nested columns in Spark SQL?

Peter Stephens
  • It is a Spark query engine issue. –  Oct 21 '16 at 20:38
  • @LostInOverflow, do you know if this is in the Spark issue tracker? https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:issues-panel – Peter Stephens Oct 21 '16 at 20:54
  • It appears that Parquet should support this scenario according to @julien-le-dem https://twitter.com/J_/status/789584704169123841 – Peter Stephens Oct 21 '16 at 22:39
  • https://github.com/apache/spark/pull/16578 solves the problem. – Gaurav Shah Jul 22 '17 at 02:37
  • @GauravShah, looking forward to seeing this, hopefully in 2.3. Will try and create/mark this as the accepted answer once available in a production release. – Peter Stephens Jul 27 '17 at 18:05
  • A small update-- the patch is slowly progressing through the Spark development process. Now maybe targeting 2.4 per this patch: https://github.com/apache/spark/pull/16578 – Peter Stephens Mar 28 '18 at 18:25
  • An update: the previous PR has been closed without resolution, and a new simplified PR has been opened. Track the new one here: https://github.com/apache/spark/pull/21320 – Peter Stephens Jun 12 '18 at 14:28
  • 2.4.0 appears to fix this problem. I'll see if I can test the fix and then write up the answer. – Peter Stephens Nov 09 '18 at 18:41
  • Update: 2.4.0 does not seem to fix this problem. The F2 column is still read even though it is not specified in the projection. – Peter Stephens Nov 09 '18 at 21:51
  • did you turn on the sql flag to enable this optimization ? 2.4.0 seems to fix it for me – Gaurav Shah Nov 12 '18 at 04:51

2 Answers


It's a limitation of the Spark query engine at the moment; the relevant JIRA ticket is below. Spark only handles predicate push-down for simple types in Parquet, not for nested StructTypes:

https://issues.apache.org/jira/browse/SPARK-17636
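To see the push-down limitation in isolation, here is a sketch against the question's data.parquet file (the exact plan text varies by version; the comments describe the expected plan shape, not verbatim output):

// Sketch: a filter on a nested struct field
val df = spark.read.parquet("data.parquet")

// On affected versions the physical plan shows PushedFilters: [] here,
// whereas an equivalent filter on the top-level F1 column is pushed down
df.filter($"Orig.F1" === "1").explain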

Ewan Leith
  • Limitations in predicate push-down shouldn't affect projections. The problem may be related, but it is not the same. –  Nov 11 '16 at 08:37
  • Sorry I used the word predicate in my answer but the linked JIRA ticket title is "Parquet filter push down doesn't handle struct fields" – Ewan Leith Nov 25 '16 at 09:23
  • Not sure this is the answer, but will take a look. The OP doesn't have a filter and so predicate pushdown shouldn't apply. – Peter Stephens Dec 07 '16 at 18:17

The issue has been fixed since Spark 2.4.0. This applies to structs as well as arrays of structs.

Before Spark 3.0.0:

Set spark.sql.optimizer.nestedSchemaPruning.enabled to true

See related Jira here: https://issues.apache.org/jira/browse/SPARK-4502

After Spark 3.0.0:

spark.sql.optimizer.nestedSchemaPruning.enabled now defaults to true

Related Jira here: https://issues.apache.org/jira/browse/SPARK-29805

Also related SO question: Efficient reading nested parquet column in Spark
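A minimal sketch of enabling the flag on a 2.4.x build and confirming the pruning (the comments describe the expected plan shape, not verbatim output):

// Enable nested schema pruning (off by default before Spark 3.0.0)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

val df = spark.read.parquet("data.parquet")
val projected = df.select($"F1", $"Orig.F1".as("Orig_F1"))

// With pruning active, ReadSchema in the physical plan should list only
// the selected nested field, e.g. struct<F1:int,Orig:struct<F1:string>>
projected.explain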

Joshua Chen