3

Is there any performance benefit to using nested data types in the Parquet file format?

AFAIK Parquet files are usually created specifically for query services, e.g. Athena, so the process which creates them might as well simply flatten the values - allowing easier querying, a simpler schema, and retained column statistics for each column.
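
For concreteness, this is the kind of flattening I have in mind - a rough sketch with made-up field names and paths, promoting struct fields to top-level columns before writing Parquet:

```scala
// Minimal sketch of "flattening" before writing Parquet.
// The source path and field names (user.id, user.name, ts) are made up.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("flatten-example").getOrCreate()
import spark.implicits._

val nested = spark.read.json("events.json") // user: struct<id: long, name: string>, ts: long

val flattened = nested.select(
  $"user.id".as("user_id"),     // promote each nested field to a top-level column
  $"user.name".as("user_name"),
  $"ts"
)

flattened.write.parquet("events_flat.parquet")
```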

What benefit is there to be gained by using nested data types e.g. struct?

user976850

2 Answers

6

There is a negative consequence to keeping a nested structure in Parquet: Spark's predicate pushdown doesn't work properly if you have nested structures in the Parquet file.

So even if you are working with only a few fields in your Parquet dataset, Spark will load and materialize the entire dataset.
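
As a rough illustration (dataset and column names are hypothetical), even a query that touches only one nested field behaves this way on Spark 2.3 and earlier - the ReadSchema shown by explain() still contains the whole struct:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nested-read").getOrCreate()

// Hypothetical dataset with a column payload: struct<id: long, body: string, ...>
val df = spark.read.parquet("events.parquet")

// Only payload.id is requested, but on Spark <= 2.3 the physical plan's
// ReadSchema still lists the full payload struct, i.e. it is read entirely.
df.select("payload.id").explain()
```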

Here is the ticket that has been open for a long time regarding this issue.

EDIT

The issue has been resolved in Spark 2.4.
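
If I remember correctly, in 2.4 the fix sits behind a configuration flag, so it may need to be switched on explicitly (flag name as I recall it from the SPARK-4502 work):

```scala
// Assuming an existing SparkSession named `spark`.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// With pruning enabled, the ReadSchema in the plan should contain only
// payload.id instead of the whole struct (same hypothetical dataset as above).
spark.read.parquet("events.parquet").select("payload.id").explain()
```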

Avishek Bhattacharya
  • Doesn't the nested data schema allow for predicate pushdown, i.e. "column statistics" for each nested value as well? I believe I saw such values when looking inside a Parquet file. (Or perhaps you meant that Spark specifically can't handle this metadata?) Are you saying it's better to flatten the data altogether? I thought the whole purpose of Parquet was native support for nested data. – user976850 Mar 25 '18 at 14:03
  • Yes, it is better to flatten the data. I faced the same issue: I kept the data nested, and after flattening, the queries on the DataFrame became very fast. Here is another ticket, https://issues.apache.org/jira/browse/SPARK-4502, where they are trying to fix the issue – Avishek Bhattacharya Mar 25 '18 at 16:35
  • 1
    It was fixed with the 2.4.0 release, so please update the answer. – UninformedUser Jun 30 '19 at 13:17
0

Quite the opposite - Parquet is a columnar format; however, as of Spark 2.3.0, Spark doesn't utilize it properly (see https://issues.apache.org/jira/browse/SPARK-4502), and using a struct/nested format means the whole column will be read, so you can't benefit from reading just the data needed.

Re the answer by @avishek: note that predicate pushdown doesn't mean that Spark (or any engine that utilizes that feature of Parquet) will read the whole dataset. It means the engine can use the metadata about columns (like min/max values) to determine whether a chunk should be read or not. If a chunk does need to be read, Parquet will allow reading just the requested columns.
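
A small sketch of what pushdown looks like in practice (column names are made up; assumes an existing SparkSession named spark):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical dataset with top-level columns id and ts.
val events = spark.read.parquet("events.parquet")

// The filter is handed to the Parquet reader, which can skip row groups whose
// min/max statistics for ts rule out matches; only the requested columns
// (id, ts) are then read from disk.
events
  .filter(col("ts") > 1500000000L)
  .select("id", "ts")
  .explain() // look for PushedFilters in the physical plan
```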

Edited: moved info from comment to main answer

Arnon Rotem-Gal-Oz
  • But doesn't each nested field count as an additional "column"? – user976850 Mar 25 '18 at 14:05
  • From Parquet's point of view, yes (though it is a little more involved than regular columns; see https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html) - Spark, however, doesn't use that (yet): https://issues.apache.org/jira/browse/SPARK-4502 – Arnon Rotem-Gal-Oz Mar 25 '18 at 14:49