Nested data types in Parquet

Question

The documentation on Parquet files indicates that the format can store nested data types. However, I am unable to find much information on best practices or pitfalls when storing nested data types in Parquet.

I am considering the following scenario:

I am using PySpark (Spark 3.3) to store my Spark DataFrame as a Delta Lake table (which uses Parquet files under the hood).
The Spark DataFrame has a nested column of datatype StructType in addition to many (100+) "regular" columns with a single datatype.
This nested column will have many sub-columns (100+).

Think something along the lines of

root
 |-- id: long (nullable = true)
 |-- key: string (nullable = true)
 |-- info: struct (nullable = false)
 |    |-- topic_1: string (nullable = true)
 |    |-- digit_1: long (nullable = true)
 | ...

Questions I have regarding this:

Will the nested information be stored as a single column, or will I find columns `info.topic_1`, `info.digit_1`, ...?
What about array columns or mapping columns?
A lot of older SO posts indicate that more columns than necessary will be read when nested columns are present. Is this still a problem in Spark 3?
Are there any best practices for storing these nested datatypes?

I am aware of the following SO question, but people indicated it was for Spark 2.4

Can accept answer pls? – thebluephantom May 23 '23 at 17:15

thebluephantom · Answer 1 · 2022-11-13T20:35:20.577

2

You don't need to worry about extra columns being read, since Parquet is columnar, nor about pushdown for performance.

The nested attributes are each stored as a new column. It's hard to explain, but definition levels and repetition levels are key concepts.
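To make this more concrete, here is a minimal sketch in plain Python of the idea, not of Parquet's actual writer. It models the schema from the question as a nested dict and shows (1) that every primitive sub-field becomes its own leaf column, and (2) how a definition level records how far down the path a value is actually present. The helper names `leaf_columns` and `definition_level` are made up for illustration, and the sketch assumes every field is optional (in the question `info` is non-nullable, which in real Parquet would not add to the definition level).

```python
# Toy model only: hypothetical helpers, not part of Spark, Delta Lake, or Parquet.

def leaf_columns(schema, prefix=""):
    """Return the dotted path of every primitive (leaf) field in a nested schema."""
    paths = []
    for name, field_type in schema.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(field_type, dict):
            # A struct has no storage of its own; recurse into its sub-fields.
            paths.extend(leaf_columns(field_type, path))
        else:
            # A primitive field is stored as one separate column.
            paths.append(path)
    return paths


def definition_level(record, path):
    """Count how many fields along `path` are present (non-None) in `record`.

    Assumption: every field on the path is optional, so each present
    field adds one level.
    """
    level = 0
    current = record
    for part in path.split("."):
        if current is None or current.get(part) is None:
            break
        current = current[part]
        level += 1
    return level


# Schema mirroring the one in the question (structs are nested dicts).
schema = {
    "id": "long",
    "key": "string",
    "info": {"topic_1": "string", "digit_1": "long"},
}

row_full = {"id": 1, "key": "a", "info": {"topic_1": "x", "digit_1": None}}
row_no_info = {"id": 2, "key": "b", "info": None}

print(leaf_columns(schema))
# ['id', 'key', 'info.topic_1', 'info.digit_1']

print(definition_level(row_full, "info.topic_1"))     # 2: info and topic_1 present
print(definition_level(row_full, "info.digit_1"))     # 1: info present, digit_1 null
print(definition_level(row_no_info, "info.topic_1"))  # 0: info itself is null
```

Because each leaf is a separate column, a reader that only needs `info.topic_1` can in principle skip the data for `info.digit_1`; the definition level is what lets it tell "the struct is null" apart from "the struct exists but this field is null".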

Please consult 2 excellent posts:

BTW: not sure why someone felt this needs more focus.

edited Nov 13 '22 at 20:35

answered Nov 13 '22 at 19:14

thebluephantom

Thanks! Perhaps an odd follow-up question, but what if you would store a mapping column to a Parquet where the datatype of a value at a specific key could change? Would that still work? – bramb Nov 14 '22 at 20:05

@bramb pls elaborate with example or new question. Not sure I know the answer. This is what I know http://cloudsqale.com/2020/06/18/how-map-column-is-written-to-parquet-converting-json-to-map-to-increase-read-performance/ – thebluephantom Nov 14 '22 at 20:38
