
The documentation on Parquet files indicates that it can store / handle nested data types. However, I am unable to find much more information on best practices / pitfalls / ... when storing these nested datatypes to Parquet.

I am considering the following scenario:

  1. I am using PySpark (Spark 3.3) to store my Spark DataFrame to a Delta Lake file (which uses Parquet files under the hood).
  2. The Spark DataFrame has a nested column of datatype StructType in addition to many (100+) "regular" columns with a singular datatype.
  3. This nested column will have many sub-columns (100+)

Think something along the lines of

root
 |-- id: long (nullable = true)
 |-- key: string (nullable = true)
 |-- info: struct (nullable = false)
 |    |-- topic_1: string (nullable = true)
 |    |-- digit_1: long (nullable = true)
 | ...

Questions I have regarding this:

  1. Will the nested information be stored as a single column, or will I find columns `info.topic_1`, `info.digit_1`, ...?
  2. What about array columns or mapping columns?
  3. A lot of older SO posts seem to indicate that more columns than necessary will be read when nested columns are present. Is this still a problem in Spark 3?
  4. Any best practices of storing these nested datatypes?

I am aware of the following SO question, but people indicated it was for Spark 2.4

bramb

1 Answer


You don't need to worry about extra columns being read, because Parquet is a columnar format, nor about pushdown for performance.

The nested attributes are each stored as a new column. It's hard to explain, but definition levels and repetition levels are key concepts.
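To make the "each nested attribute is its own column" idea concrete, here is a minimal pure-Python sketch. It is a simplified model and not Parquet's or Spark's actual implementation: the dict-based schema representation and the helper names `leaf_columns` and `definition_level` are made up for illustration, and every field is assumed to be optional (in the question's schema `info` is non-nullable, so it would not add a level there). Repetition levels, which only matter for arrays and maps, are left out.

```python
# Simplified model: a schema is a dict, and a nested dict stands for a struct.
# Assumption: every field is optional (nullable).
schema = {
    "id": "long",
    "key": "string",
    "info": {"topic_1": "string", "digit_1": "long"},
}


def leaf_columns(schema, prefix=""):
    """Return the dotted path of every primitive leaf field.

    In Parquet, each such leaf is stored as a separate column.
    """
    paths = []
    for name, dtype in schema.items():
        path = f"{prefix}{name}"
        if isinstance(dtype, dict):
            # A struct has no data of its own; recurse into its fields.
            paths.extend(leaf_columns(dtype, prefix=path + "."))
        else:
            paths.append(path)
    return paths


def definition_level(record, path):
    """Count how many optional fields along `path` are actually present.

    A level equal to the path depth means the leaf value is defined;
    a smaller level tells the reader at which depth the null occurred.
    """
    level = 0
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or current.get(part) is None:
            break
        current = current[part]
        level += 1
    return level


columns = leaf_columns(schema)
print(columns)

rows = [
    {"id": 1, "key": "a", "info": {"topic_1": "x", "digit_1": 7}},
    {"id": 2, "key": "b", "info": {"topic_1": None, "digit_1": 3}},
    {"id": 3, "key": "c", "info": None},
]
levels = [definition_level(row, "info.topic_1") for row in rows]
print(levels)
```

Under these assumptions the struct itself never appears as a stored column, only its leaves `info.topic_1` and `info.digit_1` do, and the definition level is what lets a reader tell a null `info` apart from a null `info.topic_1`.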

Please consult 2 excellent posts:

BTW: not sure why someone felt this question needs more focus.

thebluephantom
    Thanks! Perhaps an odd follow-up question, but what if you would store a mapping column to a Parquet where the datatype of a value at a specific key could change? Would that still work? – bramb Nov 14 '22 at 20:05
    @bramb pls elaborate with example or new question. Not sure I know the answer. This is what I know http://cloudsqale.com/2020/06/18/how-map-column-is-written-to-parquet-converting-json-to-map-to-increase-read-performance/ – thebluephantom Nov 14 '22 at 20:38