The Parquet documentation indicates that it can store/handle nested data types. However, I am unable to find much more information on best practices, pitfalls, etc. when storing these nested data types in Parquet.
I am considering the following scenario:
- I am using PySpark (Spark 3.3) to store my Spark DataFrame as a Delta Lake table (which uses Parquet files under the hood); a minimal code sketch of this setup follows the example schema below.
- The Spark DataFrame has a nested column of datatype `StructType`, in addition to many (100+) "regular" columns with a scalar datatype.
- This nested column will have many sub-columns (100+).
Think something along the lines of
root
|-- id: long (nullable = true)
|-- key: string (nullable = true)
|-- info: struct (nullable = false)
| |-- topic_1: string (nullable = true)
| |-- digit_1: long (nullable = true)
| ...
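For concreteness, a minimal sketch of this setup (the table path and column contents are made up, and this assumes the delta-spark package is configured on the session):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Two flat columns plus one struct column, mirroring the schema above.
df = (
    spark.range(10)  # yields an `id` long column
    .withColumn("key", F.concat(F.lit("key_"), F.col("id").cast("string")))
    .withColumn(
        "info",
        F.struct(
            F.lit("some topic").alias("topic_1"),
            F.col("id").alias("digit_1"),
            # ... 100+ more sub-columns in the real case
        ),
    )
)

# Write as a Delta table (path is a placeholder).
df.write.format("delta").mode("overwrite").save("/tmp/delta/my_table")
```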
Questions I have regarding this:
- Will the nested information be stored as a single column, or will I find separate columns `info.topic_1`, `info.digit_1`, ...? (I sketch below how I plan to check this.)
- What about array columns or map columns?
- A number of older SO posts indicate that more columns than necessary are read when nested columns are present. Is this still a problem in Spark 3? (See the `explain()` sketch below.)
- Are there any best practices for storing these nested data types?
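Regarding the first question, this is roughly how I plan to check the physical layout, using pyarrow to read the footer of one of the produced Parquet files (path is a placeholder):

```python
import glob
import pyarrow.parquet as pq

# Grab one of the Parquet data files that Delta wrote.
path = glob.glob("/tmp/delta/my_table/*.parquet")[0]
pf = pq.ParquetFile(path)

# The footer schema lists the physical leaf columns; I expect each
# struct field to show up as its own leaf (info.topic_1, info.digit_1, ...).
print(pf.schema)

# One chunk of column metadata per leaf column in each row group.
print(pf.metadata.row_group(0).num_columns)
```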
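Regarding the column-pruning question, this is the check I had in mind; as far as I understand, nested schema pruning is governed by `spark.sql.optimizer.nestedSchemaPruning.enabled`, which defaults to true in Spark 3:

```python
# Read the table back and select a single nested field; the ReadSchema
# entry in the physical plan should reveal whether only info.topic_1
# is read, or the whole info struct.
df = spark.read.format("delta").load("/tmp/delta/my_table")
df.select("id", "info.topic_1").explain()

# Nested-column pruning flag (on by default since Spark 3.0).
print(spark.conf.get("spark.sql.optimizer.nestedSchemaPruning.enabled"))
```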
I am aware of the following SO question, but people indicated that it applies to Spark 2.4.