Extract first fields from struct columns into a dictionary

Question

I need to create a dictionary from a Spark DataFrame's schema, which is of type pyspark.sql.types.StructType.

The code needs to go through the entire StructType and pick out only those StructField elements whose data type is itself a StructType. For each of them, the dictionary key should be the name of the parent StructField and the value should be the name of its first nested (child) StructField only.

Example schema (StructType):

root
|-- field_1: int
|-- field_2: int
|-- field_3: struct
|    |-- date: date
|    |-- timestamp: timestamp
|-- field_4: int

Desired result:

{"field_3": "date"}

ZygD · Accepted Answer · 2022-11-04T00:16:18.613

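You can use a dictionary comprehension that iterates over the top-level fields of the schema, keeps only the struct-typed ones, and takes the name of each one's first child field.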

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}

Test #1

df = spark.createDataFrame([], 'field_1 int, field_2 int, field_3 struct<date:date,timestamp:timestamp>, field_4 int')

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: integer (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)
#  |-- field_4: integer (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_3': 'date'}

Test #2

df = spark.createDataFrame([], 'field_1 int, field_2 struct<col_int:int,col_long:long>, field_3 struct<date:date,timestamp:timestamp>')

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: struct (nullable = true)
#  |    |-- col_int: integer (nullable = true)
#  |    |-- col_long: long (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_2': 'col_int', 'field_3': 'date'}

I tried to add one more `field_5 choice` field but getting an error: `mismatched input '<' expecting {, '(', ',', 'COMMENT', NOT}(line 1, pos 100)` — marcin2x4, Nov 04 '22 at 13:00
Your syntax is off. `field_5 choice` - instead of `choice` you should tell the data type, so it should be `struct`, like this: `field_5 struct` — ZygD, Nov 04 '22 at 15:06
That might be caused by wrangling my dataset between spark and glue context... — marcin2x4, Nov 04 '22 at 15:23
