
I need to create a dictionary from a Spark DataFrame's schema, which is of type pyspark.sql.types.StructType.

The code needs to go through the entire StructType and pick out only those StructField elements whose data type is itself a StructType. In the resulting dictionary, the key should be the name of the parent StructField and the value should be the name of its first nested (child) StructField only.

Example schema (StructType):

root
|-- field_1: int
|-- field_2: int
|-- field_3: struct
|    |-- date: date
|    |-- timestamp: timestamp
|-- field_4: int

Desired result:

{"field_3": "date"}
ZygD
marcin2x4

1 Answer

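You can use a dictionary comprehension that iterates over the schema. Iterating over df.schema yields its StructField objects, and indexing a nested StructType with [0] returns its first child field.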

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}

Test #1

df = spark.createDataFrame([], 'field_1 int, field_2 int, field_3 struct<date:date,timestamp:timestamp>, field_4 int')

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: integer (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)
#  |-- field_4: integer (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_3': 'date'}

Test #2

df = spark.createDataFrame([], 'field_1 int, field_2 struct<col_int:int,col_long:long>, field_3 struct<date:date,timestamp:timestamp>')

df.printSchema()
# root
#  |-- field_1: integer (nullable = true)
#  |-- field_2: struct (nullable = true)
#  |    |-- col_int: integer (nullable = true)
#  |    |-- col_long: long (nullable = true)
#  |-- field_3: struct (nullable = true)
#  |    |-- date: date (nullable = true)
#  |    |-- timestamp: timestamp (nullable = true)

{x.name: x.dataType[0].name for x in df.schema if x.dataType.typeName() == 'struct'}
# {'field_2': 'col_int', 'field_3': 'date'}
ZygD
  • One of the most elegant ways I've seen! Thank you so much! – marcin2x4 Nov 04 '22 at 00:28
  • I tried to add one more `field_5 choice` field but I'm getting an error: `mismatched input '<' expecting {, '(', ',', 'COMMENT', NOT}(line 1, pos 100)` – marcin2x4 Nov 04 '22 at 13:00
  • Your syntax is off. `field_5 choice` - instead of `choice` you should specify the data type, so it should be `struct`, like this: `field_5 struct` – ZygD Nov 04 '22 at 15:06
  • That might be caused by wrangling my dataset between spark and glue context... – marcin2x4 Nov 04 '22 at 15:23