
I am trying to lowercase all column names of a PySpark DataFrame schema, including the element names of complex-type columns.

Example:

original_df
 |-- USER_ID: long (nullable = true)
 |-- COMPLEX_COL_ARRAY: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- KEY: timestamp (nullable = true)
 |    |    |-- VALUE: integer (nullable = true)
target_df
 |-- user_id: long (nullable = true)
 |-- complex_col_array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: timestamp (nullable = true)
 |    |    |-- value: integer (nullable = true)

However, so far I've only been able to lowercase the top-level column names, using the script below:

from pyspark.sql.types import StructField

schema = df.schema
# Note: nullable and metadata must be passed through, or they are lost
schema.fields = [
    StructField(field.name.lower(), field.dataType, field.nullable, field.metadata)
    for field in schema.fields
]

I know I can access the field names of nested elements using this syntax:

for f in schema.fields:
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        print(f.dataType.elementType.fieldNames())

But how can I modify the case of these field names?

Thanks for your help!

jharb
  • I would like to lowercase all Parquet schemas, because I've been running into case-sensitivity issues between Hive, Parquet, JSON and Spark. – jharb Jun 15 '21 at 10:34

1 Answer


Answering my own question, inspired by this one: Rename nested field in spark dataframe

from pyspark.sql.types import StructField

# Read the Parquet file
path = "/path/to/data"
df = spark.read.parquet(path)
schema = df.schema

# Lowercase all top-level field names, preserving nullability and metadata
schema.fields = [
    StructField(field.name.lower(), field.dataType, field.nullable, field.metadata)
    for field in schema.fields
]

for f in schema.fields:
    # If the field is an array of structs, lowercase every element field name.
    # Both the StructField objects and the StructType's `names` list must be
    # updated, because StructType keeps the two in sync separately.
    if hasattr(f.dataType, 'elementType') and hasattr(f.dataType.elementType, 'fieldNames'):
        element_type = f.dataType.elementType
        for name in element_type.fieldNames():
            element_type[name].name = name.lower()
            element_type.names[element_type.names.index(name)] = name.lower()

# Recreate the dataframe with the lowercase schema; only the names changed,
# so the underlying rows still match.
df_lowercase = spark.createDataFrame(df.rdd, schema)
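A non-mutating alternative (a sketch of my own, not part of the answer above) is to round-trip the schema through its JSON representation: `StructType.jsonValue()` yields a plain dict in which every field name sits under a `"name"` key at any nesting depth, and `StructType.fromJson()` rebuilds the schema afterwards. Because the helper below works on plain dicts and lists, it handles structs nested inside arrays, maps, or other structs with no special cases. The sample dict mirrors the schema from the question so the sketch runs without a SparkSession.

```python
def lowercase_schema(node):
    """Recursively lowercase every "name" value in a schema dict,
    i.e. the structure returned by StructType.jsonValue().
    Caveat: a metadata dict that itself contains a "name" key
    would also be lowered by this simple sketch."""
    if isinstance(node, dict):
        return {
            key: value.lower() if key == "name" and isinstance(value, str)
            else lowercase_schema(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [lowercase_schema(item) for item in node]
    return node


# A dict mirroring the question's schema: long column plus array<struct<...>>
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "USER_ID", "type": "long", "nullable": True, "metadata": {}},
        {
            "name": "COMPLEX_COL_ARRAY",
            "nullable": True,
            "metadata": {},
            "type": {
                "type": "array",
                "containsNull": True,
                "elementType": {
                    "type": "struct",
                    "fields": [
                        {"name": "KEY", "type": "timestamp", "nullable": True, "metadata": {}},
                        {"name": "VALUE", "type": "integer", "nullable": True, "metadata": {}},
                    ],
                },
            },
        },
    ],
}

lowered = lowercase_schema(schema_json)
print(lowered["fields"][0]["name"])  # user_id
```

With Spark available, the full round trip would be `StructType.fromJson(lowercase_schema(df.schema.jsonValue()))`, followed by the same `spark.createDataFrame(df.rdd, new_schema)` step as above.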