You can convert the dataframe into an RDD and then back into a dataframe. When re-creating the dataframe, you can supply a schema in which the nested field names are unique.
I use a simplified example where the field name field2 is not unique:
df = ...
df.printSchema()
#root
# |-- INFO_CSQ: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- field1: string (nullable = true)
# | | |-- field2: string (nullable = true)
# | | |-- field2: string (nullable = true)
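(For reference, a dataframe with such a duplicate-name schema can be built directly, since Spark only rejects duplicate names on access, not on creation. The sketch below uses made-up values just to reproduce the printout above.)

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

element_type = StructType([
    StructField("field1", StringType()),
    StructField("field2", StringType()),
    StructField("field2", StringType()),  # deliberate duplicate
])
dup_schema = StructType([StructField("INFO_CSQ", ArrayType(element_type))])

# Values are matched to fields by position, so the duplicate name is accepted
df = spark.createDataFrame([([("a", "b", "c")],)], dup_schema)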
import copy

# Work on a deep copy so the original dataframe's schema is not mutated
schema_with_renames = copy.deepcopy(df.schema)

# Iterate over all struct fields inside the array and append a numeric
# suffix wherever a name has been seen before
seen_fields = {}
for f in schema_with_renames["INFO_CSQ"].dataType.elementType.fields:
    if f.name in seen_fields:
        seen_fields[f.name] += 1
        f.name = f.name + str(seen_fields[f.name])
    else:
        seen_fields[f.name] = 0
df2 = spark.createDataFrame(df.rdd, schema_with_renames)
df2.printSchema()
#root
# |-- INFO_CSQ: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- field1: string (nullable = true)
# | | |-- field2: string (nullable = true)
# | | |-- field21: string (nullable = true)
Now you can either drop or ignore the renamed field field21.
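For example, to actually drop it you can rebuild every struct in the array without field21. A minimal sketch, assuming Spark 2.4+ for the transform higher-order function:

from pyspark.sql.functions import expr

# Keep only field1 and field2 in each struct; the duplicate field21 is dropped
df3 = df2.withColumn(
    "INFO_CSQ",
    expr("transform(INFO_CSQ, x -> named_struct('field1', x.field1, 'field2', x.field2))"),
)
df3.printSchema()  # INFO_CSQ elements now contain only field1 and field2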