
I have a column in my dataframe called "INFO_CSQ". I want to remove the duplicated fields inside its struct, because the duplicates mean I can't use df.select("INFO_CSQ.xxx") — it fails with an ambiguous reference error.

If you need more information, please feel free to ask; I will answer as soon as possible.

Edit: I have seen many solutions that rename fields, but all of the ones I looked at type the schema manually, e.g. strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>", and then cast to a new dataframe. However, my schema changes depending on the input file.
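Since the schema varies per input file, a DDL string like the one above can be generated from the schema's field name/type pairs instead of typed by hand. A minimal sketch (the helper name `build_struct_ddl` is hypothetical, not from any library):

```python
def build_struct_ddl(fields):
    """Build an array<struct<...>> DDL string from (name, type) pairs.

    fields: list of (field_name, spark_type_name) tuples, e.g. taken
    from a StructType's fields after renaming duplicates.
    """
    inner = ",".join(f"{name}:{dtype}" for name, dtype in fields)
    return f"array<struct<{inner}>>"

print(build_struct_ddl([("a_renamed", "string"), ("b", "bigint"), ("c", "bigint")]))
# array<struct<a_renamed:string,b:bigint,c:bigint>>
```

The resulting string could then be used in a cast, the same way the manually typed strSchema is used in the solutions mentioned above.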

[screenshots: printSchema output showing the duplicated "TSL" field inside the INFO_CSQ struct]

phumidea
  • How have you created the dataframe to get the "TSL" column as a duplicate? – Vivs May 16 '21 at 08:24
  • It seems like you have two TSL columns. If they are duplicates, drop one; if they aren't, rename one. – crissal May 16 '21 at 08:41
  • @Vivs I use the glow library to read a VCF file as a dataframe, and the original VCF file contains the duplicate. https://glow.readthedocs.io/en/latest/etl/variant-data.html – phumidea May 16 '21 at 08:44
  • Based on your input, if the column names are exactly the same, you will need to manually specify the schema and skip the first row to avoid the headers while reading the VCF file. – Vivs May 16 '21 at 08:49
  • @crissal Both are duplicated. I tried to drop one, but I'm new to pyspark and am still looking for the right command. – phumidea May 16 '21 at 09:11
  • @Vivs I don't think [Allele, Consequence, ...] are top-level columns, so I can't rename them easily — they are inside a struct. I think it would be easier to extract them into a new dataframe. Also, glow is the library that reads the VCF file, and I don't see any parameter to skip the header, but I will try to follow your comment. – phumidea May 16 '21 at 09:22

1 Answer


You can transform the dataframe into an RDD and then back to a dataframe. When re-creating the dataframe, you can supply a schema where the column names are unique.

I use a simplified example where the field name field2 is not unique:

df = ...
df.printSchema()
#root
# |-- INFO_CSQ: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- field1: string (nullable = true)
# |    |    |-- field2: string (nullable = true)
# |    |    |-- field2: string (nullable = true)

import copy

# work on a deep copy so the original dataframe's schema stays untouched
schema_with_renames = copy.deepcopy(df.schema)
seen_fields = {}
# iterate over all struct fields and add a numeric suffix where necessary
# (schema_with_renames[0] is the INFO_CSQ column)
for f in schema_with_renames[0].dataType.elementType.fields:
    name = f.name
    suffix = ""
    if name in seen_fields:
        suffix = seen_fields[name] + 1
        seen_fields[name] = suffix
    else:
        seen_fields[name] = 0
    f.name = f.name + str(suffix)

df2 = spark.createDataFrame(df.rdd, schema_with_renames)
df2.printSchema()
#root
# |-- INFO_CSQ: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- field1: string (nullable = true)
# |    |    |-- field2: string (nullable = true)
# |    |    |-- field21: string (nullable = true)

Now you can either drop or ignore the renamed field field21.
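The renaming logic above can also be factored into a small standalone helper that works on any list of field names (the helper name `dedupe_names` is my own, not from Spark; note that a pre-existing name like "field21" could still collide with a generated one — this is a sketch, not a hardened implementation):

```python
def dedupe_names(names):
    """Append a numeric suffix to every repeated name so the result is unique.

    Mirrors the loop above: the first occurrence keeps its name,
    the n-th duplicate gets the suffix n.
    """
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append(f"{name}{seen[name]}")
        else:
            seen[name] = 0
            out.append(name)
    return out

print(dedupe_names(["field1", "field2", "field2"]))
# ['field1', 'field2', 'field21']
```

Applied to the struct's field names before rebuilding the schema, this produces the same unique names regardless of how many duplicates the input file contains.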

werner