
I have a column in my dataframe called "INFO_CSQ". I want to remove the duplicated fields inside its struct, because the duplicates mean I can't use df.select("INFO_CSQ.xxx") — it fails with an ambiguous reference error.

If you need more information, please feel free to ask; I will answer as soon as possible.

Edit: I have seen many solutions that rename fields, but all of the ones I looked at type the schema manually, e.g. strSchema = "array<struct<a_renamed:string,b:bigint,c:bigint>>", and then cast to a new dataframe. However, my schema changes depending on the input file.
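Since the schema varies per input file, a DDL string like the one above can be generated from the schema's field name/type pairs instead of typed by hand. A minimal sketch (the helper name `build_struct_ddl` is hypothetical, not from any library):

```python
def build_struct_ddl(fields):
    """Build an array<struct<...>> DDL string from (name, type) pairs.

    fields: list of (field_name, spark_type_name) tuples, e.g. taken
    from a StructType's fields after renaming duplicates.
    """
    inner = ",".join(f"{name}:{dtype}" for name, dtype in fields)
    return f"array<struct<{inner}>>"

print(build_struct_ddl([("a_renamed", "string"), ("b", "bigint"), ("c", "bigint")]))
# array<struct<a_renamed:string,b:bigint,c:bigint>>
```

The resulting string could then be used in a cast, the same way the manually typed strSchema is used in the solutions mentioned above.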

[screenshots: printSchema output showing the duplicated "TSL" field inside the INFO_CSQ struct]

phumidea
  • How have you created the dataframe to get the "TSL" column as a duplicate? – Vivs May 16 '21 at 08:24
  • It seems like you have two TSL columns. If they are duplicates, drop one; if they aren't, rename one. – crissal May 16 '21 at 08:41
  • @Vivs I use the glow library to read a VCF file as a dataframe, and the original VCF file contains the duplicate. https://glow.readthedocs.io/en/latest/etl/variant-data.html – phumidea May 16 '21 at 08:44
  • Based on your input, if the column names are exactly the same, you will need to manually specify the schema and skip the first row to avoid the headers while reading the VCF file. – Vivs May 16 '21 at 08:49
  • @crissal Both are duplicated. I tried to drop one, but I'm new to pyspark and am still looking for the right command. – phumidea May 16 '21 at 09:11
  • @Vivs I don't think [Allele, Consequence, ...] are top-level columns, so I can't rename them easily — they are inside a struct. I think it would be easier to extract them into a new dataframe. Also, glow is the library that reads the VCF file, and I don't see any parameter to skip the header, but I will try to follow your comment. – phumidea May 16 '21 at 09:22

1 Answer


You can transform the dataframe into an RDD and then back to a dataframe. When re-creating the dataframe, you can supply a schema where the column names are unique.

I use a simplified example where the field name field2 is not unique:

df = ...
df.printSchema()
#root
# |-- INFO_CSQ: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- field1: string (nullable = true)
# |    |    |-- field2: string (nullable = true)
# |    |    |-- field2: string (nullable = true)

import copy

# work on a deep copy so the original dataframe's schema stays untouched
schema_with_renames = copy.deepcopy(df.schema)
seen_fields = {}
# iterate over all struct fields and add a numeric suffix where necessary
# (schema_with_renames[0] is the INFO_CSQ column)
for f in schema_with_renames[0].dataType.elementType.fields:
    name = f.name
    suffix = ""
    if name in seen_fields:
        suffix = seen_fields[name] + 1
        seen_fields[name] = suffix
    else:
        seen_fields[name] = 0
    f.name = f.name + str(suffix)

df2 = spark.createDataFrame(df.rdd, schema_with_renames)
df2.printSchema()
#root
# |-- INFO_CSQ: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- field1: string (nullable = true)
# |    |    |-- field2: string (nullable = true)
# |    |    |-- field21: string (nullable = true)

Now you can either drop or ignore the renamed field field21.
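The renaming logic above can also be factored into a small standalone helper that works on any list of field names (the helper name `dedupe_names` is my own, not from Spark; note that a pre-existing name like "field21" could still collide with a generated one — this is a sketch, not a hardened implementation):

```python
def dedupe_names(names):
    """Append a numeric suffix to every repeated name so the result is unique.

    Mirrors the loop above: the first occurrence keeps its name,
    the n-th duplicate gets the suffix n.
    """
    seen = {}
    out = []
    for name in names:
        if name in seen:
            seen[name] += 1
            out.append(f"{name}{seen[name]}")
        else:
            seen[name] = 0
            out.append(name)
    return out

print(dedupe_names(["field1", "field2", "field2"]))
# ['field1', 'field2', 'field21']
```

Applied to the struct's field names before rebuilding the schema, this produces the same unique names regardless of how many duplicates the input file contains.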

werner