
I have Parquet files which I have read using the following Spark command:

lazy val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")

Many of the column names contain the special character "(", e.g. WA_0_DWHRPD_Purge_Date_(TOD) and WA_0_DWHRRT_Record_Type_(80=Index). How can I remove this special character?

My end goal is to remove these special characters and write the Parquet files back using the following command:

df_hive.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")

Also, I am using the Scala Spark shell. I am new to Spark; I saw similar questions, but nothing has worked in my case. Any help is appreciated.

Ishan Tiwary

1 Answer


The first thing to do is read the Parquet files into a DataFrame, as you are already doing:

val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")

Once you have created the DataFrame, fetch its schema and walk through it, rewriting each field name to strip out the special characters:

import org.apache.spark.sql.types.{StructField, StructType}

// Rebuild the schema, stripping the special characters from each field name
val schema = StructType(out.schema.map(x =>
  StructField(
    x.name.toLowerCase().replace(" ", "_").replace("#", "").replace("-", "_").replace(")", "").replace("(", "").trim(),
    x.dataType, x.nullable)))
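
If you expect other special characters besides these, a single regex can cover them all. A minimal sketch of the same idea, assuming every run of non-alphanumeric characters should collapse into one underscore (cleanedSchema is an illustrative name):

import org.apache.spark.sql.types.{StructField, StructType}

// Replace every run of non-alphanumeric characters with a single
// underscore, then drop any trailing underscore left at the end
val cleanedSchema = StructType(out.schema.map(x =>
  StructField(x.name.replaceAll("[^A-Za-z0-9]+", "_").stripSuffix("_").toLowerCase,
    x.dataType, x.nullable)))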

Now you can read the data back from the Parquet files, specifying the schema you have just created:

val newDF = spark.read.format("parquet").schema(schema).load("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
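
A quick sanity check before writing, to confirm the cleaned names were picked up:

// The cleaned, lowercase column names should show up here
newDF.printSchema()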

Now you can go ahead and save the DataFrame with the cleaned column names:

newDF.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")
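
As a side note, re-reading the files is not strictly necessary: the cleaned names can also be applied directly to the DataFrame you already have. A sketch under that assumption, reusing the schema built above (renamed is an illustrative name):

// toDF(colNames: _*) returns a copy of the DataFrame with its columns renamed
val renamed = out.toDF(schema.fieldNames: _*)
renamed.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")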
Nikunj Kakadiya