
I have Parquet files which I have read using the following Spark command:

lazy val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")

Many of the column names contain the special character "(", e.g. WA_0_DWHRPD_Purge_Date_(TOD) and WA_0_DWHRRT_Record_Type_(80=Index). How can I remove this special character?

My end goal is to remove these special characters and write the Parquet files back using the following command:

df_hive.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")

Also, I am using the Scala Spark shell. I am new to Spark; I saw similar questions, but nothing has worked in my case. Any help is appreciated.

Ishan Tiwary

1 Answer


The first thing to do is read the Parquet files into a DataFrame, as you are already doing:

val out = spark.read.parquet("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")

Once you have created the DataFrame, fetch its schema and walk through it, rewriting each field name to strip out the special characters:

import org.apache.spark.sql.types.{StructField, StructType}

// Rebuild the schema, stripping the special characters from each field name
val schema = StructType(out.schema.map(x =>
  StructField(
    x.name.toLowerCase().replace(" ", "_").replace("#", "").replace("-", "_").replace(")", "").replace("(", "").trim(),
    x.dataType, x.nullable)))
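
If you expect other special characters besides these, a single regex can cover them all. A minimal sketch of the same idea, assuming every run of non-alphanumeric characters should collapse into one underscore (cleanedSchema is an illustrative name):

import org.apache.spark.sql.types.{StructField, StructType}

// Replace every run of non-alphanumeric characters with a single
// underscore, then drop any trailing underscore left at the end
val cleanedSchema = StructType(out.schema.map(x =>
  StructField(x.name.replaceAll("[^A-Za-z0-9]+", "_").stripSuffix("_").toLowerCase,
    x.dataType, x.nullable)))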

Now you can read the data back from the Parquet files, specifying the schema you have just created:

val newDF = spark.read.format("parquet").schema(schema).load("/tmp/oip/logprint_poc/feb28eb24ffe44cab60f2832a98795b1.parquet")
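
A quick sanity check before writing, to confirm the cleaned names were picked up:

// The cleaned, lowercase column names should show up here
newDF.printSchema()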

Now you can go ahead and save the DataFrame with the cleaned column names:

newDF.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")
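
As a side note, re-reading the files is not strictly necessary: the cleaned names can also be applied directly to the DataFrame you already have. A sketch under that assumption, reusing the schema built above (renamed is an illustrative name):

// toDF(colNames: _*) returns a copy of the DataFrame with its columns renamed
val renamed = out.toDF(schema.fieldNames: _*)
renamed.write.format("parquet").save("hdfs:///tmp/oip/logprint_poc_cleaned/")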
Nikunj Kakadiya