I have been dealing with an issue when writing out a Parquet file in Spark, where the input file is also Parquet and contains some invalid column names. Unfortunately, the column naming convention is out of my hands, which is why I am attempting to replace the non-alphanumeric characters before writing the file back out as Parquet.
Example column names (these pain me):
"Current qty"
"31-60 Total"
"<=30 qty"
I have tried a handful of different methods to rename the DataFrame columns into a valid format:
from pyspark.sql.functions import col

df.withColumnRenamed("Current qty", "current_qty")
df.select(col("31-60 Total").alias("_31_60_total"))
All of these appear to work, based on the printSchema results:
# Schema before change
>>> df.printSchema()
root
|-- Current qty: decimal(38,5) (nullable = true)
|-- <=30 qty: decimal(38,5) (nullable = true)
|-- 31-60 Total: decimal(38,5) (nullable = true)
>>> new_df = df.withColumnRenamed('Current qty','current_qty').withColumnRenamed('<=30 qty','__30_qty').withColumnRenamed('31-60 Total','_31_60_total')
# Schema after change
>>> new_df.printSchema()
root
|-- current_qty: decimal(38,5) (nullable = true)
|-- __30_qty: decimal(38,5) (nullable = true)
|-- _31_60_total: decimal(38,5) (nullable = true)
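Renaming column by column gets tedious with wider files, so the bulk version I am working toward is roughly the following (a minimal sketch; the sanitize helper and its regex are my own, not anything built into Spark, and they happen to reproduce the renames shown above):

import re

def sanitize(name):
    # Lower-case, replace each non-alphanumeric character with "_",
    # and prefix "_" if the result would start with a digit.
    clean = re.sub(r"[^0-9a-zA-Z]", "_", name).lower()
    return "_" + clean if clean[0].isdigit() else clean

# toDF renames every column at once, positionally.
new_df = df.toDF(*[sanitize(c) for c in df.columns])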
The issue starts when I attempt to write this back out with the new column names. For some reason the write appears to reference the original schema's column names, which are invalid, and it throws an exception:
>>> new_df.write.mode('overwrite').parquet(write_path)
20/12/17 13:56:45 ERROR FileFormatWriter: Aborting job 37427d91-be5c-43c5-b3ed-8d57217a0733.
org.apache.spark.sql.AnalysisException: Attribute name "Current qty" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:449)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:449)
at scala.collection.immutable.List.foreach(List.scala:392)
I have cross-checked multiple times that this is not a case of using the wrong DataFrame in the write, and I have also tried it in Scala, to no avail.
Interestingly, this is not an issue with CSV input files, which leads me to believe it is specific to Spark's Parquet internals.
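For reference, here is a sketch of the comparison I ran (csv_path, read_path, and write_path are placeholders, and sanitize is the helper above):

# Works: CSV in -> rename -> Parquet out
csv_df = spark.read.option("header", True).csv(csv_path)
csv_df.toDF(*[sanitize(c) for c in csv_df.columns]).write.mode("overwrite").parquet(write_path)

# Fails with the AnalysisException above: Parquet in -> rename -> Parquet out
pq_df = spark.read.parquet(read_path)
pq_df.toDF(*[sanitize(c) for c in pq_df.columns]).write.mode("overwrite").parquet(write_path)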
I would greatly appreciate any help if anyone has seen or encountered a similar error in the past.
Thanks!