I have been dealing with an issue when writing out a Parquet file in Spark, where the input file is also Parquet and contains some invalid column names. Unfortunately, the column naming convention is out of my hands, which is why I am attempting to replace the non-alphanumeric characters before writing the file back out as Parquet.
Example column names (these pain me):
"Current qty"
"31-60 Total"
"<=30 qty"
I have tried a handful of different methods to rename the DataFrame columns into a valid format:
from pyspark.sql.functions import col

df.withColumnRenamed("Current qty", "current_qty")
df.select(col("31-60 Total").alias("_31_60_total"))
All of these appear to work, based on the printSchema results:
# Schema before change
>>> df.printSchema()
root
|-- Current qty: decimal(38,5) (nullable = true)
|-- <=30 qty: decimal(38,5) (nullable = true)
|-- 31-60 Total: decimal(38,5) (nullable = true)
>>> new_df = df.withColumnRenamed('Current qty','current_qty').withColumnRenamed('<=30 qty','__30_qty').withColumnRenamed('31-60 Total','_31_60_total')
# Schema after change
>>> new_df.printSchema()
root
|-- current_qty: decimal(38,5) (nullable = true)
|-- __30_qty: decimal(38,5) (nullable = true)
|-- _31_60_total: decimal(38,5) (nullable = true)
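Renaming column by column gets tedious with wider files, so the bulk version I am working toward is roughly the following (a minimal sketch; the sanitize helper and its regex are my own, not anything built into Spark, and they happen to reproduce the renames shown above):

import re

def sanitize(name):
    # Lower-case, replace each non-alphanumeric character with "_",
    # and prefix "_" if the result would start with a digit.
    clean = re.sub(r"[^0-9a-zA-Z]", "_", name).lower()
    return "_" + clean if clean[0].isdigit() else clean

# toDF renames every column at once, positionally.
new_df = df.toDF(*[sanitize(c) for c in df.columns])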
The issue starts when I attempt to write this back out with the new column names. For some reason the write appears to reference the original schema's column names, which are invalid, and it throws an exception:
>>> new_df.write.mode('overwrite').parquet(write_path)
20/12/17 13:56:45 ERROR FileFormatWriter: Aborting job 37427d91-be5c-43c5-b3ed-8d57217a0733.
org.apache.spark.sql.AnalysisException: Attribute name "Current qty" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:583)
at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:570)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:449)
at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:449)
at scala.collection.immutable.List.foreach(List.scala:392)
I have cross-checked multiple times that this is not a case of using the wrong DataFrame in the write, and I have also tried it in Scala, to no avail.
Interestingly, this is not an issue with CSV input files, which leads me to believe it is specific to Spark's Parquet internals.
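For reference, here is a sketch of the comparison I ran (csv_path, read_path, and write_path are placeholders, and sanitize is the helper above):

# Works: CSV in -> rename -> Parquet out
csv_df = spark.read.option("header", True).csv(csv_path)
csv_df.toDF(*[sanitize(c) for c in csv_df.columns]).write.mode("overwrite").parquet(write_path)

# Fails with the AnalysisException above: Parquet in -> rename -> Parquet out
pq_df = spark.read.parquet(read_path)
pq_df.toDF(*[sanitize(c) for c in pq_df.columns]).write.mode("overwrite").parquet(write_path)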
I would greatly appreciate any help if anyone has seen or encountered a similar error in the past.
Thanks!