
I am using PySpark 2.4.3 and I have a dataframe that I wish to write to Parquet, but the column names have spaces, such as Hour of day.

df = spark.read.csv("file.csv", header=True)
df.write.parquet('input-parquet/')

I am getting this error currently:

An error occurred while calling o425.parquet.
: org.apache.spark.sql.AnalysisException: Attribute name "Hour of day" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

How can I either rename the columns or give them aliases to be able to write to Parquet?

crystyxn

1 Answer


You can rename the column with the withColumnRenamed(existing, new) method and then write to Parquet. Note that withColumnRenamed returns a new DataFrame rather than modifying the original in place, so reassign the result:

df = df.withColumnRenamed('Hour of day', 'Hour')
Bitswazsky