
I want to show the content of a parquet file using Spark SQL, but since the column names in the parquet file contain spaces I am getting this error - Attribute name "First Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

I have written the below code -

val r1 = spark.read.parquet("filepath")
val r2 = r1.toDF()
r2.select(r2("First Name").alias("FirstName")).show()

but I am still getting the same error

Rahul Wagh
  • Does this answer your question? [Spark Dataframe validating column names for parquet writes](https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes) – blackbishop Feb 10 '22 at 19:46

3 Answers


Try renaming the column first instead of aliasing it:

// withColumnRenamed returns a new DataFrame with the column renamed
val r3 = r2.withColumnRenamed("First Name", "FirstName")
r3.show()
Rodney

For anyone still looking for an answer: there is no optimised way to remove spaces from column names when dealing with parquet data.

What can be done is:

  • Change the column names at the source itself, i.e., while creating the parquet data.

OR

  • (NOT OPTIMISED - won't work for huge datasets) Read the parquet file using pandas and rename the columns on the pandas DataFrame. If required, write the DataFrame back to parquet using pandas itself, and then continue with Spark; a rough sketch follows after this list.
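
A minimal sketch of that pandas round-trip. The input path and the space-stripping are placeholders from the question; the output path "filepath_clean" is hypothetical:

import pandas as pd

# Read the parquet file with pandas (single-machine: the whole file must fit in memory)
pdf = pd.read_parquet("filepath")

# Strip the spaces from every column name, e.g. "First Name" -> "FirstName"
pdf.columns = [c.replace(" ", "") for c in pdf.columns]

# Write the cleaned data back out (hypothetical path), then read it with Spark as usual
pdf.to_parquet("filepath_clean")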

PS: With the new pandas API for Spark shipping in PySpark 3.2, using pandas-style code on top of Spark might be much faster and better optimised when dealing with huge datasets.
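
A hypothetical sketch of that pandas-on-Spark route (untested; note that pandas-on-Spark still goes through Spark's parquet reader, so it may hit the same attribute-name check on Spark versions that enforce it on read):

import pyspark.pandas as ps

# pandas-like syntax, but the data stays distributed across the cluster
psdf = ps.read_parquet("filepath")
psdf.columns = [c.replace(" ", "") for c in psdf.columns]

# Convert back to a plain Spark DataFrame once the names are clean
sdf = psdf.to_spark()
sdf.show()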


For anybody struggling with this, the only thing that worked for me was:

# Read once just to get the schema, then rename its columns
base_df = spark.read.parquet(filename)
for c in base_df.columns:
    base_df = base_df.withColumnRenamed(c, c.replace(" ", ""))

# Re-read the file, forcing the cleaned-up schema
df = spark.read.schema(base_df.schema).parquet(filename)

This is from this thread: [Spark Dataframe validating column names for parquet writes](https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes) (scala)

Alias, withColumnRenamed, and "as" SQL select statements wouldn't work; PySpark would still use the old name whenever trying to .show() the DataFrame.

Kentaro
    When I use this, the data in the column is deleted (column is null) – NigelLegg Dec 03 '20 at 16:16
  • This doesn't work. If you inspect df.schema you see it has no reference to the original column names, so when reading, Spark fails to find the columns and hence all values are null. The only solution I have found so far is to read with pandas, rename the columns, and then write it back for Spark. And this might not work if your data is too big for pandas. – s_pike Jul 22 '21 at 15:27