
I want to show the content of a parquet file using Spark SQL, but since the column names in the parquet file contain spaces I am getting this error - Attribute name "First Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

I have written the below code -

val r1 = spark.read.parquet("filepath")
val r2 = r1.toDF()
r2.select(r2("First Name").alias("FirstName")).show()

but I am still getting the same error

Rahul Wagh
  • Does this answer your question? [Spark Dataframe validating column names for parquet writes](https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes) – blackbishop Feb 10 '22 at 19:46

3 Answers


Try renaming the column first instead of aliasing it:

// withColumnRenamed returns a new DataFrame with the column renamed
val r3 = r2.withColumnRenamed("First Name", "FirstName")
r3.show()
Rodney

For anyone still looking for an answer: there is no optimised way to remove spaces from column names when dealing with parquet data.

What can be done is:

  • Change the column names at the source itself, i.e., while creating the parquet data.

OR

  • (NOT OPTIMISED - won't work for huge datasets) Read the parquet file using pandas and rename the columns on the pandas DataFrame. If required, write the DataFrame back to parquet using pandas itself, and then continue with Spark; a rough sketch follows after this list.
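
A minimal sketch of that pandas round-trip. The input path and the space-stripping are placeholders from the question; the output path "filepath_clean" is hypothetical:

import pandas as pd

# Read the parquet file with pandas (single-machine: the whole file must fit in memory)
pdf = pd.read_parquet("filepath")

# Strip the spaces from every column name, e.g. "First Name" -> "FirstName"
pdf.columns = [c.replace(" ", "") for c in pdf.columns]

# Write the cleaned data back out (hypothetical path), then read it with Spark as usual
pdf.to_parquet("filepath_clean")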

PS: With the new pandas API for Spark shipping in PySpark 3.2, using pandas-style code on top of Spark might be much faster and better optimised when dealing with huge datasets.
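
A hypothetical sketch of that pandas-on-Spark route (untested; note that pandas-on-Spark still goes through Spark's parquet reader, so it may hit the same attribute-name check on Spark versions that enforce it on read):

import pyspark.pandas as ps

# pandas-like syntax, but the data stays distributed across the cluster
psdf = ps.read_parquet("filepath")
psdf.columns = [c.replace(" ", "") for c in psdf.columns]

# Convert back to a plain Spark DataFrame once the names are clean
sdf = psdf.to_spark()
sdf.show()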


For anybody struggling with this, the only thing that worked for me was:

# Read once just to get the schema, then rename its columns
base_df = spark.read.parquet(filename)
for c in base_df.columns:
    base_df = base_df.withColumnRenamed(c, c.replace(" ", ""))

# Re-read the file, forcing the cleaned-up schema
df = spark.read.schema(base_df.schema).parquet(filename)

This is from this thread: [Spark Dataframe validating column names for parquet writes](https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes) (scala)

Alias, withColumnRenamed, and "as" SQL select statements wouldn't work; PySpark would still use the old name whenever trying to .show() the DataFrame.

Kentaro
    When I use this, the data in the column is deleted (column is null) – NigelLegg Dec 03 '20 at 16:16
  • This doesn't work. If you inspect df.schema you see it has no reference to the original column names, so when reading, Spark fails to find the columns and hence all values are null. The only solution I have found so far is to read with pandas, rename the columns, and then write it back for Spark. And this might not work if your data is too big for pandas. – s_pike Jul 22 '21 at 15:27