4

The dataset I'm working on has whitespaces in its columns and I got struck while trying to rename spark dataframe column name. Tried almost all the solutions available in stackoverflow. Nothing seems to work.

Note: The file must be a parquet file.

df.printSchema

root
|-- Type: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ID: string (nullable = true)
|-- Catg Name: string (nullable = true)
|-- Error Msg: string (nullable = true)

df.show()
Error:

warning: there was one deprecation warning; re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Attribute name "Catg Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

Tried:

df.select(df.col("Catg Name").alias("Catg_Name"))    

and then df.printSchema

root
|-- Type: string (nullable = true)
|-- timestamp: string (nullable = true)
|-- ID: string (nullable = true)
|-- Catg_Name: string (nullable = true)
|-- Error_Msg: string (nullable = true)

works well but when I use df.show() it throws the same error.

warning: there was one deprecation warning; re-run with -deprecation for details
org.apache.spark.sql.AnalysisException: Attribute name "Catg Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

  • While doing df.select(df.col("Catg Name").alias("Catg_Name")), I can see you are only selecting but not assigning it back to the dataframe. Can you list the exact steps you tried? – Jijo Jun 20 '18 at 10:55
  • it would be helpful if you show us how you created df – Ramesh Maharjan Jun 20 '18 at 11:00
  • `val df = sqlContext.read.parquet("mypath/1_0_0.parquet") ` `df.printSchema` `val df_new = df.select(df.col("Catg Name").alias("Catg_Name")) ` `df_new.printSchema` df_new: org.apache.spark.sql.DataFrame = [Catg_Name: string] root |-- Catg_Name: string (nullable = true) `df_new.show()` **org.apache.spark.sql.AnalysisException: Attribute name "Catg Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;** – Pradeep Vasamsetty Jun 21 '18 at 06:21
  • @PradeepVasamsetti could you please let me know if you found any solution for this ? I am facing the same issue – prasannads Sep 13 '18 at 07:42
  • @PradeepVasamsetti were you able to solve this issue ? I am stuck with the same issue :( – prasannads Sep 13 '18 at 07:44
  • No man. I gave up on this and changed the schema at DB level :( – Pradeep Vasamsetty Jun 28 '19 at 08:19
  • I am having this issue too. I used this [method](https://stackoverflow.com/a/60208669/5983841) to replace spaces but didn't work. – Rick Apr 07 '21 at 09:22

1 Answers1

3

How about this idea by removing the spaces in column names and reassigning to Dataframe?

val df1 = df.toDF("col 1","col 2","col 3") // Dataframe with spaces in column names

val new_cols =  df1.columns.map(x => x.replaceAll(" ", "")) // new column names array with spaces removed

val df2 = df1.toDF(new_cols : _*) // df2 with new column names(spaces removed)
Praveen L
  • 937
  • 6
  • 13
  • 1
    new_cols: Array[String] = Array(Type, timestamp, ID, CatgName, ErrorMsg) df_new: org.apache.spark.sql.DataFrame = [Type: string ... 4 more fields] **org.apache.spark.sql.AnalysisException: Attribute name "Catg Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;** – Pradeep Vasamsetty Jun 20 '18 at 13:11
  • Are you facing issue if you create your dataframe similar to `df2` as in this solution? df2.show() will have column names without spaces. – Praveen L Jun 20 '18 at 13:24
  • Yes. When I do `df_new.show()`, which is `df2.show()` in your case, throws the error posted above. It still refers to `Catg Name` though I have changed the column name to `CatgName`. – Pradeep Vasamsetty Jun 20 '18 at 13:33
  • Check the Array[String] new_cols data. It should have spaces replaced. Comment your new_cols array code line – Praveen L Jun 21 '18 at 05:13
  • new_cols: Array[String] = Array(Type, timestamp, ID, CatgName, ErrorMsg) `df2.show()` org.apache.spark.sql.AnalysisException: Attribute name "Catg Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.; – Pradeep Vasamsetty Jun 21 '18 at 06:28
  • The above code doesn't work. It appears to be an error in Spark read Parquet API. You can rename the columns, but, when you do a show, Spark still goes back to the older column names. – add787 Jun 13 '19 at 02:25