0

I have a piece of pyspark code the converts a dataframe into a physical table:

df.write.mode('overwrite').saveAsTable('sometablename')

In case the dataframe, df, contains columns which have spaces in their names it fails with the following error:

18/03/08 10:33:29 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table pivot_up_spaces_Export_Data_4
org.apache.spark.sql.AnalysisException: Attribute name "SUM_count_col umn" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
        at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkConversionRequirement(ParquetSchemaConverter.scala:581)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter$.checkFieldName(ParquetSchemaConverter.scala:567)

when I use registerTempTable on the same table, things work fine:

df.registerTempTable('sometablename')

However, I in spark-sql, I am able to create tables which have spaces in the column names. Is there any way I can get around this situation in pyspark ?

I am running this on a EMR 5.10.0 cluster which internally uses Spark 2.2.0.

neves
  • 33,186
  • 27
  • 159
  • 192
Sid
  • 77
  • 1
  • 9
  • 2
    Do you need to keep the spaces? Are you open to renaming the columns? – pault Mar 08 '18 at 15:15
  • As @pault said, is there any reason why you need the spaces? Because you would normally not be able to create a table with spaces in any database system. My guess is that `registerTempTable` works because everything is stored in memory. – mkaran Mar 08 '18 at 15:36
  • The table schema will be defined by the end user and hence the ask. – Sid Mar 09 '18 at 04:26
  • 1
    Possible duplicate of [Spark Dataframe validating column names for parquet writes (scala)](https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes-scala) – Eugene Lopatkin Jun 04 '18 at 07:02
  • SQL Server (T-SQL) allows for columns to be created with spaced. If the column names are surrounded by double quotes or brackets, the spaces are allowed to remain without errors. As it is not uncommon to find column names in a CSV or Excel sheet that contains spaces, knowing how to work this seems important. The pyspark solution above from Eugene is a great way to go. – Jamie May 10 '22 at 13:16

0 Answers0