
I am trying to replace some special characters in the column names of my Spark dataframe. For some weird reason, the updated column names show up when I print the schema, but any attempt to access the data results in an error complaining about the old column name. Here is what I am trying:

# Original Schema
upsertDf.columns

# Output: ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in upsertDf.columns:
    new_name = (c.replace(" ", "_")
                 .replace("(", "__").replace(")", "__")
                 .replace("{", "___").replace("}", "___")
                 .replace(",", "____").replace(";", "_____")
                 .replace("=", "_"))
    upsertDf = upsertDf.withColumnRenamed(c, new_name)
upsertDf.columns

# Works and returns expected result
# Output: ['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

# Print contents of dataframe
# Throws error for the original attribute name "col 0"
upsertDf.show()

AnalysisException: 'Attribute name "col 0" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;'

I have tried other options to rename the columns (using alias, etc.) and they all return the same error. It's almost as if the show operation is using a cached version of the schema, but I can't figure out how to force it to use the new names.
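For reference, the alias variant I tried looked roughly like this (sanitize is just the same replacement chain as above, factored into a helper):

from pyspark.sql import functions as F

def sanitize(name):
    # The same replacement chain as in the loop above
    for old, new in [(" ", "_"), ("(", "__"), (")", "__"), ("{", "___"),
                     ("}", "___"), (",", "____"), (";", "_____"), ("=", "_")]:
        name = name.replace(old, new)
    return name

# Fails with the same AnalysisException once show() runs
upsertDf.select([F.col(c).alias(sanitize(c)) for c in upsertDf.columns]).show()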

Has anyone run into this issue before?

jawsnnn

1 Answer


Have a look at this minimal example (using your renaming code, run in a PySpark shell, version 3.3.1):

df = spark.createDataFrame(
    [("test", "test", "test", "test", "test", "test")],
    ['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']
)

df.columns
['col 0', 'col (0)', 'col {0}', 'col =0', 'col, 0', 'col; 0']

for c in df.columns:
    new_name = (c.replace(" ", "_")
                 .replace("(", "__").replace(")", "__")
                 .replace("{", "___").replace("}", "___")
                 .replace(",", "____").replace(";", "_____")
                 .replace("=", "_"))
    df = df.withColumnRenamed(c, new_name)

df.columns
['col_0', 'col___0__', 'col____0___', 'col__0', 'col_____0', 'col______0']

df.show()
+-----+---------+-----------+------+---------+----------+
|col_0|col___0__|col____0___|col__0|col_____0|col______0|
+-----+---------+-----------+------+---------+----------+
| test|     test|       test|  test|     test|      test|
+-----+---------+-----------+------+---------+----------+

As you can see, this executes successfully, so your renaming code itself is fine.

Since you haven't shared all of your code (in particular, how upsertDf is defined), we can't know exactly what's going on. But judging from your error message, it comes from ParquetSchemaConverter.scala in a Spark version earlier than 3.2.0 (the error message changed in 3.2.0; see SPARK-34402).
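If you're not sure which version your job actually runs on, you can check it from the session itself:

print(spark.version)
# In my shell this prints 3.3.1; anything below 3.2.0 still produces
# the exact message you are seeing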

Make sure that you read in your data and then immediately rename the columns, without doing any other operation.
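As a sketch of what I mean (the path here is hypothetical, and the renaming loop is the same one as above):

df = spark.read.parquet("/path/to/your/data")  # hypothetical path

# Rename immediately after the read, before any other transformation or action
for c in df.columns:
    new_name = (c.replace(" ", "_")
                 .replace("(", "__").replace(")", "__")
                 .replace("{", "___").replace("}", "___")
                 .replace(",", "____").replace(";", "_____")
                 .replace("=", "_"))
    df = df.withColumnRenamed(c, new_name)

df.show()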

Koedlt
  • I have been reading up on this the past couple of hours, and I think you are right that this is a parquet reader error, not a Spark error. I should have specified this in my question. I don't see a workaround here, other than not using Parquet since the error happens at read. – jawsnnn Jan 26 '23 at 16:55
  • If you read up on [this SO question](https://stackoverflow.com/questions/45804534/pyspark-org-apache-spark-sql-analysisexception-attribute-name-contains-inv) it seems like renaming your columns should do the trick. Is your `upsertDf` simply the result from `spark.read.parquet`? – Koedlt Jan 26 '23 at 17:51
  • Yes, upsertDf is the result of spark.read.parquet. And I did try renaming the columns, but it looks like that doesn't work, whichever option you choose, when the underlying source of the data is Parquet. From what I understand, the "conversion" of column names happens when the data is read, and that's where the error comes from, so it's pretty much unavoidable. – jawsnnn Jan 30 '23 at 11:23