I have a Spark dataframe df with exactly one column named "My Column Name". It's created by reading in a parquet file.

[edit] The parquet file was created by reading in a CSV file named test.csv containing the following:

My Column Name
test1
test2

and writing it out to a parquet file named test.parquet using pandas' DataFrame.to_parquet("test.parquet") [/edit]

The printSchema function returns this:

>>> df.printSchema()
root
 |-- My Column Name: string (nullable = true)

I create another dataframe new_df using withColumnRenamed applied to df:

>>> new_df = df.withColumnRenamed("My Column Name", "my_column_name")
>>> new_df.printSchema()
root
 |-- my_column_name: string (nullable = true)

When I try to show the values in ```new_df``` I get an error that refers to the old column name:

>>> new_df.show(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 484, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/opt/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Attribute name "My Column Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.

I've tried multiple other methods for renaming the column (creating a temporary view and selecting the column with an alias, using the alias() function) and all lead to the same result. What am I missing?
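
The exception message above lists the characters Spark rejects in the attribute name: space, comma, semicolon, braces, parentheses, newline, tab and the equals sign. One possible workaround, sketched here under the assumption that the names can be changed before the parquet file is written, is to clean them on the pandas side. The helper names `sanitize_column_name` and `sanitize_columns` are hypothetical and not part of pandas or PySpark; the replacement character and the lowercasing are arbitrary choices.

```python
import re

# Characters named in the AnalysisException message: " ,;{}()\n\t="
_INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name: str, replacement: str = "_") -> str:
    """Hypothetical helper: swap each rejected character and lowercase the result."""
    return _INVALID_CHARS.sub(replacement, name).lower()


def sanitize_columns(names):
    """Apply the same cleanup to every column name in a list."""
    return [sanitize_column_name(n) for n in names]


print(sanitize_columns(["My Column Name"]))  # prints ['my_column_name']
```

On the pandas side the cleaned list could be assigned to the frame's `columns` attribute before calling `to_parquet`, so that the file Spark reads no longer carries a name containing a space. This does not explain why the rename fails in Spark; it only sidesteps the check.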

Sky McKinley
  • If possible, could you post a sample of the data so we can reproduce the behaviour? Anyway, are you able to call show on df before renaming? – chateaur May 19 '22 at 07:15
  • I cannot reproduce the problem either. The following line perfectly shows the dataframe: ```spark.createDataFrame([], '`My Column Name` string').withColumnRenamed("My Column Name", "my_column_name").show()``` – ZygD May 19 '22 at 08:06
  • @chateaur I edited my post to include how I created the parquet file. Calling show on the df produces the same error as in my post. – Sky McKinley May 19 '22 at 15:03
  • It happens in the AWS Glue notebook – Vijay Anand Pandian Oct 04 '22 at 13:37

1 Answer

As mentioned in a related SO question, you must assign the renamed column back to the original df:
df = df.withColumnRenamed("My Column Name", "my_column_name")
instead of creating a new_df.

chateaur
  • Assigning the result of the withColumnRenamed method to *any* variable would solve the problem if that were the case. There's nothing magical about reassigning it back to the same dataframe. – Sky McKinley May 24 '22 at 15:26
  • I believe there is: "This solution works for me, just need to repeat for all columns that have a space. Important to assign back to the variable: df, without that it continues to fail with the same error. – ktang Sep 17, 2020 at 11:54" – chateaur May 30 '22 at 15:47