I have a Spark dataframe df
with exactly one column named "My Column Name". It's created by reading in a parquet file.
[edit] The parquet file was created by reading in a CSV file named test.csv containing the following:
My Column Name
test1
test2
and writing it out to a parquet file named test.parquet with the pandas DataFrame.to_parquet method [/edit]
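Roughly, the code that produced test.parquet and loaded it into Spark looked like this (a condensed sketch of the steps above, shown as one session for brevity; I'm assuming the standard pyspark shell where spark is the active SparkSession):
>>> import pandas as pd
>>> # pandas: read the CSV and write it back out as parquet
>>> pd.read_csv("test.csv").to_parquet("test.parquet")
>>> # PySpark: read the parquet file into df
>>> df = spark.read.parquet("test.parquet")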
Calling printSchema() on df prints this:
>>> df.printSchema()
root
|-- My Column Name: string (nullable = true)
I create another dataframe new_df
using withColumnRenamed applied to df:
>>> new_df = df.withColumnRenamed("My Column Name", "my_column_name")
>>> new_df.printSchema()
root
|-- my_column_name: string (nullable = true)
When I try to show the values in `new_df`, I get an error that refers to the old column name:
>>> new_df.show(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 484, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/opt/anaconda3/lib/python3.7/site-packages/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/opt/anaconda3/lib/python3.7/site-packages/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Attribute name "My Column Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.
I've tried multiple other methods for renaming the column (creating a temporary view and selecting the column under an alias, using the alias() function), and they all fail with the same error; sketches of those attempts are below. What am I missing?
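Roughly, the alias-based attempts looked like this (reconstructed sketches, so the exact calls may have differed slightly; the view name df_view is just a placeholder):
>>> # select with a column alias
>>> df.select(df["My Column Name"].alias("my_column_name")).show(2)
>>> # temporary view plus a SQL alias (backticks around the original name)
>>> df.createOrReplaceTempView("df_view")
>>> spark.sql("SELECT `My Column Name` AS my_column_name FROM df_view").show(2)
Both raise the same AnalysisException about "My Column Name" as the withColumnRenamed version above.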