Renaming a column when opening a Parquet file in Pyspark / AWS Glue makes all data null

Question

I have a snappy.parquet file which I need to open as a DataFrame in Spark and then upload to a database.
Two of the column names contain spaces (" ").
Using

emps = glueContext.create_dynamic_frame.from_catalog(database=db_name, table_name=tbl_emps)

gives an error: column names contain invalid characters.
Using

df_emps = spark.read.parquet(file)
for c in df_emps.columns:
   df_emps = df_emps.withColumnRenamed(c, c.replace(" ", ""))
df_emps = spark.read.schema(df_emps.schema).parquet(file)

reads the file and creates the DataFrame, but the two columns that contained spaces are now null.

How can I read this file into a dataframe and retain the content of these fields?

`df_emps = spark.read.schema(df_emps.schema).parquet(file)` looks unnecessary to me. After line 3 the dataframe should be okay. Line 4 nulls those columns because the renamed names in df_emps.schema no longer exist in the file. — mck, Dec 06 '20 at 12:34
An error was encountered: 'Attribute name "First Name" contains invalid character(s) among " ,;{}()\\n\\t=". Please use alias to rename it.;' — NigelLegg, Dec 06 '20 at 14:30
could you show the code for uploading to database? I don't see df_emps being used in your first code snippet. — mck, Dec 06 '20 at 14:32
It's not there because I haven't done it yet, because of the error. I have been using an AWS Glue Dev Endpoint and a Jupyter notebook. I have the above code, then df_emps.show(), which shows a dataframe with the two columns full of nulls. — NigelLegg, Dec 06 '20 at 15:03
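
The effect described in the first comment can be illustrated without Spark. The following is only a toy model in plain Python, not Spark's actual Parquet reader: `read_with_schema` and the sample data are hypothetical, and the sketch assumes that a read with an explicit schema looks up each requested column by name and returns nulls for any name the file does not contain.

```python
def read_with_schema(file_columns, requested_names):
    """Toy by-name read: file_columns maps a column name to its list of values."""
    n_rows = len(next(iter(file_columns.values()))) if file_columns else 0
    # A requested name that is missing from the file yields a column of None.
    return {
        name: file_columns.get(name, [None] * n_rows)
        for name in requested_names
    }


# Hypothetical file contents: one column name contains a space.
file_columns = {"id": [1, 2], "First Name": ["Ann", "Bob"]}

# Renaming only changes the requested names, not the names stored in the file.
renamed = [c.replace(" ", "") for c in file_columns]
result = read_with_schema(file_columns, renamed)
```

Here `id` is found by name and keeps its values, while `FirstName` is not present in the file and comes back as nulls, which matches what the question observes.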

score 0 · Answer 1 · answered Feb 03 '23 at 12:38

Since you are using AWS Glue, you can switch to AWS Glue 4.0, which includes Spark 3.3.0, where this issue was fixed.
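
Once the read itself succeeds, the columns can be renamed in one step before uploading. The following is a minimal sketch, not a definitive implementation: the helper names `sanitize_column_name` and `sanitize_columns` are hypothetical, and the set of invalid characters is taken from the error message quoted in the comments. The Spark lines are left as comments because they assume an existing `spark` session and `file` path.

```python
import re

# Characters the error message lists as invalid: space , ; { } ( ) newline tab =
INVALID_CHARS = re.compile(r"[ ,;{}()\n\t=]")


def sanitize_column_name(name, replacement=""):
    """Return name with every invalid character replaced."""
    return INVALID_CHARS.sub(replacement, name)


def sanitize_columns(columns, replacement=""):
    """Sanitize a list of names and refuse to produce duplicates."""
    cleaned = [sanitize_column_name(c, replacement) for c in columns]
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("sanitizing produced duplicate column names")
    return cleaned


# Assumed usage on a Spark version where the plain read works:
# df_emps = spark.read.parquet(file)
# df_emps = df_emps.toDF(*sanitize_columns(df_emps.columns))
```

Note that the file is read only once here. Re-reading it with the renamed schema is what produced the null columns in the question.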

Andrey