
I'm trying to import a parquet file in Databricks (pyspark) and keep getting the following error:

df = spark.read.parquet(inputFilePath)

AnalysisException: Column name "('my data (beta)', "Meas'd Qty")" contains invalid character(s). Please use alias to rename it.

I tried the suggestions in related posts, using .withColumnRenamed, and also using alias like this:

from pyspark.sql.functions import col
(spark.read.parquet(inputFilePath)).select(col("('my data (beta)', \"Meas'd Qty\")").alias("col")).show()

but I always get the same error. How do I go through each column and replace any invalid characters with an underscore (_), or even just delete all invalid characters?

Medulla Oblongata

1 Answer


How was the original file generated? It was saved with column names that Spark does not allow.

It is better to fix this issue at the source, when the file is generated.

A few approaches you can try in Spark to resolve this:

  1. In the select statement, wrap the column name in backticks (``), like
(spark.read.parquet(inputFilePath)).select(col("`('my data (beta)', \"Meas'd Qty\")`").alias("col")).show()
  2. Rename the columns using toDF. You need to pass all of the output column names, in order, as positional arguments (a programmatic variant is sketched below):
(spark.read.parquet(inputFilePath)).toDF("col_a", "col_b", ...).show()
  3. Read the file using pyarrow, fix the column names, and save the result. After that, read the cleaned file using pyspark and continue with your tasks (a sketch also follows below).
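
As a programmatic variant of approach 2, here is a minimal sketch that renames every column by substituting an underscore for each character Spark rejects, rather than listing names by hand. It assumes the initial read succeeds; the question's traceback shows the check firing at read time, so this may only help on Spark versions where the name check is deferred. The character class in the regex is the set Spark's parquet writer traditionally rejects (" ,;{}()\n\t=").

import re

# spark and inputFilePath as in the question (Databricks provides spark).
df = spark.read.parquet(inputFilePath)

# Replace each character Spark disallows in parquet column names with "_".
clean_names = [re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in df.columns]

# toDF takes the new names as positional arguments, one per column.
df = df.toDF(*clean_names)
df.show()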
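
For approach 3, here is a minimal sketch using pyarrow. It assumes the file is readable from the driver (on Databricks that typically means a /dbfs/... style path) and uses cleaned_path as a hypothetical destination of your choosing:

import re
import pyarrow.parquet as pq

# pyarrow loads the file regardless of the characters in the column names.
table = pq.read_table(inputFilePath)

# Substitute "_" for the characters Spark rejects in parquet column names.
new_names = [re.sub(r"[ ,;{}()\n\t=]", "_", name) for name in table.column_names]
table = table.rename_columns(new_names)

# Write a cleaned copy, then read that with Spark as usual.
pq.write_table(table, cleaned_path)  # cleaned_path is a hypothetical output path
df = spark.read.parquet(cleaned_path)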
Rahul Kumar
  • the original files are from a database and I don't control the formatting ... 1. I get an invalid syntax error for the backticks, and 2. an invalid syntax error when I put the column name in `""` – Medulla Oblongata Dec 16 '21 at 00:19
  • What exactly are the column names in the source? – Rahul Kumar Dec 16 '21 at 00:23
  • it's actually a multi index - when I import the parquet file outside databricks, I can see the column names: `MultiIndex([('my data (alpha)', 'Meas'd Qty'), ('my data (alpha)', 'Sch Qty'), ('my data (beta)', 'Meas'd Qty'), ...` – Medulla Oblongata Dec 16 '21 at 01:05
  • Approach #3, Try reading the file using pyarrow and refactor the columns and save the result. After that read using pyspark and continue with your tasks. – Rahul Kumar Dec 16 '21 at 01:10
  • thanks, I'll try that. It's disappointing though that databricks can't parse multi index parquet files – Medulla Oblongata Dec 16 '21 at 01:32