
I'm trying to import a parquet file in Databricks (pyspark) and keep getting the following error:

df = spark.read.parquet(inputFilePath)

AnalysisException: Column name "('my data (beta)', "Meas'd Qty")" contains invalid character(s). Please use alias to rename it.

I tried the suggestions in related posts, using .withColumnRenamed, and also using alias like this:

from pyspark.sql.functions import col
(spark.read.parquet(inputFilePath)).select(col("('my data (beta)', \"Meas'd Qty\")").alias("col")).show()

but I always get the same error. How do I go through each column and replace any invalid characters with an underscore (_), or even just delete all invalid characters?

Medulla Oblongata

1 Answer


How was the original file generated? It was saved with column names that Spark does not allow.

It is better to fix this issue at the source, when the file is generated.

A few approaches you can try in Spark to resolve this:

  1. In the select statement, wrap the column name in backticks (``), like
(spark.read.parquet(inputFilePath)).select(col("`('my data (beta)', \"Meas'd Qty\")`").alias("col")).show()
  2. Rename the columns using toDF. You need to pass all of the output column names, in order, as positional arguments (a programmatic variant is sketched below):
(spark.read.parquet(inputFilePath)).toDF("col_a", "col_b", ...).show()
  3. Read the file using pyarrow, fix the column names, and save the result. After that, read the cleaned file using pyspark and continue with your tasks (a sketch also follows below).
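
As a programmatic variant of approach 2, here is a minimal sketch that renames every column by substituting an underscore for each character Spark rejects, rather than listing names by hand. It assumes the initial read succeeds; the question's traceback shows the check firing at read time, so this may only help on Spark versions where the name check is deferred. The character class in the regex is the set Spark's parquet writer traditionally rejects (" ,;{}()\n\t=").

import re

# spark and inputFilePath as in the question (Databricks provides spark).
df = spark.read.parquet(inputFilePath)

# Replace each character Spark disallows in parquet column names with "_".
clean_names = [re.sub(r"[ ,;{}()\n\t=]", "_", c) for c in df.columns]

# toDF takes the new names as positional arguments, one per column.
df = df.toDF(*clean_names)
df.show()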
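
For approach 3, here is a minimal sketch using pyarrow. It assumes the file is readable from the driver (on Databricks that typically means a /dbfs/... style path) and uses cleaned_path as a hypothetical destination of your choosing:

import re
import pyarrow.parquet as pq

# pyarrow loads the file regardless of the characters in the column names.
table = pq.read_table(inputFilePath)

# Substitute "_" for the characters Spark rejects in parquet column names.
new_names = [re.sub(r"[ ,;{}()\n\t=]", "_", name) for name in table.column_names]
table = table.rename_columns(new_names)

# Write a cleaned copy, then read that with Spark as usual.
pq.write_table(table, cleaned_path)  # cleaned_path is a hypothetical output path
df = spark.read.parquet(cleaned_path)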
Rahul Kumar
  • the original files are from a database and I don't control the formatting ... 1. I get an invalid syntax error for the backticks, and 2. an invalid syntax error when I put the column name in `""` – Medulla Oblongata Dec 16 '21 at 00:19
  • What exactly are the column names in the source? – Rahul Kumar Dec 16 '21 at 00:23
  • it's actually a multi index - when I import the parquet file outside databricks, I can see the column names: `MultiIndex([('my data (alpha)', 'Meas'd Qty'), ('my data (alpha)', 'Sch Qty'), ('my data (beta)', 'Meas'd Qty'), ...` – Medulla Oblongata Dec 16 '21 at 01:05
  • Approach #3, Try reading the file using pyarrow and refactor the columns and save the result. After that read using pyspark and continue with your tasks. – Rahul Kumar Dec 16 '21 at 01:10
  • thanks, I'll try that. It's disappointing though that databricks can't parse multi index parquet files – Medulla Oblongata Dec 16 '21 at 01:32