spark.read.schema returns null for DataFrame column values

Question

I have a small question and an issue that I hope Spark gurus can help me with.

I have a parquet file, person.parquet, that has multiple columns and one row. One of the columns, "Middle Name", has a space in its name, which causes an error when Spark writes it to parquet format.

What I have done is rename the column to remove the space, as below:

SourceData = SourceData.withColumnRenamed("Middle Name","MiddleName")

If I try to write SourceData to a parquet file, it still returns an error:

Caused by: org.apache.spark.sql.AnalysisException: Attribute name "Middle Name" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.

So I used the line below, which gets rid of the error:

SourceData = spark.read.schema(SourceData.schema).parquet(TestingPath)

But unfortunately the generated file has null values for the MiddleName column.

Any suggestions on how to solve this issue?

By the way, I have tried the solutions from a similar question here: https://stackoverflow.com/questions/38191157/spark-dataframe-validating-column-names-for-parquet-writes — Moe, Dec 14 '21 at 22:59
I figured out a solution: 1) read the parquet file using pandas instead of Spark, 2) convert it into a Spark DataFrame, 3) rename the "Middle Name" column, which has a space in its name, to "MiddleName". The idea is not to read the parquet file using Spark. — Moe, Dec 15 '21 at 00:39
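
The three steps in the comment above could look roughly like the sketch below. It is only an illustration, not the asker's actual code: the names `sanitize_column_name` and `load_with_clean_names` are hypothetical, the set of invalid characters is copied from the error message in the question, and it assumes pandas (with a parquet engine such as pyarrow) and pyspark are installed.

```python
# Characters Spark rejects in parquet column names, copied from the error message.
INVALID_CHARS = " ,;{}()\n\t="


def sanitize_column_name(name: str) -> str:
    """Drop every character listed in INVALID_CHARS from a column name."""
    return "".join(ch for ch in name if ch not in INVALID_CHARS)


def load_with_clean_names(spark, path):
    """Hypothetical helper that follows the three steps from the comment.

    `spark` is an existing SparkSession and `path` points at the parquet file.
    """
    import pandas as pd  # imported lazily so the name helper works without pandas

    pdf = pd.read_parquet(path)        # 1) read with pandas, not Spark
    sdf = spark.createDataFrame(pdf)   # 2) convert to a Spark DataFrame
    clean = [sanitize_column_name(c) for c in sdf.columns]
    return sdf.toDF(*clean)            # 3) rename, e.g. "Middle Name" -> "MiddleName"
```

With a session available, `load_with_clean_names(spark, "person.parquet").write.parquet(TestingPath)` would be one way to use it. Note that stripping characters can make two different column names collide, and that pandas reads the whole file into driver memory, so this only suits small files like the one-row file described here.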

score 0 · Answer 1 · answered Dec 15 '21 at 01:05


Try quoting the column name with a pair of backticks (`):

`Middle Name`


过过招
