
I am new to Spark and I ran into a problem when appending new data to a partition. My pipeline ingests daily CSVs into Azure Data Lake (essentially HDFS) using Databricks. I also run some simple transformations on the data, remove duplicates, etc. However, I noticed that the inferSchema=True option is not always reliable and sometimes creates schema inconsistencies between the partitioned files. When I then go to read all the files:

df = sqlContext.read.parquet("path/to/directory")

I am hit with a:

Parquet column cannot be converted in file path/to/directory/file
Column: [Ndc], Expected: LongType, Found: BINARY

I have a ton of partitioned files, and going through each one to check whether the schema matches and fixing it is probably not efficient. Is there an easy way to enforce a schema that all the files will be converted to, or do you literally have to iterate through each Parquet file and change its schema?
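Roughly, the daily ingestion step looks like this (the paths are placeholders and the options are simplified):

    # Simplified sketch of the daily ingestion; paths and options are placeholders.
    daily = (sqlContext.read
             .option("header", "true")
             .option("inferSchema", "true")  # schema inferred per day, which is where types can diverge
             .csv("path/to/daily.csv"))

    # Simple transformations / dedup happen here.
    daily = daily.dropDuplicates()

    # Each day's batch is appended as Parquet under the directory that is read back above.
    daily.write.mode("append").parquet("path/to/directory")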

Using Spark 2.3.1

Thanks.

Aaron Arima
1 Answer


You can try two options:

  1. You "mergeSchema" option to merge two files with different schema https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#schema-merging

  2. Loop through each individual file, infer the schema when reading, explicitly cast it to a common schema, and write the result back to another location (see the second sketch below).
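A minimal sketch of option 1, using the asker's sqlContext and the same hypothetical directory; mergeSchema is a standard Parquet read option (see the link above):

    # Merge the schemas of all Parquet files under the directory while reading.
    df = sqlContext.read.option("mergeSchema", "true").parquet("path/to/directory")

And a rough sketch of option 2. The target_types mapping and the output path are hypothetical, and dbutils.fs.ls is the file-listing utility available in Databricks notebooks; use your own listing mechanism elsewhere:

    from pyspark.sql.functions import col
    from pyspark.sql.types import LongType

    # Hypothetical common schema: only the columns whose types need enforcing.
    target_types = {"Ndc": LongType()}

    # List the individual Parquet files under the top-level directory.
    paths = [f.path for f in dbutils.fs.ls("path/to/directory")]

    for p in paths:
        part = sqlContext.read.parquet(p)
        # Cast any mismatched column to the agreed type.
        for name, dtype in target_types.items():
            if name in part.columns:
                part = part.withColumn(name, col(name).cast(dtype))
        # Write to a separate location so the original files stay untouched.
        file_name = p.rstrip("/").split("/")[-1]
        part.write.mode("overwrite").parquet("path/to/fixed/" + file_name)

Reading the new location afterwards should then return a single consistent schema.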

Manoj Singh