I am new to Spark and ran into a problem when appending new data to a partition. My pipeline ingests daily CSVs into Azure Data Lake (basically HDFS) using Databricks; I also run some simple transformations on the data, remove duplicates, etc. However, I noticed that the inferSchema=True option is not always reliable and sometimes produces inconsistent schemas across the partitioned files (a stripped-down version of the daily job is shown after the error message). When I then go to read all the files:
df = sqlContext.read.parquet("path/to/directory")
I am hit with the following error:
Parquet column cannot be converted in file path/to/directory/file
Column: [Ndc], Expected: LongType, Found: BINARY
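For reference, a stripped-down version of the daily job looks roughly like this (the landing path, the date, and the partition column name are placeholders; spark is the SparkSession that Databricks provides):

from pyspark.sql import functions as F

# read one day's CSV -- I suspect inferSchema is where the schemas start to drift
daily_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("path/to/landing/2019-01-01.csv")
)

# simple clean-up / de-duplication
daily_df = daily_df.dropDuplicates()

# append the day's data to the partitioned parquet directory
(
    daily_df
    .withColumn("load_date", F.lit("2019-01-01"))
    .write
    .mode("append")
    .partitionBy("load_date")
    .parquet("path/to/directory")
)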
I have a ton of partitioned files, and going through each one to check whether the schemas match and fixing them individually is probably not efficient. Is there an easy way to enforce a schema that all the files will be converted to, or do you literally have to iterate through each parquet file and change its schema?
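To make the question concrete: by "enforce a schema" I mean something like defining the types up front instead of inferring them, e.g. (Ndc is the only real column name, taken from the error above; the rest of the schema is just illustrative):

from pyspark.sql.types import StructType, StructField, LongType, StringType

# illustrative schema -- use LongType or StringType, whichever Ndc should really be
explicit_schema = StructType([
    StructField("Ndc", LongType(), True),
    # ... remaining columns ...
])

daily_df = (
    spark.read
    .option("header", "true")
    .schema(explicit_schema)   # instead of inferSchema
    .csv("path/to/landing/2019-01-02.csv")
)

But I am not sure whether that helps with the parquet files that were already written with the wrong type, which is really what I am asking about.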
Using Spark 2.3.1
Thanks.