
I am new to Spark and I ran into a problem when appending new data to a partition. My pipeline ingests daily CSVs into Azure Data Lake (essentially HDFS) using Databricks. I also run some simple transformations on the data, remove duplicates, etc. However, I noticed that the inferSchema=True option is not always reliable and sometimes creates schema inconsistencies between the partitioned files. When I then go to read all the files:

df = sqlContext.read.parquet("path/to/directory")

I am hit with a:

Parquet column cannot be converted in file path/to/directory/file
Column: [Ndc], Expected: LongType, Found: BINARY

I have a ton of partitioned files, and going through each one to check whether the schema matches and fixing it is probably not efficient. Is there an easy way to enforce a schema that all the files will be converted to, or do you literally have to iterate through each Parquet file and change its schema?
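Roughly, the daily ingestion step looks like this (the paths are placeholders and the options are simplified):

    # Simplified sketch of the daily ingestion; paths and options are placeholders.
    daily = (sqlContext.read
             .option("header", "true")
             .option("inferSchema", "true")  # schema inferred per day, which is where types can diverge
             .csv("path/to/daily.csv"))

    # Simple transformations / dedup happen here.
    daily = daily.dropDuplicates()

    # Each day's batch is appended as Parquet under the directory that is read back above.
    daily.write.mode("append").parquet("path/to/directory")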

Using Spark 2.3.1

Thanks.

Aaron Arima
1 Answer


You can try two options:

  1. You "mergeSchema" option to merge two files with different schema https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#schema-merging

  2. Loop through each individual file, infer the schema when reading, explicitly cast it to a common schema, and write the result back to another location (see the second sketch below).
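A minimal sketch of option 1, using the asker's sqlContext and the same hypothetical directory; mergeSchema is a standard Parquet read option (see the link above):

    # Merge the schemas of all Parquet files under the directory while reading.
    df = sqlContext.read.option("mergeSchema", "true").parquet("path/to/directory")

And a rough sketch of option 2. The target_types mapping and the output path are hypothetical, and dbutils.fs.ls is the file-listing utility available in Databricks notebooks; use your own listing mechanism elsewhere:

    from pyspark.sql.functions import col
    from pyspark.sql.types import LongType

    # Hypothetical common schema: only the columns whose types need enforcing.
    target_types = {"Ndc": LongType()}

    # List the individual Parquet files under the top-level directory.
    paths = [f.path for f in dbutils.fs.ls("path/to/directory")]

    for p in paths:
        part = sqlContext.read.parquet(p)
        # Cast any mismatched column to the agreed type.
        for name, dtype in target_types.items():
            if name in part.columns:
                part = part.withColumn(name, col(name).cast(dtype))
        # Write to a separate location so the original files stay untouched.
        file_name = p.rstrip("/").split("/")[-1]
        part.write.mode("overwrite").parquet("path/to/fixed/" + file_name)

Reading the new location afterwards should then return a single consistent schema.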

Manoj Singh