
I've got data in Avro format, partitioned by date and hour, and I receive new data every hour. Newer partitions can contain more columns than older ones. When I read the data with Spark 2.4.3, I get a DataFrame with the schema of the first (oldest) partition, and all newly added columns are lost. What should I do to read all columns? Is there some workaround?

Thanks.

1 Answer


What you are looking for is the ability to merge the schemas of the different files Spark reads. You can achieve this with the mergeSchema option. Schema merging is not specific to Avro; the same option is also supported by other file-based sources, most notably Parquet.

sparkSession.read.format("avro").option("mergeSchema", true).load(pathToData)
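
For reference, here is a minimal end-to-end sketch of what that might look like in a Spark 2.4.x job. The session setup and pathToData are illustrative (not from your setup), and in 2.4.x the "avro" format requires the external spark-avro package (e.g. org.apache.spark:spark-avro_2.12:2.4.3) on the classpath:

import org.apache.spark.sql.SparkSession

object ReadAllColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-merge-schema")
      .getOrCreate()

    // Illustrative layout: /data/events/date=2019-07-01/hour=03/part-*.avro
    val pathToData = "/data/events"

    // mergeSchema asks Spark to combine the schemas of the files it reads
    // instead of inferring the schema from a single (possibly old) file.
    val df = spark.read
      .format("avro")
      .option("mergeSchema", "true")
      .load(pathToData)

    // Columns added in newer partitions should now appear; rows from
    // older files get null for the columns they don't contain.
    df.printSchema()
  }
}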

Sim