
I've got data in Avro format, partitioned by date and hour, and I receive new data every hour. Newer partitions can contain more columns than older ones. When I read the data with Spark 2.4.3, I get a DataFrame with the schema of the first (oldest) partition, and all newly added columns are lost. What should I do to read all columns? Is there some workaround?

Thanks.

1 Answer


What you are looking for is the ability to merge the schemas of the different files Spark reads. You can achieve this with the mergeSchema option. Schema merging is not specific to Avro; the same option is also supported by other file-based sources, most notably Parquet.

sparkSession.read.format("avro").option("mergeSchema", true).load(pathToData)
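
For reference, here is a minimal end-to-end sketch of what that might look like in a Spark 2.4.x job. The session setup and pathToData are illustrative (not from your setup), and in 2.4.x the "avro" format requires the external spark-avro package (e.g. org.apache.spark:spark-avro_2.12:2.4.3) on the classpath:

import org.apache.spark.sql.SparkSession

object ReadAllColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("avro-merge-schema")
      .getOrCreate()

    // Illustrative layout: /data/events/date=2019-07-01/hour=03/part-*.avro
    val pathToData = "/data/events"

    // mergeSchema asks Spark to combine the schemas of the files it reads
    // instead of inferring the schema from a single (possibly old) file.
    val df = spark.read
      .format("avro")
      .option("mergeSchema", "true")
      .load(pathToData)

    // Columns added in newer partitions should now appear; rows from
    // older files get null for the columns they don't contain.
    df.printSchema()
  }
}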

Sim