I have a daily job that converts Avro files to Parquet.
The Avro data is about 20 GB per hour and is partitioned by year, month, day, and hour.
When I read the Avro data like this:
spark.read.format("com.databricks.spark.avro").load(basePath).where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1).write.partitionBy(partitionCol).parquet(path)
the job runs for 1.5 hours.
Note: the whole basePath folder holds 36 TB of data in Avro format.
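For reference, the directory layout under basePath looks roughly like this (partitioned by the columns above):

    basePath/
      year=2020/
        month=9/
          day=1/
            hour=0/
            hour=1/
            ...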
But the command below runs in just 7 minutes with the same Spark configuration (memory, number of instances, etc.):
spark.read.format("com.databricks.spark.avro").option("basePath", basePath).load(basePath + "year=2020/month=9/day=1/hour=1/").write.partitionBy(partitionCol).parquet(path)
Why is there such a drastic reduction in time? How does the Avro data source handle partition pruning internally?
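For what it's worth, here is a minimal sketch (assuming the same SparkSession and the partition column names above; the basePath value is a hypothetical placeholder) of how I would compare the physical plans of the two reads, e.g. to see whether the filter shows up as a PartitionFilter on the FileScan node:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val basePath = "/path/to/avro/"  // hypothetical placeholder for the real base path
    val targetPath = basePath + "year=2020/month=9/day=1/hour=1/"

    // Variant 1: load the whole base path and filter on the partition columns.
    val filtered = spark.read
      .format("com.databricks.spark.avro")
      .load(basePath)
      .where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1)

    // Variant 2: point the reader directly at the partition directory,
    // keeping basePath so the partition columns stay in the schema.
    val direct = spark.read
      .format("com.databricks.spark.avro")
      .option("basePath", basePath)
      .load(targetPath)

    // Extended plans; the FileScan node lists any PartitionFilters applied.
    filtered.explain(true)
    direct.explain(true)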