I have a daily job that converts Avro to Parquet. The Avro data is about 20 GB per hour and is partitioned by year, month, day and hour. When I read the Avro files as shown below, the job runs for 1.5 hours:

spark.read.format("com.databricks.spark.avro").load(basePath).where($"year" === 2020 && $"month" === 9 && $"day" === 1 && $"hour" === 1).write.partitionBy(paritionCol).parquet(path)

Note: the whole basePath folder holds 36 TB of data in Avro format.

However, the command below runs for just 7 minutes with the same Spark configuration (memory, instances, etc.):

spark.read.format("com.databricks.spark.avro").option("basePath", basePath).load(basePath + "year=2020/month=9/day=1/hour=1/").write.partitionBy(paritionCol).parquet(path)

Why is there such a drastic reduction in time? How does Avro do partition pruning internally?

Gladiator

1 Answer

There is a huge difference.

In the first case, you read all the files and then filter. In the second case, you read only the selected files (the filtering is already done by the partitioning).

You can check whether the filter is pushed down by calling explain(). In the FileScan avro node of the plan you will see PushedFilters and PartitionFilters.

In your case, the filter is not being pushed down.
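A minimal sketch of the explain() check, assuming a local Spark session and a tiny toy dataset. Parquet is used here as a stand-in so the example needs no extra Avro package; the object name PartitionFilterCheck, the helper names and the sample rows are all hypothetical. On your real data you would run the same where(...) on your Avro read and look at the plan it prints.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of Spark.
object PartitionFilterCheck {

  // Writes a tiny toy dataset partitioned by year/month/day/hour.
  // Parquet is an assumption made to keep the sketch dependency-free.
  def writeSample(spark: SparkSession, basePath: String): Unit = {
    import spark.implicits._
    Seq((2020, 9, 1, 1, "a"), (2020, 9, 1, 2, "b"))
      .toDF("year", "month", "day", "hour", "value")
      .write
      .mode("overwrite")
      .partitionBy("year", "month", "day", "hour")
      .parquet(basePath)
  }

  // Reads from the base path, filters on the partition columns,
  // and returns the physical plan as a string.
  def filteredPlan(spark: SparkSession, basePath: String): String =
    spark.read
      .parquet(basePath)
      .where(col("year") === 2020 && col("month") === 9 &&
        col("day") === 1 && col("hour") === 1)
      .queryExecution
      .executedPlan
      .toString

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-filter-check")
      .getOrCreate()

    val basePath = Files.createTempDirectory("events").toString
    writeSample(spark, basePath)

    // Look for the PartitionFilters and PushedFilters entries
    // on the FileScan node of the printed plan.
    println(filteredPlan(spark, basePath))

    spark.stop()
  }
}
```

Calling df.explain(true) on the filtered DataFrame prints the same physical plan together with the logical plans, which is usually enough when checking interactively in spark-shell.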

You can find more information here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html

maxime G
  • Thanks @maxime G. It would be great if you could provide some links or documentation on how Avro works internally, beyond what my question covers. – Gladiator Sep 28 '20 at 14:53
  • This is not about Avro; it is the same "problem" with Parquet and other sources. You can find more information here: https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-Optimizer-PushDownPredicate.html – maxime G Sep 28 '20 at 15:00
  • The documentation says that the predicates are used to make the search more efficient and to exclude unnecessary data before buffering into memory. But still why does it take 1.5 hrs? What is not working here? – Gladiator Sep 29 '20 at 15:32