We are using the Avro data format, and the data is partitioned by year, month, day, hour, and min.

I see the data stored in HDFS as

/data/year=2018/month=01/day=01/hour=01/min=00/events.avro

And we load the data using

import org.apache.avro.Schema

val schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/schema.txt"))
val df = spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/data")
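
For reference, printing the schema right after the load (nothing more than printSchema on the DataFrame above) is how I noticed the extra columns:

df.printSchema()  // lists year, month, day, hour, min in addition to the fields defined in schema.txt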

And then we filter the data, relying on predicate push-down:

val x = isInRange(startDate, endDate)($"year", $"month", $"day", $"hour", $"min")
val filteredDf = df.filter(x)
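
To check whether the filter actually prunes input files, one thing I can look at is the physical plan. Below is a minimal sketch (assuming spark.implicits._ is in scope, and using literal values on the partition columns instead of my isInRange UDF); the FileScan node in the output should report which partition filters were applied:

// Hypothetical literal filter on the partition columns, just to inspect the plan
df.filter($"year" === 2018 && $"month" === 1 && $"day" === 1).explain(true)
// The FileScan node lists PartitionFilters / PushedFilters when pruning is applied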

Can someone explain what is happening behind the scenes? I specifically want to understand when and where the filtering of the input files happens. Interestingly, when I print the schema, the fields year, month, day, and hour are added automatically, i.e. the actual data does not contain these columns. Does Avro add these fields? I want to understand clearly how the files are filtered and how the partitions are created.
