We are using the Avro data format, and the data is partitioned by year, month, day, hour, and min.

I see the data stored in HDFS as

/data/year=2018/month=01/day=01/hour=01/min=00/events.avro

And we load the data using

import org.apache.avro.Schema

val schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/schema.txt"))
val df = spark.read.format("com.databricks.spark.avro").option("avroSchema", schema.toString).load("/data")
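
For reference, printing the schema right after the load (nothing more than printSchema on the DataFrame above) is how I noticed the extra columns:

df.printSchema()  // lists year, month, day, hour, min in addition to the fields defined in schema.txt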

And then we filter the data, relying on predicate push-down:

val x = isInRange(startDate, endDate)($"year", $"month", $"day", $"hour", $"min")
val filteredDf = df.filter(x)
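
To check whether the filter actually prunes input files, one thing I can look at is the physical plan. Below is a minimal sketch (assuming spark.implicits._ is in scope, and using literal values on the partition columns instead of my isInRange UDF); the FileScan node in the output should report which partition filters were applied:

// Hypothetical literal filter on the partition columns, just to inspect the plan
df.filter($"year" === 2018 && $"month" === 1 && $"day" === 1).explain(true)
// The FileScan node lists PartitionFilters / PushedFilters when pruning is applied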

Can someone explain what is happening behind the scenes? I specifically want to understand when and where the filtering of the input files happens. Interestingly, when I print the schema, the fields year, month, day, and hour are added automatically, i.e. the actual data does not contain these columns. Does Avro add these fields? I want to understand clearly how the files are filtered and how the partitions are created.
