Partition pruning not working on Hudi dataset

Question

We have created a Hudi dataset with two-level partitioning, laid out like this:

s3://somes3bucket/partition1=value/partition2=value

where partition1 and partition2 are both of type string.

When running a simple count query using the Hudi format in spark-shell, it takes almost 3 minutes to complete:

spark.read.format("hudi").load("s3://somes3bucket").
  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
  count()

res1: Long = ####

attempt 1: 3.2 minutes
attempt 2: 2.5 minutes

The Spark UI metrics show that ~9000 tasks are used for the computation, which is roughly the total number of files in the ENTIRE dataset s3://somes3bucket. It seems Spark is reading the entire dataset and only then filtering it by the where clause, instead of pruning partitions first.
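One way to check this suspicion without relying on the Spark UI is to look at the physical plan text. The sketch below assumes the scan node prints a `PartitionFilters: [...]` entry, as Spark's built-in file scans do; whether a Hudi 0.7.0 scan prints the same entry is an assumption, and `hasPartitionFilters` is a hypothetical helper, not part of Spark or Hudi.

```scala
// Hypothetical helper (not a Spark or Hudi API): returns true when the plan
// text contains a PartitionFilters list with at least one entry.
// An empty list is printed as "PartitionFilters: []" and does not match.
def hasPartitionFilters(planText: String): Boolean =
  """PartitionFilters: \[[^\]]""".r.findFirstIn(planText).isDefined

// Possible usage in spark-shell (assumes the `spark` session it provides):
//   val df = spark.read.format("hudi").load("s3://somes3bucket").
//     where("partition1 = 'somevalue' and partition2 = 'somevalue'")
//   hasPartitionFilters(df.queryExecution.executedPlan.toString)
```

If this returns false for the Hudi read but true for the parquet read, the predicate is probably being applied as an ordinary data filter after all files have been listed, which would be consistent with the task counts above.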

Whereas if I read the dataset using the parquet format, the same query takes only ~30 seconds (versus ~3 minutes with the Hudi format):

spark.read.parquet("s3://somes3bucket").
  where("partition1 = 'somevalue' and partition2 = 'somevalue'").
  count()

res2: Long = ####

~ 30 seconds

In the Spark UI for this run, only 1361 files are scanned (versus ~9000 files with Hudi), and this takes only 15 seconds.

Any idea why partition pruning is not working with the Hudi format? Am I missing some configuration when creating the dataset?

PS: I ran this query on emr-6.3.0, which has Hudi version 0.7.0.

@Dude0001 we have upgraded to the latest Hudi version. No issues now — Raj, Aug 16 '22 at 21:27
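For anyone who cannot upgrade, a possible workaround is to point the read at the partition directory itself rather than at the table root, so that only that partition's files are listed. This is a sketch and untested against this dataset: it assumes the Hudi version in use accepts a glob path under the base path, and `partitionGlob` is a hypothetical helper, not a Hudi API.

```scala
// Hypothetical helper (not a Hudi API): builds a glob path for one
// hive-style partition, e.g. base/partition1=a/partition2=b/*
def partitionGlob(basePath: String, parts: Seq[(String, String)]): String = {
  // Render each (column, value) pair as a "column=value" directory name.
  val dirs = parts.map { case (col, value) => s"$col=$value" }
  // Drop a trailing slash on the base so the result has no "//" segment.
  (basePath.stripSuffix("/") +: dirs :+ "*").mkString("/")
}

// Possible usage in spark-shell (assumes the `spark` session it provides):
//   val path = partitionGlob("s3://somes3bucket",
//     Seq("partition1" -> "somevalue", "partition2" -> "somevalue"))
//   spark.read.format("hudi").load(path).count()
```

The trade-off is that the partition values are encoded in the path instead of the where clause, so the query has to be rebuilt for each partition.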
