I have a dataset of around 180 million records in .csv files that I transform into Hudi parquet with a Glue job. The table is partitioned by one column. The write completes successfully, but reading the Hudi data back in a Glue job takes too long (>30 min).
I tried to read only one partition with
spark.read.format("hudi").load("s3://somes3bucket") \
    .where("partition1 = 'somevalue'")
but there is no difference.
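For reference, this is roughly how I measure the read time (the bucket name, partition column, and value are placeholders, as above); the elapsed time is about the same with or without the filter:

import time

start = time.time()
df = (
    spark.read.format("hudi")
    .load("s3://somes3bucket")
    .where("partition1 = 'somevalue'")  # same partition filter as above
)
row_count = df.count()  # count() forces the actual read
print(f"rows: {row_count}, elapsed: {time.time() - start:.0f} s")  # well over 30 minutes either way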
I also tried an incremental read, but it always returns zero records:
incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '000',
    'hoodie.datasource.read.incr.path.glob': ''
}
DFhudi = spark.read.format("org.apache.hudi").options(**incremental_read_options).load(path_hudi)
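I check the result like this (path_hudi is the table's base path on S3), and the count is 0 on every run:

print(DFhudi.count())           # always prints 0
DFhudi.show(5, truncate=False)  # empty result set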
I also have a problem with partition projection in Athena on this table. The partition column values range from a minimum of 200000 to a maximum of 3500000. Queries that filter on the partition column in the WHERE clause work fine, but without that filter I get this error:
HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'table' can potentially read more than 1000000 partitions
DDL for partition projection:
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.reported_qc_session_id.range'='200000, 4000000',
  'projection.reported_qc_session_id.type'='integer',
  'storage.location.template'='s3://bucket/table/partition=${partition}'
)
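To make the two cases concrete, this is roughly how I reproduce them programmatically (I normally run the queries from the Athena console; the database name, result location, and filter value below are placeholders):

import boto3

athena = boto3.client("athena")

def run_query(sql):
    # Database and result location are placeholders for my setup.
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
    )
    return resp["QueryExecutionId"]

# Works: the projected partition column is constrained in the WHERE clause.
run_query("SELECT count(*) FROM table WHERE reported_qc_session_id = 250000")

# Fails with HIVE_EXCEEDED_PARTITION_LIMIT once the query executes.
run_query("SELECT count(*) FROM table")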
What can I do to decrease the Hudi read time and fix the partition projection problem?