I have a dataset of around 180 million records in .csv files that I transform into Hudi parquet with a Glue job. The table is partitioned by one column. The write completes successfully, but reading the Hudi data back in a Glue job takes too long (>30 min).
I tried to read only one partition with
spark.read.format("hudi").load("s3://somes3bucket") \
    .where("partition1 = 'somevalue'")
but there is no difference.
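For reference, this is roughly how I measure the read time (the bucket name, partition column, and value are placeholders, as above); the elapsed time is about the same with or without the filter:

import time

start = time.time()
df = (
    spark.read.format("hudi")
    .load("s3://somes3bucket")
    .where("partition1 = 'somevalue'")  # same partition filter as above
)
row_count = df.count()  # count() forces the actual read
print(f"rows: {row_count}, elapsed: {time.time() - start:.0f} s")  # well over 30 minutes either way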
I also tried an incremental read, but it always returns zero records:
incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '000',
    'hoodie.datasource.read.incr.path.glob': ''
}
DFhudi = spark.read.format("org.apache.hudi").options(**incremental_read_options).load(path_hudi)
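I check the result like this (path_hudi is the table's base path on S3), and the count is 0 on every run:

print(DFhudi.count())           # always prints 0
DFhudi.show(5, truncate=False)  # empty result set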
I also have a problem with partition projection in Athena on this table. The partition column values range from a minimum of 200000 to a maximum of 3500000. Queries that filter on the partition column in the WHERE clause work fine, but without that filter I get this error:
HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'table' can potentially read more than 1000000 partitions
DDL for partition projection:
TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.reported_qc_session_id.range'='200000, 4000000',
  'projection.reported_qc_session_id.type'='integer',
  'storage.location.template'='s3://bucket/table/partition=${partition}'
)
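To make the two cases concrete, this is roughly how I reproduce them programmatically (I normally run the queries from the Athena console; the database name, result location, and filter value below are placeholders):

import boto3

athena = boto3.client("athena")

def run_query(sql):
    # Database and result location are placeholders for my setup.
    resp = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
    )
    return resp["QueryExecutionId"]

# Works: the projected partition column is constrained in the WHERE clause.
run_query("SELECT count(*) FROM table WHERE reported_qc_session_id = 250000")

# Fails with HIVE_EXCEEDED_PARTITION_LIMIT once the query executes.
run_query("SELECT count(*) FROM table")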
What can I do to decrease the Hudi read time and fix the partition projection problem?