I am running an incremental query with Spark-Hudi every hour and saving each incremental query's begin and end time in a database (say MySQL) every time. For the next incremental query, I use the end time of the previous query, fetched from MySQL, as the begin time.
The incremental query options look like this:
hudi_incremental_read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': hudi_start_commit,
    'hoodie.datasource.read.end.instanttime': hudi_end_commit
}
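For context, the incremental read itself is then issued with those options. A minimal sketch (assuming l1_base_path points at the Hudi table):

# Issue the incremental read using the options dict above.
incremental_df = (
    spark_session.read.format("hudi")
    .options(**hudi_incremental_read_options)
    .load(l1_base_path)
)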
But I am not sure how to find hudi_end_commit in PySpark (Python). In Java I can do this with the helper class HoodieDataSourceHelpers:
String hudi_end_commit = HoodieDataSourceHelpers.latestCommit(
    FileSystem.get(javaSparkContext.hadoopConfiguration()), l1BasePath);
But I am unable to find a way to do the same in Python.
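A py4j-based sketch of the same idea (untested; it assumes the Hudi Spark bundle jar is on the driver classpath and relies on Spark's internal _jvm/_jsc handles):

# Reach into the JVM behind the PySpark session and call the Java helper directly.
jvm = spark_session._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    spark_session._jsc.hadoopConfiguration()
)
hudi_end_commit = jvm.org.apache.hudi.HoodieDataSourceHelpers.latestCommit(
    fs, l1_base_path
)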
As a workaround, I found a solution, but it is not feasible for large datasets:
# Snapshot-load the entire table just to read the commit timeline.
spark_session.read.format("hudi").load(l1_base_path) \
    .createOrReplaceTempView("hudi_trips_snapshot")
commits = list(map(
    lambda row: row[0],
    spark_session.sql(
        "select distinct(_hoodie_commit_time) as commitTime "
        "from hudi_trips_snapshot order by commitTime desc"
    ).limit(1).collect()
))
But when the table is large, this loads the whole dataset just to get the Hudi commits, which takes more time than reading the actual data itself.
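A cheaper direction might be to list the Hudi timeline directory instead of scanning the data files. This is an untested sketch: it assumes completed instants appear as "<instant_time>.commit" files under <base_path>/.hoodie/ (the suffix differs for merge-on-read tables, e.g. ".deltacommit", and the layout can vary across Hudi versions):

# List the timeline files under .hoodie/ and take the newest completed instant.
hadoop = spark_session._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem.get(spark_session._jsc.hadoopConfiguration())
statuses = fs.listStatus(hadoop.fs.Path(l1_base_path + "/.hoodie"))
commit_times = [
    s.getPath().getName().split(".")[0]
    for s in statuses
    if s.getPath().getName().endswith(".commit")
]
hudi_end_commit = max(commit_times) if commit_times else None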
Is there an easy way to find the latest/last Hudi commit?