
When writing data to Apache Hudi on EMR using PySpark, we can specify the table name as part of the write configuration. For example:

hudiOptions = {
    'hoodie.table.name': 'tableName',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'tableName',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}

# Write a DataFrame as a Hudi dataset
inputDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3://DOC-EXAMPLE-BUCKET/myhudidataset/')

I see the option `hoodie.table.name`. In my case, I have written data from multiple tables to the same base path, but with different table names.

However, when I read a DataFrame from the base path like this:

snapshotQueryDF = spark.read \
    .format('org.apache.hudi') \
    .load('s3://DOC-EXAMPLE-BUCKET/myhudidataset/*/*')
    
snapshotQueryDF.show()

I get results from all the tables. Is there any way to read only the data for a particular `tableName`?

Searching through the Apache Hudi configuration options, I don't see anything that helps.

  • Can we just put the table names in the S3 path so the data resides in different prefixes? (see the sketch below) – lsc May 18 '23 at 16:25
  • Yes, that's what I thought I'd do if I couldn't find a better solution. But somehow, I feel I could make use of the `tableName` configuration we are specifying. – Anurag A S May 19 '23 at 10:37
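
For reference, a minimal sketch of the per-prefix layout suggested in the first comment, reusing `hudiOptions` from the question; the `table_a` name and the per-table path suffix are hypothetical:

# Write each table under its own prefix; 'table_a' is a hypothetical name.
tableAOptions = {
    **hudiOptions,
    'hoodie.table.name': 'table_a',
    'hoodie.datasource.hive_sync.table': 'table_a',
}

inputDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**tableAOptions) \
    .mode('overwrite') \
    .save('s3://DOC-EXAMPLE-BUCKET/myhudidataset/table_a/')

# Reading from that prefix returns only table_a's records.
tableADF = spark.read \
    .format('org.apache.hudi') \
    .load('s3://DOC-EXAMPLE-BUCKET/myhudidataset/table_a')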

1 Answer


You can try using `spark.read.table`. It loads the table through its metadata in the metastore, as long as the table is registered there. Since your write enables `hoodie.datasource.hive_sync.enable`, each table should already be registered under the name set in `hoodie.datasource.hive_sync.table`.

Your code would look like this:

snapshotQueryDF = spark.read.table('<database>.<table>')
    
snapshotQueryDF.show()
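
For instance, with the Hive-sync settings from the question, reading a single table might look like the sketch below. The `default` database name is an assumption; use whichever database `hoodie.datasource.hive_sync.database` points to.

# 'default' is a hypothetical database name; Hive sync registers tables
# in the database set via hoodie.datasource.hive_sync.database.
snapshotQueryDF = spark.read.table('default.tableName')
snapshotQueryDF.show()

# The same table through Spark SQL:
spark.sql('SELECT * FROM default.tableName').show()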