
When writing data to Apache Hudi on EMR using PySpark, we can specify the table name as part of the write configuration. For example:

hudiOptions = {
    'hoodie.table.name': 'tableName',
    'hoodie.datasource.write.recordkey.field': 'id',
    'hoodie.datasource.write.partitionpath.field': 'creation_date',
    'hoodie.datasource.write.precombine.field': 'last_update_time',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'tableName',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_date',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}

# Write a DataFrame as a Hudi dataset
inputDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**hudiOptions) \
    .mode('overwrite') \
    .save('s3://DOC-EXAMPLE-BUCKET/myhudidataset/')

I see the option `hoodie.table.name`. In my case, I have written data from multiple tables to the same base path, but with different table names.

However, when I read a DataFrame from the base path like this:

snapshotQueryDF = spark.read \
    .format('org.apache.hudi') \
    .load('s3://DOC-EXAMPLE-BUCKET/myhudidataset/*/*')
    
snapshotQueryDF.show()

I get results from all the tables. Is there any way to read only the data for a particular `tableName`?

Searching through the Apache Hudi configuration options, I don't see anything that helps.

  • Can we just put the table names in the S3 path so the data resides in different prefixes? (see the sketch below) – lsc May 18 '23 at 16:25
  • Yes, that's what I thought I'd do if I couldn't find a better solution. But somehow, I feel I could make use of the `tableName` configuration we are specifying. – Anurag A S May 19 '23 at 10:37
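
For reference, a minimal sketch of the per-prefix layout suggested in the first comment, reusing `hudiOptions` from the question; the `table_a` name and the per-table path suffix are hypothetical:

# Write each table under its own prefix; 'table_a' is a hypothetical name.
tableAOptions = {
    **hudiOptions,
    'hoodie.table.name': 'table_a',
    'hoodie.datasource.hive_sync.table': 'table_a',
}

inputDF.write \
    .format('org.apache.hudi') \
    .option('hoodie.datasource.write.operation', 'insert') \
    .options(**tableAOptions) \
    .mode('overwrite') \
    .save('s3://DOC-EXAMPLE-BUCKET/myhudidataset/table_a/')

# Reading from that prefix returns only table_a's records.
tableADF = spark.read \
    .format('org.apache.hudi') \
    .load('s3://DOC-EXAMPLE-BUCKET/myhudidataset/table_a')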

1 Answer


You can try using `spark.read.table`. It loads the table through its metadata in the metastore, as long as the table is registered there. Since your write enables `hoodie.datasource.hive_sync.enable`, each table should already be registered under the name set in `hoodie.datasource.hive_sync.table`.

Your code would look like this:

snapshotQueryDF = spark.read.table('<database>.<table>')
    
snapshotQueryDF.show()
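
For instance, with the Hive-sync settings from the question, reading a single table might look like the sketch below. The `default` database name is an assumption; use whichever database `hoodie.datasource.hive_sync.database` points to.

# 'default' is a hypothetical database name; Hive sync registers tables
# in the database set via hoodie.datasource.hive_sync.database.
snapshotQueryDF = spark.read.table('default.tableName')
snapshotQueryDF.show()

# The same table through Spark SQL:
spark.sql('SELECT * FROM default.tableName').show()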