
For multi-cluster writes to S3, Delta Lake uses DynamoDB to atomically check whether a commit file already exists before writing it, because S3 does not provide a "put-if-absent" consistency guarantee. So to get concurrent writes with Delta Lake we need DynamoDB, which is an extra component for us to pay for and maintain. I would like to understand how this works with Hudi.
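For reference, the Delta Lake side looks roughly like the sketch below (Delta 2.x with the delta-storage-s3-dynamodb artifact on the classpath; the DynamoDB table name, region, and S3 path are placeholders, not my actual setup):

```
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Route Delta commit writes on s3a:// paths through the DynamoDB-backed
    # LogStore, which supplies the put-if-absent semantics S3 lacks.
    .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
    .getOrCreate()
)

# With this in place, writers on different clusters can append concurrently.
spark.range(10).write.format("delta").mode("append").save("s3a://my-bucket/tables/events")
```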

Similarly, does Hudi also require DynamoDB for multi-writer support on S3? Or does it use something else instead of DynamoDB?

I don't see anything mentioned specifically in the Hudi docs about multiple writers writing to the same table in S3.

Raj

1 Answer


Hudi supports several types of lock providers; check the concurrency control docs.

But for the AWS ecosystem, DynamoDB seems to be the best choice, and it is what AWS itself suggests.
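A minimal sketch of what that looks like in PySpark, assuming Hudi 0.10+ with the hudi-aws bundle on the classpath (table name, record key, lock table, region, and path are all placeholders):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumnRenamed("id", "key")

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "key",
    "hoodie.datasource.write.precombine.field": "key",
    # Enable multi-writer support via optimistic concurrency control.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # Acquire table-level locks in DynamoDB instead of relying on S3 semantics.
    "hoodie.write.lock.provider": "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",
    "hoodie.write.lock.dynamodb.partition_key": "events",
    "hoodie.write.lock.dynamodb.region": "us-east-1",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3a://my-bucket/hudi/events"))
```

So the answer to the cost question is: multi-writer on S3 with Hudi also needs an external lock service, and on AWS that is typically DynamoDB as well.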

Currently, Hive Metastore locking does not work properly with Glue (see this issue).

But you can try the FileSystem-based lock provider (sketched below).
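If you want to experiment with it, swap the lock-related options in the DynamoDB sketch above for something like the following (the lock path is a placeholder; per the docs it should live on storage with atomic file creation, e.g. HDFS, not S3/GCS):

```
hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # Locks by atomically creating a file under this path; this relies on the
    # filesystem providing atomic create, which S3 and GCS do not guarantee.
    "hoodie.write.lock.provider": "org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider",
    "hoodie.write.lock.filesystem.path": "hdfs:///tmp/hudi-locks/events",
}
```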

  • Does S3 support the file-system lock provider at all, either on-premise S3 or in the cloud? The Hudi docs state it clearly for cloud storage: "FileSystem based lock provider is not supported with cloud storage like S3 or GCS." – Raj Jun 23 '23 at 13:59
  • I was thinking about hoodie.write.lock.filesystem.path, because the local /tmp/ folder can be accessed from within Glue, see https://www.linkedin.com/pulse/tips-aws-glue-pyspark-orchestration-kishore-kumar-mohan -- but no guarantee it will work. Btw, here is an interesting article on concurrent writes without any lock provider: https://medium.com/@simpsons/can-you-concurrently-write-data-to-apache-hudi-w-o-any-lock-provider-51ea55bf2dd6 -- also can't guarantee it'll work, but you can try. – ground_control Aug 03 '23 at 10:22