I'm trying to create an iceberg table format on cloud object storage.
In the below image we can see that iceberg table format needs a catalog. This catalog stores current metadata pointer, which points to the latest metadata. The Iceberg quick start doc lists JDBC, Hive MetaStore, AWS Glue, Nessie and HDFS as list of catalogs that can be used.
My goal is to store the current metadata pointer(version-hint.text) along with rest of the table data(metadata, manifest lists, manifest, parquet data files) in the object store itself.
With HDFS as the catalog, there’s a file called version-hint.text in the table’s metadata folder whose contents is the version number of the current metadata file.
Looking at HDFS as one of the possible catalogs, I should be able to use ADLS or S3 to store the current metadata pointer along with rest of the data. For example: spark connecting to ADLS using ABFSS interface and creating iceberg table along with catalog.
My question is
- Is it safe to use version hint file as current metadata pointer in ADLS/S3? Will I lose any of the iceberg features if I do this? Looking at this comment from one of the contributors suggests that its not ideal for production.
The version hint file is used for Hadoop tables, which are named that way because they are intended for HDFS. We also use them for local FS tests, but they can't be safely used concurrently with S3. For S3, you'll need a metastore to enforce atomicity when swapping table metadata locations. You can use the one in iceberg-hive to use the Hive metastore.
- Looking at comments on this thread, Is version-hint.text file optional?
we iterate through on the possible metadata locations and stop only if there is not new snapshot is available
Could someone please clarify?
I'm trying to do a POC with Iceberg. At this point the requirement is to be able to write new data from data bricks to the table at least every 10 mins. This frequency might increase in the future.
The data once written will be read by databricks and dremio.