I am writing a Parquet cleanup job using PyArrow. However, I only want to process native Parquet files and skip any .parquet files that belong to an Iceberg, Hudi, or Delta Lake table. This is because those formats require updates to go through their own API/interface; manipulating the .parquet files directly will corrupt the table, because the table metadata won't stay in sync.
I know that for Hudi there are columns that indicate this, e.g. _hoodie_commit_time. The schema metadata also gives it away:
import pyarrow.parquet as pq

print(pq.read_schema('hudi.parquet').metadata.keys())
yields:
[b'parquet.avro.schema',
b'writer.model.name',
b'hoodie_bloom_filter_type_code',
b'org.apache.hudi.bloomfilter',
b'hoodie_min_record_key',
b'hoodie_max_record_key']
Among those keys, b'hoodie_min_record_key' and b'hoodie_max_record_key' are particularly notable.
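For reference, this is the heuristic I am currently using to flag Hudi-managed files based on the observations above. It is only a sketch under my own assumptions: keying off the _hoodie_* meta columns and the hoodie*/org.apache.hudi* metadata keys is not something I have seen documented as a stable contract, and the file name 'hudi.parquet' is just a placeholder.

import pyarrow.parquet as pq

def looks_like_hudi(path: str) -> bool:
    """Heuristic: treat a file as Hudi-managed if it carries Hudi meta
    columns or Hudi-prefixed key/value metadata in its schema."""
    schema = pq.read_schema(path)

    # Hudi writers add meta columns such as _hoodie_commit_time and
    # _hoodie_record_key to every record.
    if any(name.startswith('_hoodie_') for name in schema.names):
        return True

    # Hudi also stores bloom-filter / record-key info in the
    # schema-level key/value metadata (keys are bytes).
    meta = schema.metadata or {}
    for key in meta:
        k = key.decode('utf-8', errors='ignore').lower()
        if k.startswith('hoodie') or k.startswith('org.apache.hudi'):
            return True

    return False

print(looks_like_hudi('hudi.parquet'))  # expect True for the file above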
However, I am wondering: is there any established standard for where these project/vendor-specific markers live in a Parquet file, or does each of these Parquet-derived table formats implement them differently in its own metadata fields?
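In case it helps frame the question, this is the direction I am leaning for the other two formats, purely as per-format heuristics. The directory names _delta_log (Delta Lake's transaction log) and metadata (where Iceberg keeps its *.metadata.json files) are how those formats lay out tables on disk, but treating them as a detection signal, and walking a fixed number of parent directories, are my own assumptions.

import os

def looks_like_delta(parquet_path: str) -> bool:
    # Delta Lake keeps its transaction log in a _delta_log directory at
    # the table root; data files sit in the root or in partition dirs.
    current = os.path.dirname(os.path.abspath(parquet_path))
    for _ in range(3):  # walk a few levels up to cover partitioned layouts
        if os.path.isdir(os.path.join(current, '_delta_log')):
            return True
        current = os.path.dirname(current)
    return False

def looks_like_iceberg(parquet_path: str) -> bool:
    # Iceberg tables keep *.metadata.json and manifest files under a
    # "metadata" directory that sits next to the "data" directory.
    current = os.path.dirname(os.path.abspath(parquet_path))
    for _ in range(3):
        meta_dir = os.path.join(current, 'metadata')
        if os.path.isdir(meta_dir) and any(
            name.endswith('.metadata.json') for name in os.listdir(meta_dir)
        ):
            return True
        current = os.path.dirname(current)
    return False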