
I am writing a Parquet cleaning job using PyArrow. However, I only want to process native Parquet files and skip any .parquet files that belong to Iceberg, Hudi, or Delta Lake tables. These formats require updates to go through their own API/interface; manipulating the .parquet files directly will corrupt the table because the table metadata won't be synced.

I know that for Hudi there are columns that indicate this, e.g. _hoodie_commit_time.

import pyarrow.parquet as pq
print(
  pq.read_schema(
    'hudi.parquet'
  ).metadata.keys()
)

yields:

[b'parquet.avro.schema', 
 b'writer.model.name', 
 b'hoodie_bloom_filter_type_code', 
 b'org.apache.hudi.bloomfilter', 
 b'hoodie_min_record_key', 
 b'hoodie_max_record_key']

In addition, there is other notable metadata: b'hoodie_min_record_key' and b'hoodie_max_record_key'.
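
Besides the footer metadata, I can also check for Hudi's bookkeeping columns directly. A rough sketch using the same file:

import pyarrow.parquet as pq

# Sketch: detect Hudi via its bookkeeping column rather than footer metadata.
schema = pq.read_schema('hudi.parquet')
print('_hoodie_commit_time' in schema.names)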

However, I am wondering: is there any established standard for where these project/vendor-specific Parquet markers live?

Or is it all implemented differently, in various metadata fields, for each of these Parquet-derived formats?


1 Answer


I did some investigation; there is no "standard" way that these projects share.

import pyarrow


def is_hudi_parquet(schema: pyarrow.Schema) -> bool:
    # Hudi writes footer metadata keys such as b'hoodie_min_record_key'
    # and b'hoodie_max_record_key'.
    if schema.metadata:
        for metadata_key in schema.metadata.keys():
            if b"hoodie" in metadata_key:
                return True
    return False


def is_iceberg_parquet(schema: pyarrow.Schema) -> bool:
    # Iceberg writers add a footer metadata key containing "iceberg".
    if schema.metadata:
        for metadata_key in schema.metadata.keys():
            if b"iceberg" in metadata_key:
                return True
    return False
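
As a rough usage sketch (the file names here are placeholders), you can wire these checks into the cleaning loop like this:

import pyarrow.parquet as pq

# Skip any file that a table format manages; clean the rest.
for path in ["a.parquet", "hudi.parquet"]:  # placeholder file names
    schema = pq.read_schema(path)
    if is_hudi_parquet(schema) or is_iceberg_parquet(schema):
        print(f"skipping {path}: managed by a table format")
    else:
        print(f"cleaning {path}")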

I found this worked for Iceberg and Hudi, but I could not find a similar metadata key in Delta Lake tables.
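
For Delta Lake, since there is no footer marker, one fallback (my own sketch, assuming the table sits on a local filesystem) is to look for the _delta_log directory that Delta keeps at the table root:

import os


def is_delta_parquet(parquet_path: str) -> bool:
    # Sketch: Delta Lake stores its transaction log in a _delta_log
    # directory at the table root, so walk up the parents and look for it.
    directory = os.path.dirname(os.path.abspath(parquet_path))
    while True:
        if os.path.isdir(os.path.join(directory, "_delta_log")):
            return True
        parent = os.path.dirname(directory)
        if parent == directory:  # reached the filesystem root
            return False
        directory = parent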
