If I write a Parquet file using simple-write-parquet.cpp, I expect to get a simple Parquet file with a single column, MyInt. The program also attaches KeyValueMetadata with some dummy values to the field MyInt. In the C++ code, if I do

std::cout << field->ToString(true) << std::endl;

I see the expected output:

...
-- metadata --
foo: bar
bar: foo

and I expect that this metadata will be preserved in the output Parquet file.

However, when I read this file back using pyarrow, the field's key-value metadata is missing:

import pyarrow as pa
import pyarrow.parquet as pq

table = pq.read_table("test.parquet")
field = table.field("MyInt")
field.metadata # None!

Is there a way to retrieve, from within pyarrow, the KeyValueMetadata that the C++ side attached to fields and to the schema (e.g. via the WithMetadata methods) when writing the Parquet file to disk?

dantrim

1 Answer

It looks like the metadata isn't saved by default. Try turning on store_schema in the ArrowWriterProperties, which stores the serialized Arrow schema (including field metadata) in the Parquet file:


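#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

void write_parquet_file(const arrow::Table& table)
{
    std::shared_ptr<arrow::io::FileOutputStream> outfile;
    PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("test.parquet"));
    PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
        table,
        arrow::default_memory_pool(),
        outfile,
        3,  // chunk_size: maximum number of rows per row group
        parquet::default_writer_properties(),
        // store_schema() keeps the Arrow schema, and with it the field metadata
        parquet::ArrowWriterProperties::Builder().store_schema()->build()));
}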

With that change, reading the file back in pyarrow should show the metadata:

>>> table.field('MyInt').metadata
{b'PARQUET:field_id': b'1', b'bar': b'foo', b'foo': b'bar'}

Note that Parquet also adds some metadata of its own (here PARQUET:field_id), which you will have to filter out.
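
One way to do that filtering is sketched below. The helper name user_metadata and the reserved_prefix parameter are made up for this illustration and are not part of pyarrow. It assumes the keys added by Parquet all start with PARQUET:, as in the output above, and it works on the plain bytes-to-bytes mapping that field.metadata returns.

```python
from typing import Mapping, Optional


def user_metadata(
    metadata: Optional[Mapping[bytes, bytes]],
    reserved_prefix: bytes = b"PARQUET:",  # assumption: prefix of the added keys
) -> dict:
    """Return the metadata without the keys that start with reserved_prefix."""
    if metadata is None:
        # field.metadata is None when a field carries no metadata at all
        return {}
    return {
        key: value
        for key, value in metadata.items()
        if not key.startswith(reserved_prefix)
    }


# Metadata copied from the output shown above.
raw = {b"PARQUET:field_id": b"1", b"bar": b"foo", b"foo": b"bar"}
print(user_metadata(raw))
```

With the table from the question this would be called as user_metadata(table.field("MyInt").metadata).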

0x26res