Originally I was writing and reading C++ struct data to and from a binary file, using reinterpret_cast<T>. This was convenient because no code changes were required when a new member was added; the cast handled it automatically.
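A minimal sketch of that kind of binary round trip, assuming a hypothetical trivially copyable struct (LegacyRecord, write_rows and read_rows are made-up names for illustration, not the original code):

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical row type; the real struct has different members.
struct LegacyRecord {
    std::int64_t id;
    double value;
};

// Dumping raw bytes is only well defined for trivially copyable types.
static_assert(std::is_trivially_copyable<LegacyRecord>::value,
              "LegacyRecord must be trivially copyable");

// Write all rows as one contiguous block of bytes.
bool write_rows(const std::string& path, const std::vector<LegacyRecord>& rows) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(rows.data()),
              static_cast<std::streamsize>(rows.size() * sizeof(LegacyRecord)));
    return static_cast<bool>(out);
}

// Read the rows back; the row count is derived from the file size.
std::vector<LegacyRecord> read_rows(const std::string& path) {
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return {};
    const auto bytes = static_cast<std::size_t>(in.tellg());
    std::vector<LegacyRecord> rows(bytes / sizeof(LegacyRecord));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(rows.data()),
            static_cast<std::streamsize>(rows.size() * sizeof(LegacyRecord)));
    return rows;
}
```

Adding a member to the struct changes sizeof(LegacyRecord) and nothing else in this code, which is the low-maintenance property I'd like to keep. The catch is that files written with the old layout are no longer readable, and the layout is only meaningful to C++ built with the same compiler and ABI.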
I'm now writing to a Parquet file using the code from the Apache GitHub example (below), so that both C++ and Python can access the data.
However, each time a new field is added I now need to add a new column builder, add another .Finish() call, add the column to the schema and the table, etc. Then presumably I will need to add the new field on the read side too.
Two questions:
What's the best way to write and read Parquet data so that adding new fields in the future requires as little maintenance as possible?
I want to read the entire Parquet file into a std::array<MyStruct>, where MyStruct is a row in the Parquet file. Is this possible, given that Parquet is column-based?
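To make the second question concrete, here is a sketch of the conversion I have in mind. Plain std::vector columns stand in for whatever the Parquet reader returns; Columns, to_rows and the members of MyStruct are hypothetical and not part of the Arrow API. A std::vector is used for the result because the row count is only known at run time.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical row type matching the two columns in the example table below.
struct MyStruct {
    std::int64_t i;
    std::string s;
};

// Stand-in for the columns read from the file (not an Arrow type).
struct Columns {
    std::vector<std::int64_t> ints;
    std::vector<std::string> strs;
};

// Transpose column-wise data into one struct per row.
std::vector<MyStruct> to_rows(const Columns& cols) {
    if (cols.ints.size() != cols.strs.size())
        throw std::invalid_argument("columns differ in length");
    std::vector<MyStruct> rows;
    rows.reserve(cols.ints.size());
    for (std::size_t r = 0; r < cols.ints.size(); ++r)
        rows.push_back({cols.ints[r], cols.strs[r]});
    return rows;
}
```

Even in this toy version, every new field needs a new member in Columns and a new argument in the push_back, which is exactly the per-field maintenance I'm hoping to avoid.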
Parquet write example code from GitHub:
std::shared_ptr<arrow::Table> generate_table()
{
    arrow::Int64Builder i64builder;
    PARQUET_THROW_NOT_OK(i64builder.AppendValues({1, 2, 3, 4, 5}));
    std::shared_ptr<arrow::Array> i64array;
    PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));

    arrow::StringBuilder strbuilder; // New field requires one of these
    PARQUET_THROW_NOT_OK(strbuilder.Append("some"));
    PARQUET_THROW_NOT_OK(strbuilder.Append("string"));
    PARQUET_THROW_NOT_OK(strbuilder.Append("content"));
    PARQUET_THROW_NOT_OK(strbuilder.Append("in"));
    PARQUET_THROW_NOT_OK(strbuilder.Append("rows"));
    std::shared_ptr<arrow::Array> strarray;
    PARQUET_THROW_NOT_OK(strbuilder.Finish(&strarray)); // and these

    // This needs to be updated:
    std::shared_ptr<arrow::Schema> schema = arrow::schema({
        arrow::field("int", arrow::int64()),
        arrow::field("str", arrow::utf8())});

    // And this
    return arrow::Table::Make(schema, {i64array, strarray});
}

void write_parquet_file(const arrow::Table& table)
{
    std::shared_ptr<arrow::io::FileOutputStream> outfile;
    PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
    PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));
}