
Originally I was writing and reading C++ struct data to/from a file as raw binary, using reinterpret_cast. This was good because no code changes were required when a new member was added; the cast handled it automatically.
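
Roughly what the old approach looked like (simplified; MyStruct here is just a stand-in for my real struct):

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <vector>

struct MyStruct            // placeholder for the real struct
{
  std::int64_t id;
  double       value;
};

void write_binary(const std::vector<MyStruct>& rows)
{
  std::ofstream out("data.bin", std::ios::binary);
  // Dump the contiguous array of structs byte-for-byte.
  out.write(reinterpret_cast<const char*>(rows.data()), rows.size() * sizeof(MyStruct));
}

std::vector<MyStruct> read_binary(std::size_t count)
{
  std::vector<MyStruct> rows(count);
  std::ifstream in("data.bin", std::ios::binary);
  // Read the bytes straight back into the structs; sizeof(MyStruct) picks up new members automatically.
  in.read(reinterpret_cast<char*>(rows.data()), rows.size() * sizeof(MyStruct));
  return rows;
}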

I'm now writing to a Parquet file using the code from the Apache Arrow GitHub example (below). This is so both C++ and Python can access the data.

However, now each time a new field is added I need to add a new column builder, add another .Finish() call, add the column to the schema and the table, and so on. Then presumably I will need to add the new field on the read side too.

Two questions:

  1. What's the best way to write and read Parquet data so as to minimize the maintenance needed when new fields are added in the future?

  2. I want to read the entire Parquet file into an std::array<MyStruct>, where MyStruct is one row of the Parquet file. Is this possible, given that Parquet is columnar? (A rough sketch of what I mean is below, after the write example.)

Parquet write example code from GitHub:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
#include <parquet/exception.h>

std::shared_ptr<arrow::Table> generate_table()
{
  arrow::Int64Builder i64builder;
  PARQUET_THROW_NOT_OK(i64builder.AppendValues({1, 2, 3, 4, 5}));
  std::shared_ptr<arrow::Array> i64array;
  PARQUET_THROW_NOT_OK(i64builder.Finish(&i64array));

  arrow::StringBuilder strbuilder;                          // New field requires one of these
  PARQUET_THROW_NOT_OK(strbuilder.Append("some"));
  PARQUET_THROW_NOT_OK(strbuilder.Append("string"));
  PARQUET_THROW_NOT_OK(strbuilder.Append("content"));
  PARQUET_THROW_NOT_OK(strbuilder.Append("in"));
  PARQUET_THROW_NOT_OK(strbuilder.Append("rows"));
  std::shared_ptr<arrow::Array> strarray;
  PARQUET_THROW_NOT_OK(strbuilder.Finish(&strarray));       // and these


  //                                                        This needs to be updated:
  std::shared_ptr<arrow::Schema> schema = arrow::schema({
                                                          arrow::field("int", arrow::int64()), 
                                                          arrow::field("str", arrow::utf8())});

  //                                      And this
  return arrow::Table::Make(schema, {i64array, strarray});
}

void write_parquet_file(const arrow::Table& table) 
{
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open("parquet-arrow-example.parquet"));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(table, arrow::default_memory_pool(), outfile, 3));  // 3 = chunk size (rows per row group)
}
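
For reference, here is roughly what I expect the read side to look like, based on the read half of the same example, with the rows then stitched back together by hand. MyStruct and the per-column handling below just mirror the two columns in the example; this per-field code is exactly the maintenance I would like to avoid:

#include <cstdint>
#include <string>
#include <vector>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/exception.h>

struct MyStruct                     // one row of the example table
{
  std::int64_t i;
  std::string  s;
};

std::vector<MyStruct> read_parquet_file()
{
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_ASSIGN_OR_THROW(infile, arrow::io::ReadableFile::Open("parquet-arrow-example.parquet"));

  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(parquet::arrow::OpenFile(infile, arrow::default_memory_pool(), &reader));

  std::shared_ptr<arrow::Table> table;
  PARQUET_THROW_NOT_OK(reader->ReadTable(&table));

  // Each column has to be fetched and cast to its concrete array type separately.
  // (Assumes a single chunk per column; with several row groups you would also
  // have to iterate over chunks.)
  auto i64array = std::static_pointer_cast<arrow::Int64Array>(table->column(0)->chunk(0));
  auto strarray = std::static_pointer_cast<arrow::StringArray>(table->column(1)->chunk(0));

  std::vector<MyStruct> rows;
  rows.reserve(table->num_rows());
  for (std::int64_t row = 0; row < i64array->length(); ++row)
  {
    rows.push_back({i64array->Value(row), strarray->GetString(row)});
  }
  return rows;
}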
  • Reminder: the compiler may insert padding between members, for example for alignment purposes. Therefore writing a binary mirror image of a data structure to a file is not recommended. Also, be aware of endianness. – Thomas Matthews Jun 23 '20 at 00:17
