Hey guys, I am using parquet_cpp's StreamWriter, but the output file contains no data. Not even the header seems to be written; the file is only 4 bytes long.

std::shared_ptr<::arrow::io::FileOutputStream> outfile_;
std::string outputFilePath_ = "/tmp/part.0.parquet";
PARQUET_ASSIGN_OR_THROW(
    outfile_,
    ::arrow::io::FileOutputStream::Open(outputFilePath_)
);
// build column names
parquet::schema::NodeVector columnNames_{};
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Time", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Value", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, columnNames_)
);
parquet::WriterProperties::Builder builder;
parquet::StreamWriter os_ = parquet::StreamWriter {parquet::ParquetFileWriter::Open(outfile_, schema, builder.build())};

// Start writing to os_, would be in a callback function
os_ << std::uint64_t{5} << std::uint64_t{59} << parquet::EndRow;

I seem to be missing something trivial for the column names and data to be written out, but I could not find anything online. Thank you.

michaelgbj
  • The code seems reasonable. Have you tried closing the ParquetFileWriter and the output stream explicitly? It's not clear how you do this in a callback. Rows are buffered in memory up to a certain threshold before a row group is flushed. You can also flush a row group explicitly with the StreamWriter (this shouldn't be done for every row, though). – Micah Kornfield Mar 16 '22 at 08:26
  • You are right. I am still troubleshooting this toward a complete solution. As far as I know, I needed parquet::EndRowGroup for the data to be written out, AND I needed StreamWriter's destructor to run so that it closes its ParquetFileWriter, which writes the Parquet footer into the file. I am getting a segfault in the destructor right now. – michaelgbj Mar 16 '22 at 15:59

1 Answer

Yeah. The row group must be flushed too. So all I needed to add was:

os_.EndRowGroup();

Although the data is now written out, the Parquet file's footer is corrupted and cannot be read. I posted a separate question HERE about the footer issue.

michaelgbj