I tried to have my program write out a stream of data in Parquet format via Apache Arrow's StreamWriter, but the output file does not have the metadata footer. When trying to read the Parquet file with Python pandas, I get the following error:
Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
This JIRA ticket in Arrow seems to provide the solution: the ParquetFileWriter within the StreamWriter must be closed to write the footer. (The ticket recommends calling Close() indirectly, via StreamWriter's destructor.) But I consistently get a segmentation fault from ParquetFileWriter::Close().
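If I understand the ticket correctly, the mechanism is plain RAII: the StreamWriter's destructor calls Close() on the ParquetFileWriter it owns, and Close() is what writes the footer. Here is a toy stand-alone sketch of that ownership pattern as I understand it (hypothetical FileWriter/Stream types, no Arrow involved):

```cpp
#include <memory>

// Toy stand-in for ParquetFileWriter: Close() is the footer-writing step.
struct FileWriter {
    bool footer_written = false;
    void Close() { footer_written = true; }
};

// Toy stand-in for StreamWriter: its destructor closes the owned writer,
// so the footer gets written when the stream goes out of scope.
struct Stream {
    std::shared_ptr<FileWriter> w;
    ~Stream() { if (w) w->Close(); }
};

bool footer_written_on_scope_exit() {
    auto fw = std::make_shared<FileWriter>();
    {
        Stream s{fw};
        // ... rows would be written here ...
    }   // s is destroyed here: Close() runs and the footer is written
    return fw->footer_written;
}
```

So my expectation was that simply letting the StreamWriter be destroyed would finish the file.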
Below is how I have the writers set up:
std::string outputFilePath_ = "/tmp/part.0.parquet";
std::shared_ptr<::arrow::io::FileOutputStream> outfile_;
PARQUET_ASSIGN_OR_THROW(
    outfile_,
    ::arrow::io::FileOutputStream::Open(outputFilePath_)
);

// build column names
parquet::schema::NodeVector columnNames_{};
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Time", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Value", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, columnNames_)
);

parquet::WriterProperties::Builder builder;
std::unique_ptr<parquet::ParquetFileWriter> fwriter =
    parquet::ParquetFileWriter::Open(outfile_, schema, builder.build());
parquet::StreamWriter os_{std::move(fwriter)};

// Start writing to os_; this would be in a callback function
os_ << std::uint64_t{5} << std::uint64_t{59};
os_.EndRow();
os_.EndRowGroup();
I have tried the following methods, but they all yield a seg fault:
os_.~StreamWriter();

or

fwriter->Close();
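One thing I suspect (my own diagnosis, not from the ticket): fwriter was moved into the StreamWriter, so after that point the unique_ptr is null and fwriter->Close() dereferences a null pointer; and calling os_.~StreamWriter() explicitly means the destructor runs a second time when os_ leaves scope. A minimal stand-alone sketch of the moved-from case (hypothetical Widget type, no Arrow involved):

```cpp
#include <memory>
#include <utility>

struct Widget {
    void Close() {}
};

// Mirrors StreamWriter{std::move(fwriter)}: after the move, the original
// unique_ptr is null, so calling fwriter->Close() would dereference nullptr.
bool moved_from_is_null() {
    auto fwriter = std::make_unique<Widget>();
    std::unique_ptr<Widget> sink = std::move(fwriter);  // ownership transferred
    return fwriter == nullptr && sink != nullptr;
}
```

Is there a safe way to close the writer so the footer actually gets written?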