I am trying to have my program write a stream of data in Parquet format via Apache Arrow's StreamWriter, but the output file does not have the metadata footer. When I try to read the Parquet file with Python pandas, I get the following error:

Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

This JIRA ticket in Arrow seems to provide the solution: it states that the ParquetFileWriter inside the StreamWriter must be closed in order to write the footer. The ticket recommends calling Close() indirectly, through StreamWriter's destructor. However, I consistently get a segmentation fault from ParquetFileWriter::Close().

Below is how I have the Writers setup:

std::shared_ptr<::arrow::io::FileOutputStream> outfile_;
std::string outputFilePath_ = "/tmp/part.0.parquet";
PARQUET_ASSIGN_OR_THROW(
    outfile_,
    ::arrow::io::FileOutputStream::Open(outputFilePath_)
);
// build column names
parquet::schema::NodeVector columnNames_{};
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Time", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
columnNames_.push_back(
    parquet::schema::PrimitiveNode::Make(
        "Value", parquet::Repetition::REQUIRED, parquet::Type::INT64, parquet::ConvertedType::UINT_64
    )
);
auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
    parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED, columnNames_)
);
parquet::WriterProperties::Builder builder;
std::unique_ptr<parquet::ParquetFileWriter> fwriter = parquet::ParquetFileWriter::Open(outfile_, schema, builder.build());
parquet::StreamWriter os_ = parquet::StreamWriter {std::move(fwriter)};

// Start writing to os_, would be in a callback function
os_ << std::uint64_t{5} << std::uint64_t{59};
os_.EndRow();
os_.EndRowGroup();

I have tried both of the following, and each one yields a segfault: calling the destructor explicitly with os_.~StreamWriter(), and calling fwriter->Close().

Olaf Kock
michaelgbj
  • Could this be caused by an uncaught exception on fwriter.Close()? – Micah Kornfield Mar 16 '22 at 19:12
  • should I be expecting an exception? – michaelgbj Mar 16 '22 at 19:46
  • Closing the filewriter is what actually writes the footer so it can throw an exception. I would expect fwriter.Close() to segfault since it is no longer valid after the move to construct the stream writer. Also, be careful with callback functions, the writer is not thread-safe as far as I remember. – Micah Kornfield Mar 17 '22 at 05:21
  • Even if I call os_.~StreamWriter(), I encounter the same problem. The thread-safety issue might be the source of this. I am still exploring a solution. – michaelgbj Mar 17 '22 at 14:55
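
The ownership point raised in the comments can be illustrated without Arrow at all. The sketch below is only a model: MockFileWriter and MockStreamWriter are hypothetical stand-ins (not Arrow classes) that assume the real writer behaves the same way, that is, the stream writer takes ownership of the file writer and the footer is written when the file writer is closed or destroyed. It shows that the original unique_ptr is null after std::move, and that the footer can be written exactly once by letting the owner go out of scope or by move-assigning an empty owner, rather than by calling the destructor by hand.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Parquet file writer: Close() records the
// footer once, and the destructor calls Close() if nobody did before.
class MockFileWriter {
 public:
  explicit MockFileWriter(std::vector<std::string>* log) : log_(log) {}
  ~MockFileWriter() { Close(); }
  void Close() {
    if (!closed_) {
      log_->push_back("footer");
      closed_ = true;
    }
  }

 private:
  std::vector<std::string>* log_;
  bool closed_ = false;
};

// Hypothetical stand-in for a stream writer that owns the file writer.
class MockStreamWriter {
 public:
  MockStreamWriter() = default;
  explicit MockStreamWriter(std::unique_ptr<MockFileWriter> w)
      : writer_(std::move(w)) {}
  MockStreamWriter(MockStreamWriter&&) = default;
  MockStreamWriter& operator=(MockStreamWriter&&) = default;

 private:
  std::unique_ptr<MockFileWriter> writer_;
};

// Returns true if fwriter is null after being moved into the stream writer.
// Calling fwriter->Close() at that point would dereference a null pointer.
bool MovedFromIsNull(std::vector<std::string>* log) {
  auto fwriter = std::make_unique<MockFileWriter>(log);
  MockStreamWriter os{std::move(fwriter)};
  return fwriter == nullptr;
}

// Option 1: limit the owner's lifetime with a scope.
std::vector<std::string> CloseByScope() {
  std::vector<std::string> log;
  {
    MockStreamWriter os{std::make_unique<MockFileWriter>(&log)};
    log.push_back("row");
  }  // os is destroyed here, which destroys the file writer and closes it
  return log;
}

// Option 2: for a long-lived member, move-assign an empty owner instead of
// invoking the destructor explicitly (which would destroy the object twice).
std::vector<std::string> CloseByReset() {
  std::vector<std::string> log;
  MockStreamWriter os{std::make_unique<MockFileWriter>(&log)};
  log.push_back("row");
  os = MockStreamWriter{};  // old file writer is released and closed here
  log.push_back("after");
  return log;
}
```

In the question's code the analogous moves would be to give os_ a limited scope, or to assign a fresh parquet::StreamWriter{} to it once writing is finished; the latter assumes the real StreamWriter is default-constructible and move-assignable, which is worth checking against the header of the Arrow version in use.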

1 Answer

Initially, I had the same problem as you. After carefully comparing your code (as well as mine) with the code in the JIRA ticket, the difference I found is that builder was a shared pointer. Changing the shared pointer to a plain parquet::WriterProperties::Builder builder; and then calling builder.build() should make things work.

Teddy van Jerry