I have a pipeline of processes, each handling a different stage. One stage reads a file and decompresses it into a buffer; the buffer contains an Arrow table serialized in the IPC file format. Another component takes this buffer and returns a table with the data.
The code works, but there is a significant performance issue: the current implementation has to write the data to a file on disk and then read it back with the Arrow file stream classes.
I've been searching the documentation for a way to do this without writing the data to a file on disk, but I could not make it work with a RecordBatchStreamReader or any of the alternatives I found in the docs.
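For reference, this is roughly the shape of my RecordBatchStreamReader attempt, reconstructed from memory; wrapping the bytes in an arrow::io::BufferReader is my guess at how to present an in-memory source, and try_stream_read is just a name for the sketch:

// Needs <arrow/buffer.h>, <arrow/io/memory.h> and <arrow/ipc/reader.h>.
// Wrap the decompressed bytes in an in-memory source and open a
// stream reader over it. On my buffer this never yields a usable reader.
auto try_stream_read(std::span<const char> buffer) {
    // Non-owning view; the span must outlive the reader.
    auto source = std::make_shared<::arrow::io::BufferReader>(
        std::make_shared<::arrow::Buffer>(
            reinterpret_cast<const uint8_t*>(buffer.data()),
            static_cast<int64_t>(buffer.size())));
    return ::arrow::ipc::RecordBatchStreamReader::Open(source);
}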
Any working example showing how to avoid this write? Is this even possible?
Here's the code in question:
#include <fstream>
#include <memory>
#include <span>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/ipc/reader.h>
#include <arrow/table.h>

auto read(std::span<const char> buffer) -> std::shared_ptr<arrow::Table> {
    // Wasting resources creating a temporary file...
    auto const output = "/tmp/random.arrow";
    auto stream = std::ofstream(output, std::ios::binary);
    stream.write(buffer.data(), static_cast<std::streamsize>(buffer.size()));
    stream.close();  // close() already flushes

    // Read the file back and convert it into an arrow::Table.
    auto const file_stream = ::arrow::io::ReadableFile::Open(output);
    if (!file_stream.ok()) {
        return nullptr;
    }
    auto const ipc_reader =
        ::arrow::ipc::RecordBatchFileReader::Open(file_stream.ValueOrDie());
    if (!ipc_reader.ok()) {
        return nullptr;
    }
    auto const reader = ipc_reader.ValueOrDie();
    auto const num_record_batches = reader->num_record_batches();
    auto batches = std::vector<std::shared_ptr<::arrow::RecordBatch>>(num_record_batches);
    for (auto i = 0; i < num_record_batches; ++i) {
        auto const batch = reader->ReadRecordBatch(i);
        if (!batch.ok()) {
            return nullptr;
        }
        batches[i] = batch.ValueOrDie();
    }
    auto const creation = ::arrow::Table::FromRecordBatches(batches);
    if (!creation.ok()) {
        return nullptr;
    }
    return creation.ValueOrDie();
}
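Since the on-disk path above already goes through RecordBatchFileReader, I would expect the in-memory analogue to look something like the following untested sketch; read_from_memory is my own name, and the arrow::Buffer wrapping is an assumption on my part, as is the premise that the buffer holds the IPC file format (magic bytes plus footer):

auto read_from_memory(std::span<const char> buffer) -> std::shared_ptr<arrow::Table> {
    // Needs <arrow/buffer.h> and <arrow/io/memory.h> in addition to the headers above.
    // Non-owning view over the decompressed bytes; the span must outlive the readers.
    auto wrapper = std::make_shared<::arrow::Buffer>(
        reinterpret_cast<const uint8_t*>(buffer.data()),
        static_cast<int64_t>(buffer.size()));
    auto source = std::make_shared<::arrow::io::BufferReader>(wrapper);

    // Same reading logic as above, just without the temporary file.
    auto const ipc_reader = ::arrow::ipc::RecordBatchFileReader::Open(source);
    if (!ipc_reader.ok()) {
        return nullptr;
    }
    auto const reader = ipc_reader.ValueOrDie();
    auto batches = std::vector<std::shared_ptr<::arrow::RecordBatch>>(reader->num_record_batches());
    for (auto i = 0; i < reader->num_record_batches(); ++i) {
        auto const batch = reader->ReadRecordBatch(i);
        if (!batch.ok()) {
            return nullptr;
        }
        batches[i] = batch.ValueOrDie();
    }
    auto const creation = ::arrow::Table::FromRecordBatches(batches);
    return creation.ok() ? creation.ValueOrDie() : nullptr;
}

Is something along these lines the intended way to do it, or is the IPC file format tied to an actual file on disk?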