
I have a pipeline of processes, each performing a different task. One stage reads a file and decompresses it into a buffer; that buffer contains an Arrow table serialized in the IPC format. Another component takes this buffer and returns a table with the data.

The code works, but there is a significant performance issue: the current implementation writes the data to a file on disk and then reads it back through the Arrow file-stream classes.

I've searched the documentation for a way to do this without writing the data to a file on disk, but I was not able to make it work with a RecordBatchStreamReader or any of the alternatives I found in the docs.

Any working example showing how to avoid this write? Is this even possible?

Here's the code in question:

auto read(std::span<const char> buffer) -> std::shared_ptr<arrow::Table> {

    // Wasting resources creating a temporary file...
    auto const output = "/tmp/random.arrow";
    auto stream       = std::ofstream(output, std::ios::binary | std::ios::out);
    stream.write(buffer.data(), buffer.size());
    stream.flush();
    stream.close();

    // Read the file back and convert it into an arrow::Table
    auto const file_stream = ::arrow::io::ReadableFile::Open(output);
    if (!file_stream.ok()) {
        return nullptr;
    }
    auto const ipc_reader = ::arrow::ipc::RecordBatchFileReader::Open(file_stream.ValueOrDie());
    if (!ipc_reader.ok()) {
        return nullptr;
    }

    auto const reader             = ipc_reader.ValueOrDie();
    auto const num_record_batches = reader->num_record_batches();
    auto batches                  = std::vector<std::shared_ptr<::arrow::RecordBatch>>(num_record_batches);
    for (auto i = 0; i < num_record_batches; ++i) {
        auto const batch = reader->ReadRecordBatch(i);
        if (!batch.ok()) {
            return nullptr;
        }
        batches[i] = batch.ValueOrDie();
    }

    auto const creation = ::arrow::Table::FromRecordBatches(batches);
    if (!creation.ok()) {
        return nullptr;
    }
    return creation.ValueOrDie();
}
mohabouje
  • 3,867
  • 2
  • 14
  • 28

1 Answer


The current API does not support creating the table without an intermediate buffer when using the IPC file format, but the step of writing the data to a file can be avoided:

auto read(std::span<const char> buffer) -> std::shared_ptr<arrow::Table> {

    // Wrap the span's memory in an arrow::Buffer without copying it.
    // The span must stay alive for as long as the reader uses the buffer.
    auto const arrow_buffer = std::make_shared<::arrow::Buffer>(
        reinterpret_cast<const uint8_t*>(buffer.data()), static_cast<int64_t>(buffer.size()));
    auto buffer_reader    = ::arrow::io::BufferReader(arrow_buffer);
    auto const ipc_reader = ::arrow::ipc::RecordBatchFileReader::Open(&buffer_reader);
    if (!ipc_reader.ok()) {
        return nullptr;
    }

    auto const reader             = ipc_reader.ValueOrDie();
    auto const num_record_batches = reader->num_record_batches();
    auto batches                  = std::vector<std::shared_ptr<::arrow::RecordBatch>>(num_record_batches);
    for (auto i = 0; i < num_record_batches; ++i) {
        auto const batch = reader->ReadRecordBatch(i);
        if (!batch.ok()) {
            return nullptr;
        }
        batches[i] = batch.ValueOrDie();
    }

    auto const creation = ::arrow::Table::FromRecordBatches(batches);
    if (!creation.ok()) {
        return nullptr;
    }
    return creation.ValueOrDie();
}

There is a newer API that should avoid the need to create a temporary buffer entirely. To follow the change, please see
