I am trying to read an Arrow file in Python that I wrote as a sequence of record batches. For some reason I am only getting the first struct entry. I have verified that the file contains more than one item and is the expected size.

import pyarrow as pa

with pa.OSFile(input_filepath, 'rb') as source:
    with pa.ipc.open_stream(source) as reader:
        for batch in reader:
            # only one batch ever arrives here
            my_struct_col = batch.column('col1')
            field1_values = my_struct_col.flatten()
            print(field1_values)

I am writing the file in Julia using:

using Arrow

struct OutputData
    name::String
    age::Int32
end

writer = open(filePath, "w")

data = OutputData("Alex", 20)

for _ = 1:1000
    t = (col1=[data],)
    table = Arrow.Table(Arrow.tobuffer(t))
    Arrow.write(writer, table)
end

close(writer)

I believe both languages are using the streaming IPC format to file.

BAR
1 Answer


The Arrow.jl package can write an Arrow file in multiple record batches, but it will only do so if the source you feed to Arrow.write supports the Tables.partitions interface, as described in the user manual. If you simply call Arrow.write repeatedly without using Tables.partitions, each call writes a complete Arrow payload (including the Footer) to the output file, one after another. When pyarrow reads the file back, it sees the Footer after the first table and stops reading, which is why only the first entry shows up.

The easiest way to create an object that supports Tables.partitions is to use Tables.partitioner, like so:

using Arrow
using Tables

struct OutputData
    name::String
    age::Int32
end

data = OutputData("Alex", 20)

open("partitioned.arrow", "w") do writer
    t = [(col1=[data],) for _ in 1:1000]
    table = Tables.partitioner(t)
    Arrow.write(writer, table)
end
PaSTE