I'm looking for a fast way to store and retrieve numpy arrays using pyarrow. I'm pretty satisfied with retrieval: it takes less than 1 second to extract columns from my .arrow file that contains 1,000,000,000 integers of dtype = np.uint16.
import pyarrow as pa
import numpy as np

def write(arr, name):
    # Each row of the 2D numpy array becomes one column of the record batch
    # (zero-copy for numeric dtypes).
    arrays = [pa.array(col) for col in arr]
    names = [str(i) for i in range(len(arrays))]
    batch = pa.RecordBatch.from_arrays(arrays, names=names)
    with pa.OSFile(name, 'wb') as sink:
        with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
            writer.write_batch(batch)

def read(name):
    # Memory-map the file so column access stays zero-copy.
    source = pa.memory_map(name, 'r')
    table = pa.ipc.RecordBatchStreamReader(source).read_all()
    for i in range(table.num_columns):
        yield table.column(str(i)).to_numpy()
arr = np.random.randint(65535, size=(250, 4000000), dtype=np.uint16)
%%timeit -r 1 -n 1
write(arr, 'test.arrow')
>>> 25.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -r 1 -n 1
for n in read('test.arrow'): n
>>> 901 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Can the efficiency of writing to the .arrow format be improved? In addition, I tested np.save:
%%timeit -r 1 -n 1
np.save('test.npy', arr)
>>> 18.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
It looks a little bit faster. Can writing to the .arrow format with Apache Arrow be optimised further?
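One variant I could also try, though I have not benchmarked it and it changes the on-disk layout, is to serialize the whole 2D array as a single Arrow Tensor message instead of 250 separate columns (the helper names write_tensor and read_tensor below are just for illustration):

import pyarrow as pa

def write_tensor(arr, name):
    # Wrap the ndarray as an Arrow Tensor (zero-copy) and write it
    # as a single IPC tensor message rather than a record-batch stream.
    tensor = pa.Tensor.from_numpy(arr)
    with pa.OSFile(name, 'wb') as sink:
        pa.ipc.write_tensor(tensor, sink)

def read_tensor(name):
    # Memory-map the file and reconstruct the full 2D array.
    source = pa.memory_map(name, 'r')
    return pa.ipc.read_tensor(source).to_numpy()

A file written this way would no longer be readable with read() above, since it contains a tensor message rather than record batches, so this only makes sense if I always want the full 2D array back rather than individual columns.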