I am using the PyArrow library for optimal storage of Pandas DataFrames. I need to process a PyArrow Table row by row as fast as possible without converting it to a pandas DataFrame (it won't fit in memory). Pandas has iterrows()/itertuples() methods. Is there any fast way to iterate over a PyArrow Table other than a for-loop with index addressing?

Alexandr Proskurin

3 Answers


This code worked for me:

for batch in table.to_batches():
    d = batch.to_pydict()
    for c1, c2, c3 in zip(d['c1'], d['c2'], d['c3']):
        # Do something with the row values c1, c2, c3
        pass
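
For a self-contained illustration, here is a minimal sketch of the same pattern, assuming a small in-memory table with columns c1, c2, c3 (the table and column names are placeholders, not from the original question):

import pyarrow as pa

# Hypothetical example table; in practice it would come from e.g. pq.read_table()
table = pa.table({
    'c1': [1, 2, 3],
    'c2': ['a', 'b', 'c'],
    'c3': [0.1, 0.2, 0.3],
})

for batch in table.to_batches():
    d = batch.to_pydict()  # dict of column name -> list of Python values
    for c1, c2, c3 in zip(d['c1'], d['c2'], d['c3']):
        print(c1, c2, c3)  # each iteration sees one logical row

Only one batch's worth of values is converted to Python objects at a time, which keeps memory bounded even when the whole table would not fit as a pandas DataFrame.
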
Bolo

If you have a large Parquet dataset split into multiple files, this seems reasonably fast and memory-efficient.

import argparse
import pyarrow.parquet as pq
from glob import glob


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('parquet_dir')
    return parser.parse_args()


def iter_parquet(dirpath):
    for fpath in glob(f'{dirpath}/*.parquet'):
        parquet_file = pq.ParquetFile(fpath)

        # Read one row group at a time so only a slice of each file is in memory.
        for group_i in range(parquet_file.num_row_groups):
            row_group = parquet_file.read_row_group(group_i)

            for batch in row_group.to_batches():
                # Each row is a tuple of pyarrow scalars; call .as_py() on an
                # element if a plain Python value is needed.
                for row in zip(*batch.columns):
                    yield row


if __name__ == '__main__':
    args = parse_args()

    total_count = 0
    for row in iter_parquet(args.parquet_dir):
        total_count += 1
    print(total_count)
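
On recent PyArrow versions, a similar effect can be had with ParquetFile.iter_batches(), which streams record batches without iterating row groups by hand. A minimal sketch, assuming pyarrow >= 3.0 where that method is available (the batch_size value is arbitrary):

import pyarrow.parquet as pq
from glob import glob


def iter_parquet_batches(dirpath, batch_size=65536):
    for fpath in glob(f'{dirpath}/*.parquet'):
        parquet_file = pq.ParquetFile(fpath)
        # iter_batches() yields RecordBatch objects of roughly batch_size rows,
        # reading each file incrementally rather than all at once.
        for batch in parquet_file.iter_batches(batch_size=batch_size):
            for row in zip(*batch.columns):
                yield row
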
kappamaki

The software is not optimized at all for this use case at the moment. I would recommend using Cython or C++ to interact with the data row by row. If you have further questions, please reach out on the developer mailing list dev@arrow.apache.org.
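
For reference, the pure-Python index-addressing loop mentioned in the question looks like the sketch below (the table and column names are placeholders). Every element access crosses the Arrow/Python boundary and builds a Python object, which is part of why it is slow on large tables:

import pyarrow as pa

# Hypothetical table; in practice it would be read from disk.
table = pa.table({'c1': [1, 2, 3], 'c2': ['a', 'b', 'c']})

c1, c2 = table.column('c1'), table.column('c2')
for i in range(table.num_rows):
    # Indexing returns a pyarrow scalar; .as_py() converts it to a Python object.
    row = (c1[i].as_py(), c2[i].as_py())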

Wes McKinney