
I'm using the following code to read Parquet files from S3. Next, I want to iterate over the resulting DataFrame in chunks. How can I achieve this?

import s3fs
import fastparquet as fp

s3 = s3fs.S3FileSystem()

bucket, path = 'mybucket', 'mypath'
root_dir_path = f'{bucket}/{path}'
s3_path = f"{root_dir_path}/*.parquet"

# collect all parquet file paths under the prefix
all_paths_from_s3 = s3.glob(path=s3_path)

fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=s3.open, root=root_dir_path)
# reads everything into a single DataFrame
df = fp_obj.to_pandas()

One approach would be using generators:

def chunks(df, chunksize):
    for i in range(0, len(df), chunksize):
        yield df[i:i + chunksize]

for chunk in chunks(df, 1000):
    # dummy code to transform & operate on chunk
    print(len(chunk))
    # dummy code ends

What's a more space- and time-efficient approach to this?


1 Answer


Using pyarrow datasets might be better in terms of memory efficiency.

Something like this should work and only load the batches incrementally:

from pyarrow.dataset import dataset

# point the dataset at the same S3 prefix; files are discovered here, but no data is loaded yet
ds = dataset(f"s3://{root_dir_path}", format="parquet")

# to_batches() streams record batches one at a time instead of materialising the whole table
batches = ds.to_batches()
for batch in batches:
    df = batch.to_pandas()
    transform(df)
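
If you need chunks of a specific row count, to_batches also accepts a batch_size keyword in recent pyarrow versions (an assumption worth checking against your installed version) that caps the number of rows per batch:

batches = ds.to_batches(batch_size=1000)  # at most 1000 rows per batch; batches can be smaller at file boundaries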

If fastparquet is a requirement, using iter_row_groups should be more memory-efficient, but row groups can be large, so you might still want to process each one incrementally.
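
As a minimal sketch reusing fp_obj from the question (transform is just a placeholder for your own per-chunk logic):

# each iteration yields one row group as a pandas DataFrame,
# so only that row group is held in memory at a time
for df_chunk in fp_obj.iter_row_groups():
    transform(df_chunk)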

In both cases, the data is loaded incrementally instead of being read into one large DataFrame and then iterated over.

  • I had a typo in the code, which should now be fixed. Please see [the docs](https://arrow.apache.org/docs/python/dataset.html#reading-from-cloud-storage) for additional context on reading from S3. – Micah Kornfield Sep 12 '21 at 18:43