
I'm using the following code to read Parquet files from S3. Next, I want to iterate over the resulting DataFrame in chunks. How can I achieve this?

import s3fs
import fastparquet as fp

s3 = s3fs.S3FileSystem()

bucket, path = 'mybucket', 'mypath'
root_dir_path = f'{bucket}/{path}'
s3_path = f"{root_dir_path}/*.parquet"

# collect all parquet file paths under the prefix
all_paths_from_s3 = s3.glob(path=s3_path)

fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=s3.open, root=root_dir_path)
# reads everything into a single DataFrame
df = fp_obj.to_pandas()

One approach would be using generators:

def chunks(df, chunksize):
    for i in range(0, len(df), chunksize):
        yield df[i:i + chunksize]

for chunk in chunks(df, 1000):
    # dummy code to transform & operate on chunk
    print(len(chunk))
    # dummy code ends

What's a more space- and time-efficient approach to this?


1 Answer


Using pyarrow datasets might be better in terms of memory efficiency.

Something like this should work and only load the batches incrementally:

from pyarrow.dataset import dataset

# point the dataset at the same S3 prefix; files are discovered here, but no data is loaded yet
ds = dataset(f"s3://{root_dir_path}", format="parquet")

# to_batches() streams record batches one at a time instead of materialising the whole table
batches = ds.to_batches()
for batch in batches:
    df = batch.to_pandas()
    transform(df)
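
If you need chunks of a specific row count, to_batches also accepts a batch_size keyword in recent pyarrow versions (an assumption worth checking against your installed version) that caps the number of rows per batch:

batches = ds.to_batches(batch_size=1000)  # at most 1000 rows per batch; batches can be smaller at file boundaries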

If fastparquet is a requirement, using iter_row_groups should be more memory-efficient, but row groups can be large, so you might still want to process each one incrementally.
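
As a minimal sketch reusing fp_obj from the question (transform is just a placeholder for your own per-chunk logic):

# each iteration yields one row group as a pandas DataFrame,
# so only that row group is held in memory at a time
for df_chunk in fp_obj.iter_row_groups():
    transform(df_chunk)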

In both cases, the data is loaded incrementally instead of being read into one large DataFrame and then iterated over.

  • I had a typo in the code, which should now be fixed. Please see [the docs](https://arrow.apache.org/docs/python/dataset.html#reading-from-cloud-storage) for additional context on reading from S3. – Micah Kornfield Sep 12 '21 at 18:43