I'm using the following code to read parquet files from S3 into a single pandas DataFrame. Next, I want to iterate over it in chunks. How can I achieve that?
import s3fs
import fastparquet as fp

# One S3 filesystem handle is enough for both globbing and opening files
s3 = s3fs.S3FileSystem()

bucket, path = 'mybucket', 'mypath'
root_dir_path = f'{bucket}/{path}'
s3_path = f'{root_dir_path}/*.parquet'

# Collect every parquet file under the prefix
all_paths_from_s3 = s3.glob(path=s3_path)

# Build one ParquetFile over all the files, then load everything into a single DataFrame
fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=s3.open, root=root_dir_path)
df = fp_obj.to_pandas()
One approach would be using generators:
def chunks(df, chunksize):
    # Yield successive positional row slices of the DataFrame
    for i in range(0, len(df), chunksize):
        yield df.iloc[i:i + chunksize]
for chunk in chunks(df, 1000):
    # dummy code to transform & operate on chunk
    print(len(chunk))
    # dummy code ends
Slicing df this way still requires the whole dataset to be loaded into memory first. What's a more space- and time-efficient approach to this?
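One direction that looks promising (a minimal sketch, assuming the row groups in my files are small enough to process one at a time) is to skip to_pandas() entirely and stream the data with fastparquet's ParquetFile.iter_row_groups(), which yields one pandas DataFrame per row group:

import s3fs
import fastparquet as fp

s3 = s3fs.S3FileSystem()
bucket, path = 'mybucket', 'mypath'
root_dir_path = f'{bucket}/{path}'
all_paths_from_s3 = s3.glob(path=f'{root_dir_path}/*.parquet')

fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=s3.open, root=root_dir_path)

# Stream one row group at a time instead of materializing the full DataFrame
for chunk in fp_obj.iter_row_groups():
    # dummy code to transform & operate on chunk
    print(len(chunk))

The trade-off is that the chunk size is fixed by the row-group size the files were written with rather than chosen at read time. Is this the recommended pattern, or is there a better way?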