
I have 20 parquet files, each about 5 GB in size. I want to count the number of records in the whole dataset.

I have the current code:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)

import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'], chunksize="1000MB")
df.count().compute()

But the code hangs and throws out-of-memory errors. My machine has 16 cores and 64 GB of RAM.

EDIT:

As requested, I removed the chunksize argument, but the computation still hangs. Even the diagnostics page stops loading. I do not get out-of-memory errors, but I don't know what's happening.

# Output hangs
df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'])
Nihhaar

1 Answer


I recommend removing the chunksize argument. By specifying chunksize you're asking Dask to aggregate many row groups into single tasks, which might overwhelm your memory.
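
As a minimal sketch of that suggestion, here is the setup from the question with chunksize removed, using len(df) to get the total row count (the cluster sizing is just the question's original configuration, not a recommendation):

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# 8 single-threaded workers on a 64 GB machine, as in the question
cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)

# Without chunksize, each row group stays a separate, small task
df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'])

# len() sums per-partition lengths on the workers, so only counts
# (not column data) are sent back to the client
n_rows = len(df)
print(n_rows)

If only the total count is needed, a separate option (not part of the recommendation above) is to read it straight from the Parquet footer metadata, e.g. pyarrow.parquet.ParquetFile(path).metadata.num_rows for each file, which avoids scanning any column data at all.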

MRocklin