
I have 20 parquet files, each about 5 GB in size. I want to count the number of records in the whole dataset.

I have the current code:

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)

import dask.dataframe as dd

df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'], chunksize="1000MB")
df.count().compute()

But the code hangs and throws out-of-memory errors. My machine has 16 cores and 64 GB of RAM.

EDIT:

As requested, I removed the chunksize argument, but the computation still hangs. Even the diagnostics page stops loading. I do not get out-of-memory errors, but I don't know what's happening.

# Output hangs
df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'])
Nihhaar

1 Answer


I recommend removing the chunksize argument. By specifying chunksize you're asking Dask to aggregate many row groups into single tasks, which might overwhelm your memory.
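
As a minimal sketch of that suggestion, here is the setup from the question with chunksize removed, using len(df) to get the total row count (the cluster sizing is just the question's original configuration, not a recommendation):

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# 8 single-threaded workers on a 64 GB machine, as in the question
cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)

# Without chunksize, each row group stays a separate, small task
df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'])

# len() sums per-partition lengths on the workers, so only counts
# (not column data) are sent back to the client
n_rows = len(df)
print(n_rows)

If only the total count is needed, a separate option (not part of the recommendation above) is to read it straight from the Parquet footer metadata, e.g. pyarrow.parquet.ParquetFile(path).metadata.num_rows for each file, which avoids scanning any column data at all.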

MRocklin