I have 20 Parquet files, each about 5 GB in size, and I want to count the number of records in the whole dataset.
This is the code I currently have:
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd

cluster = LocalCluster(n_workers=8, threads_per_worker=1)
client = Client(cluster)

df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'], chunksize="1000MB")
df.count().compute()
But the code hangs and throws out-of-memory errors. My machine has 16 cores and 64 GB of RAM.
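One thing I was considering, but have not tried yet, is capping how much memory each worker may use so that the 8 workers cannot blow past the 64 GB on the box. This is just a sketch of the idea; the 7GB figure is my own guess at a sensible per-worker cap, not something taken from the docs:

from dask.distributed import Client, LocalCluster

# Untested idea: limit each of the 8 single-threaded workers to ~7 GB
# so the cluster stays under the machine's 64 GB of RAM
cluster = LocalCluster(n_workers=8, threads_per_worker=1, memory_limit="7GB")
client = Client(cluster)

Would that be the right knob here, or is the problem elsewhere?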
EDIT:
As requested, I removed the chunksize argument, but the computation still hangs. Even the diagnostics page stops loading. I no longer get out-of-memory errors, but I don't know what's happening.
# Output hangs
df = dd.read_parquet("s3://bucket/2020_03_31/*.parquet", columns=['id'])
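For what it's worth, all I ultimately need is a single total row count (I realise df.count() gives per-column non-null counts, which is more than necessary). As a fallback I was thinking of reading the count straight from the Parquet footers without loading any column data. This is only a rough sketch of that idea; it assumes pyarrow and s3fs are installed, and the bucket path is the same placeholder as above:

import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()

# Sum the row counts stored in each file's Parquet footer metadata,
# without reading any column data
total_rows = 0
for path in fs.glob("bucket/2020_03_31/*.parquet"):
    with fs.open(path, "rb") as f:
        total_rows += pq.ParquetFile(f).metadata.num_rows

print(total_rows)

Is that a reasonable workaround, or should the Dask version work fine on a machine like this?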