I read my parquet data as follows:

import glob
import dask.dataframe as dd
import fastparquet as fp
from dask import delayed

file_names = glob.glob('./events/*/*/*/*/*/part*.parquet')
pf = fp.ParquetFile(file_names, root='./events')
pf.cats = {'customer': pf.cats['customer']}  # keep only the 'customer' categorical
dfs = (delayed(pf.read_row_group_file)(rg, pf.columns, pf.cats) for rg in pf.row_groups)  # one delayed read per row group
df = dd.from_delayed(dfs)

I can't use dd.read_parquet because my parquet data is partitioned and I want to avoid loading some of the categorical columns.

I have two questions here:

  • How can I tell Dask what number of partitions I want my dataframe to have?

  • How many partitions will Dask create by default?

j-bennet
1 Answer


First, I suspect that the dd.read_parquet function works fine with partitioned or multi-file parquet datasets.
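As a rough sketch (assuming a dask version that supports the fastparquet engine and its categories keyword, and reusing the ./events layout and 'customer' column from the question), reading the partitioned dataset directly while restricting the categoricals might look like this:

import dask.dataframe as dd

# Point read_parquet at the dataset root; 'categories' tells the
# fastparquet engine which columns to decode as categoricals.
df = dd.read_parquet('./events', engine='fastparquet',
                     categories=['customer'])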

Second, if you are using dd.from_delayed, then each delayed call results in one partition, so in this case you have as many partitions as there are elements in the dfs iterator. If you wish to change this, you can call the repartition method afterwards.
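For example (a sketch continuing from the df built in the question; the target of 20 partitions is just an illustrative number):

df = dd.from_delayed(dfs)   # one partition per delayed object, i.e. per row group
print(df.npartitions)       # equals the number of row groups read

# Repartition afterwards to get a different number of partitions.
df = df.repartition(npartitions=20)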

MRocklin