I read my parquet data as follows:
import glob
import fastparquet as fp
import dask.dataframe as dd
from dask import delayed

file_names = glob.glob('./events/*/*/*/*/*/part*.parquet')
pf = fp.ParquetFile(file_names, root='./events')
pf.cats = {'customer': pf.cats['customer']}  # keep only the 'customer' categorical
dfs = (delayed(pf.read_row_group_file)(rg, pf.columns, pf.cats) for rg in pf.row_groups)
df = dd.from_delayed(dfs)
I can't use dd.read_parquet because my data is partitioned and I want to avoid loading some of the categorical columns.
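For reference, this is roughly the call I am avoiding; as far as I can tell, with my layout it would read the whole partitioned dataset and turn every partition level into a categorical instead of just 'customer':

import dask.dataframe as dd

# Roughly the read_parquet call I'm avoiding: as far as I can tell it would
# load all partition columns as categoricals rather than only 'customer'.
df = dd.read_parquet('./events')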
I have two questions here:
How can I tell Dask how many partitions I want my dataframe to have?
How many partitions will Dask create by default?
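To make the first question concrete, the sketch below is the kind of control I am after; npartitions=20 is just a placeholder, and I don't know whether repartitioning after the fact is the intended approach or whether the count can be set when the dataframe is built:

# The partition count currently just falls out of the construction above.
print(df.npartitions)

# Is repartitioning afterwards the intended way to choose the count,
# or can it be specified up front?
df = df.repartition(npartitions=20)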