I am trying to run a basic ETL workflow on large files with dask-cudf across a large number of workers.
Problem:
Initially the scheduler assigns an equal number of partitions to be read on each worker, but during pre-processing it tends to redistribute/shuffle them across workers. The minimum number of partitions a worker ends up with is 4 and the maximum is 19 (total partitions = approx. 300, num_workers = 22). This behavior causes problems downstream, as I want an equal distribution of partitions across workers (a sketch of how the per-worker counts can be checked is at the end of this post).
Is there a way to prevent this behavior?
I thought the settings below would help with that, but they do not.
import dask

# limit work-stealing as much as possible
dask.config.set({'distributed.scheduler.work-stealing': False})
dask.config.set({'distributed.scheduler.bandwidth': 1})
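As a minimal sanity check (nothing specific to my setup), the values can be read back with dask.config.get to confirm the overrides are in place:

# read the overrides back; both should reflect the values set above
print(dask.config.get('distributed.scheduler.work-stealing'))  # False
print(dask.config.get('distributed.scheduler.bandwidth'))      # 1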
Workflow being done:
- read
- fill-na
- down-casting/other logic
import numpy as np
import dask_cudf

# read: big_files, names, read_dtype_ls and chunksize are defined elsewhere
df = dask_cudf.read_csv(path=big_files,
                        names=names,
                        delimiter='\t',
                        dtype=read_dtype_ls,
                        chunksize=chunksize)
# fill-na: replace missing values with -1 in every partition
df = df.map_partitions(lambda df: df.fillna(-1))
def transform_col_int64_to_int32(df, columns):
    """
    This function casts int64 columns to int32.
    We are using this to transform int64s to int32s, and overflows seem to be consistent.
    """
    for col in columns:
        df[col] = df[col].astype(np.int32)
    return df
# down-casting: apply the cast to the listed columns in every partition
df = df.map_partitions(transform_col_int64_to_int32, cat_col_names)
df = df.persist()
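For reference, this is roughly how the per-worker partition counts quoted above can be checked once df is persisted. It is only a sketch: it assumes a dask.distributed Client object named client (not shown above), and it counts the first worker listed for each partition.

from collections import Counter
from dask.distributed import futures_of, wait

# block until the persisted partitions are in memory, then ask the scheduler
# which worker holds each partition of df
wait(df)
key_to_workers = client.who_has(futures_of(df))
partitions_per_worker = Counter(
    workers[0] for workers in key_to_workers.values() if workers
)
print(partitions_per_worker)  # this is where the 4 vs. 19 spread shows up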