
I am trying to do a basic ETL workflow on large files with dask-cudf across a large number of workers.

Problem:

Initially the scheduler assigns an equal number of partitions to be read on each worker, but during pre-processing it tends to redistribute/shuffle them across workers.

The minimum number of partitions that a worker gets is 4 and the maximum is 19 (total partitions ≈ 300, num_workers = 22). This behavior causes problems downstream, as I want an equal distribution of partitions across workers.

Is there a way to prevent this behavior?

I thought the settings below would help with that, but they do not.

import dask

# limit work-stealing as much as possible
dask.config.set({'distributed.scheduler.work-stealing': False})
dask.config.set({'distributed.scheduler.bandwidth': 1})

Workflow being done:

  • read
  • fill-na
  • down-casting/other logic

import numpy as np
import dask_cudf

df = dask_cudf.read_csv(path=big_files,
                        names=names,
                        delimiter='\t',
                        dtype=read_dtype_ls,
                        chunksize=chunksize)


df = df.map_partitions(lambda df: df.fillna(-1))

def transform_col_int64_to_int32(df, columns):
    """
    Cast the given int64 columns to int32.
    We use this to downcast int64s to int32s; overflows seem to be consistent.
    """
    for col in columns:
        df[col] = df[col].astype(np.int32)
    return df

df = df.map_partitions(transform_col_int64_to_int32, cat_col_names)
df = df.persist()
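
For reference, a rough way to check the skew described above after the persist; this sketch assumes `client` is the dask.distributed.Client connected to this cluster (not shown in the snippet above):

from dask.distributed import wait

# wait for the persisted partitions to finish computing, then count how many
# keys (roughly, one per partition) each worker currently holds
wait(df)
per_worker = {worker: len(keys) for worker, keys in client.has_what().items()}
print(per_worker)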

Vibhu Jawa

1 Answer


Dask schedules where tasks run based on a number of factors, including data dependencies, runtime, memory use, and so on. Typically the answer to these questions is "just let it do its thing". The most common cause of poor scheduling is having too few chunks.
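
If too few chunks turns out to be the issue, one way to get more of them is to use a smaller chunksize in read_csv, or to repartition after reading. A rough sketch, assuming df is the dask-cudf dataframe from the question:

# split the ~300 partitions into smaller pieces so the scheduler has
# finer-grained work to spread across the 22 workers
df = df.repartition(npartitions=df.npartitions * 4)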

However, if you explicitly need a more balanced distribution, then you can try the Client.rebalance method.

from dask.distributed import wait

wait(df)
client.rebalance(df)

However beware that rebalance is not as robust as other Dask operations. It's best to do it at a time when there isn't a ton of other work going on (hence the call to dask.distributed.wait above).

Also, I would turn on work stealing. Work stealing is another name for load balancing.
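
In this case that just means reverting the setting from the question, e.g.:

# re-enable work stealing so the scheduler can move queued tasks from
# busy workers to idle ones
dask.config.set({'distributed.scheduler.work-stealing': True})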

MRocklin
  • I tried `rebalancing` but it has two problems: 1. It takes quite a bit of time (30s+) to rebalance and it still does not end up with an equal number of partitions. 2. If I call rebalance twice to have a better chance at balancing, it gives me an error. Will raise a GitHub issue around this soon. – Vibhu Jawa Oct 04 '19 at 18:51
  • Well then, the next question would be to ask why those tasks finished in that order. Perhaps there are some heavy dependencies to some of the tasks that encourage them to stay on one machine? Perhaps you just have few tasks relative to the number of computing threads that you have, and so random chance is your enemy? In general, Dask doesn't try to do things optimally. It strives to make good enough decisions quickly. – MRocklin Oct 04 '19 at 19:05