Dask dataframe: merging two dataframes, imputing missing values and writing to CSV uses only part of each CPU (~20%)

Question

I want to merge two Dask dataframes, impute missing values with the column median, and export the merged dataframe to CSV files. The problem is that my current code does not fully utilize my 8 CPUs (each one runs at only ~20%).

I am not sure which part limits the CPU usage. Here is reproducible code:

import numpy as np
import pandas as pd 
df1 = pd.DataFrame(
    np.c_[(np.random.randint(100, size=(10000, 1)), np.random.randn(10000, 3))],
    columns=['id', 'a', 'b', 'c'])
df2 = pd.DataFrame(
    np.c_[(np.array(range(100)), np.random.randn(100, 10000))],
    columns=['id'] + ['d_' + str(i) for i in range(10000)])
df1.id=df1.id.astype(int).astype(object)
df2.id=df2.id.astype(int).astype(object)

## randomly mask ~5% of the cells in df2 as missing
df2.iloc[:, 1:] = df2.iloc[:,1:].mask(np.random.random(df2.iloc[:, 1:].shape) < .05)

## Dask code starts here
import dask.dataframe as dd
from dask.distributed import Client
ddf1 = dd.from_pandas(df1, npartitions=3)
ddf2 = dd.from_pandas(df2, npartitions=3)
ddf = ddf1.merge(ddf2, how='left', on='id')
## quantile() defaults to q=0.5, i.e. the column median
ddf = ddf.fillna(ddf.quantile())
ddf.to_csv('train_*.csv', index=None, header=None)

Although all 8 CPUs are in use, each one runs at only ~20%. How can I change the code to improve CPU usage?

score 1 · Answer 1 · answered Apr 09 '19 at 19:33

Firstly, note that unless you specify otherwise, Dask will use threads for execution. With threads, only one Python operation can run at a time (the "GIL"), except for lower-level code that explicitly releases the lock. The "merge" operation involves a lot of shuffling of data in memory, and I suspect it releases the lock only some of the time.

Secondly, all of the output is being written to the filesystem, so you will always have a bottleneck here: however fast other processing may be, you still need to feed all of it through the storage bus.

If the CPUs are working ~20%, I daresay this is still faster than a single-core version? Put simply, some workloads just parallelise better than others.
