Here is a summary of what I'm doing:
At first, I did this with plain multiprocessing and the pandas package:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
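(In case the folder contains anything besides the CSV files, a filtered variant of this step could look like the sketch below; DATA_PATH and product are the same placeholders as above, and the ".csv" filter is my assumption about the folder contents.)
import glob
import os

# Keep only .csv files, in case the folder also holds other entries
# (assumption: all files of interest end in ".csv").
files = [os.path.basename(p)
         for p in glob.glob(os.path.join(DATA_PATH + product, "*.csv"))]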
Step 2. Loop over the list
from multiprocessing import Pool
import pandas as pd
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv("another folder/" + file)

if __name__ == '__main__':
    cl = Pool(4)
    cl.map(readAndWriteCsvFiles, files, chunksize=1)
    cl.close()
    cl.join()
The code works fine, but it is very slow: it takes about 1,000 seconds to complete the task, compared with an R program using library(parallel) and the parSapply function, which takes only about 160 seconds.
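For reference, a comparable wall-clock measurement on the Python side can be taken like this (a minimal sketch; time.perf_counter is my choice here and not necessarily how either figure above was originally measured):
import time
from multiprocessing import Pool

if __name__ == '__main__':
    start = time.perf_counter()
    cl = Pool(4)
    cl.map(readAndWriteCsvFiles, files, chunksize=1)
    cl.close()
    cl.join()
    print("Elapsed: %.1f s" % (time.perf_counter() - start))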
So I then tried dask.delayed and dask.dataframe with the following code:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
Step 2. Loop over the list
from dask.delayed import delayed
import dask.dataframe as dd
from dask import compute
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = dd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False, assume_missing=True)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv(filename="another folder/*", name_function=lambda x: file)

compute([delayed(readAndWriteCsvFiles)(file) for file in files])
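For completeness, the compute call can also be pointed at a process-based scheduler explicitly, mirroring the Pool(4) setup from the pandas version (the scheduler and num_workers settings below are my assumptions and are not tuned):
from dask import compute, delayed

# Build one delayed task per file; nothing executes until compute() is called.
tasks = [delayed(readAndWriteCsvFiles)(file) for file in files]

# Mirror Pool(4) with four worker processes (assumption, not tuned).
compute(tasks, scheduler="processes", num_workers=4)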
This time, I found that if I commented out step 2.3 in both the dask and the pandas code, dask ran much faster than plain pandas with multiprocessing.
But as soon as I call the to_csv method, dask is as slow as pandas.
Any solution?
Thanks