Here is a summary of what I'm doing:
At first, I did this with plain multiprocessing and the pandas package:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
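(In case the folder contains anything besides the CSV files, a filtered variant of this step could look like the sketch below; DATA_PATH and product are the same placeholders as above, and the ".csv" filter is my assumption about the folder contents.)
import glob
import os

# Keep only .csv files, in case the folder also holds other entries
# (assumption: all files of interest end in ".csv").
files = [os.path.basename(p)
         for p in glob.glob(os.path.join(DATA_PATH + product, "*.csv"))]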
Step 2. Loop over the list
from multiprocessing import Pool
import pandas as pd
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = pd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv("another folder/" + file)

if __name__ == '__main__':
    cl = Pool(4)
    cl.map(readAndWriteCsvFiles, files, chunksize=1)
    cl.close()
    cl.join()
The code works fine, but it is very slow: it takes about 1,000 seconds to complete the task, compared with an R program using library(parallel) and the parSapply function, which takes only about 160 seconds.
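For reference, a comparable wall-clock measurement on the Python side can be taken like this (a minimal sketch; time.perf_counter is my choice here and not necessarily how either figure above was originally measured):
import time
from multiprocessing import Pool

if __name__ == '__main__':
    start = time.perf_counter()
    cl = Pool(4)
    cl.map(readAndWriteCsvFiles, files, chunksize=1)
    cl.close()
    cl.join()
    print("Elapsed: %.1f s" % (time.perf_counter() - start))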
So I then tried dask.delayed and dask.dataframe with the following code:
Step 1. Get the list of file names I'm going to read
import os
files = os.listdir(DATA_PATH + product)
Step 2. Loop over the list
from dask.delayed import delayed
import dask.dataframe as dd
from dask import compute
def readAndWriteCsvFiles(file):
    ### Step 2.1 read csv file into dataframe
    data = dd.read_csv(DATA_PATH + product + "/" + file, parse_dates=True, infer_datetime_format=False, assume_missing=True)
    ### Step 2.2 do some calculation
    ### .......
    ### Step 2.3 write the dataframe to csv to another folder
    data.to_csv(filename="another folder/*", name_function=lambda x: file)

compute([delayed(readAndWriteCsvFiles)(file) for file in files])
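For completeness, the compute call can also be pointed at a process-based scheduler explicitly, mirroring the Pool(4) setup from the pandas version (the scheduler and num_workers settings below are my assumptions and are not tuned):
from dask import compute, delayed

# Build one delayed task per file; nothing executes until compute() is called.
tasks = [delayed(readAndWriteCsvFiles)(file) for file in files]

# Mirror Pool(4) with four worker processes (assumption, not tuned).
compute(tasks, scheduler="processes", num_workers=4)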
This time, I found that if I commented out step 2.3 in both the dask and the pandas code, dask ran much faster than plain pandas with multiprocessing.
But as soon as I call the to_csv method, dask is as slow as pandas.
Any solution?
Thanks