read process and concatenate pandas dataframe in parallel with dask

Question

I'm trying to read and process in parallel a list of csv files and concatenate the output in a single pandas dataframe for further processing.

My workflow consist of 3 steps:

create a series of pandas dataframe by reading a list of csv files (all with the same structure)

def loadcsv(filename): df = pd.read_csv(filename) return df
for each dataframe create a new column by processing 2 existing columns

def makegeom(a,b): return 'Point(%s %s)' % (a,b)

def applygeom(df): df['Geom']= df.apply(lambda row: makegeom(row['Easting'], row['Northing']), axis=1) return df
concatenate all the dataframes in a single dataframe

frames = [] for i in csvtest: df = applygeom(loadcsv(i)) frames.append(df) mergedresult1 = pd.concat(frames)

In my workflow I use pandas (each csv (15) file has more than >> 2*10^6 data points) so it takes a while to complete. I think this kind of workflow should take advantage of some parallel processing (at least for the read_csv and apply steps) so I gave a try to dask, but I was not able to use it properly. In my attempt I did'n gain any improvement in speed.

I made a simple notebook so to replicate what I'm doing:

https://gist.github.com/epifanio/72a48ca970a4291b293851ad29eadb50

My question is ... what's the proper way to use dask to accomplish my use case?

MRocklin · Accepted Answer · 2016-11-04T12:48:06.470

4

Pandas

In Pandas I would use the apply method

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1, 2, 3], 'b': [3, 2, 1]})

In [3]: def makegeom(row):
   ...:      a, b = row
   ...:      return 'Point(%s %s)' % (a, b)
   ...: 

In [4]: df.apply(makegeom, axis=1)
Out[4]: 
0    Point(1 3)
1    Point(2 2)
2    Point(3 1)
dtype: object

Dask.dataframe

In dask.dataframe you can do the same thing

In [5]: import dask.dataframe as dd

In [6]: ddf = dd.from_pandas(df, npartitions=2)

In [7]: ddf.apply(makegeom, axis=1).compute()
Out[7]: 
0    Point(1 3)
1    Point(2 2)
2    Point(3 1)

Add new series

In either case you can then add the new series to the dataframe

df['geom'] = df[['a', 'b']].apply(makegeom)

Create

If you have CSV data then I would use the dask.dataframe.read_csv function

ddf = dd.read_csv('filenames.*.csv')

If you have other kinds of data then I would use dask.delayed

edited Nov 04 '16 at 12:48

answered Nov 04 '16 at 12:34

MRocklin

55,641
23
163
235

I’m now trying to use delayed: https://gist.github.com/anonymous/34816085f0dfc2e26a5130a59aa920c1 but it is still working … 'htop' show that all the cpu are working but not at 100% I see 8 processes working at ~15% I'll try your approach. My data are stored as feather binary files (In the example I used csv to simplify a little my use case) – epifanio Nov 04 '16 at 12:49
1

Your makegeom function is bound by the GIL. You should read http://dask.readthedocs.io/en/latest/scheduler-choice.html to learn about choosing a good scheduler for your situation. – MRocklin Nov 04 '16 at 14:01
1

I’m still working on my issue. I changed the ```makegeom``` function, replacing ```apply``` with custom ```numpy``` code (which is much faster). Now I'm working on a [new notebook](https://gist.github.com/epifanio/b675c596c3b717ff08bbfef4a36879b2). My plan is to first learn a bit about queues and 'shared objects' among processes then understand how to use dask with distribute. – epifanio Nov 08 '16 at 12:12

score 1 · Answer 2 · answered Sep 18 '20 at 09:57

In the meantime, I have found other ways (alternative to Dask), in my opinion relatively easier, to perform a function func in parallel over a pandas data frame. In both cases, I took advantage of the numpy.array_split method.

One uses a combination of the python multiprocessing.Pool, numpy.array_split and pandas.concat and will work this way:

import numpy as np

def func(array):
    # do some computation on the given array
    pass

def parallelize_dataframe(df, func, n_cores=72):
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df

Another is by using the powerful but simple ray cluster (which is quite useful if you can run the code over multiple machines):

# connect to a ray cluster
# 

import ray

ray.init(address="auto", redis_password="5241590000000000")

import numpy as np


@ray.remote
def func(df):
    # do some computation on the given dataframe
    pass

df_split = np.array_split(df, 288)
result = pd.concat(ray.get([func.remote(i) for i in df_split]))

The methods above are working quite well for simple methods func where the computation can be carried out with numpy and the product which is returned can be concatenated back into a pandas data frame - for methods that do simpler file manipulation I also found useful parmap.map - but that is off-topic for this S.O. question.

read process and concatenate pandas dataframe in parallel with dask

2 Answers2

Pandas

Dask.dataframe

Add new series

Create

Linked