
I'm developing a model for financial purposes. I have all of the S&P 500 components inside a folder, stored as many separate .hdf files. Each .hdf file has its own MultiIndex (year-week-minute).

An example of the sequential (non-parallelized) code:

import os
from classAsset import Asset


def model(current_period, previous_period):
    # do stuff on the current period, based on stats derived from previous_period
    return results

if __name__ == '__main__':
    for hdf_file in os.listdir('data_path'):
        asset = Asset(hdf_file)
        for year in asset.data.index.get_level_values(0).unique().values:
            for week in asset.data.loc[year].index.get_level_values(0).unique().values:

                previous_period = asset.data.loc[(start):(end)].Open.values  # start and end are defined in another function
                current_period = asset.data.loc[year, week].Open.values

                model(current_period, previous_period)

To speed up the process, I'm using multiprocessing.Pool to run the same algorithm on multiple .hdf files at the same time, and I'm quite satisfied with the processing speed (I have a 4c/8t CPU). But now I've discovered Dask.
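
For context, this is roughly how I parallelize it today (a minimal sketch: process_file is a hypothetical wrapper around the inner loop above, and model, start and end are defined as in the sequential example):

import os
from multiprocessing import Pool

from classAsset import Asset


def process_file(hdf_file):
    # run the whole year/week loop for one .hdf file
    asset = Asset(hdf_file)
    results = []
    for year in asset.data.index.get_level_values(0).unique().values:
        for week in asset.data.loc[year].index.get_level_values(0).unique().values:
            previous_period = asset.data.loc[(start):(end)].Open.values
            current_period = asset.data.loc[year, week].Open.values
            results.append(model(current_period, previous_period))
    return results


if __name__ == '__main__':
    with Pool(processes=8) as pool:  # 8 worker processes on my 4c/8t CPU
        all_results = pool.map(process_file, os.listdir('data_path'))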

In the Dask documentation, under 'DataFrame Overview', they indicate:

Trivially parallelizable operations (fast):

  • Elementwise operations: df.x + df.y, df * df
  • Row-wise selections: df[df.x > 0]
  • Loc: df.loc[4.0:10.5] (this is what interests me the most; see the sketch below)
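
For example, I imagine a .loc selection on a Dask DataFrame would look roughly like this (just a sketch of the API as I understand it; the file name and the key 'data' are made up, and I'm assuming a plain sortable index rather than my MultiIndex):

import dask.dataframe as dd

ddf = dd.read_hdf('data_path/AAPL.hdf', key='data')  # lazy; nothing is read yet
subset = ddf.loc[4.0:10.5]                           # also lazy, as in the docs example
values = subset.Open.values.compute()                # the actual work runs here, in parallel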

Also, in the Dask documentation, under 'Use Cases', they indicate:

A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn’t a single large array or dataframe.

They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an in-house cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.
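
If I understand that correctly, my case would map onto dask.delayed roughly like this (a sketch, reusing the hypothetical process_file wrapper from above):

import os
import dask

# build one lazy task per .hdf file; nothing is executed yet
tasks = [dask.delayed(process_file)(hdf_file) for hdf_file in os.listdir('data_path')]

# let the scheduler parallelize and load-balance the work across processes
all_results = dask.compute(*tasks, scheduler='processes')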

So I'm sure I'm missing something, or probably more than just one thing. What's the difference between processing many individual pandas DataFrames with multiprocessing.Pool and processing them with dask.multiprocessing?

Do you think I should use Dask for my specific case? Thank you guys.

ilpomo

1 Answer


There is no difference. Dask is doing just what you are doing in your custom code. It uses pandas and a thread or multiprocessing pool for parallelism.

You might prefer Dask for a few reasons:

  1. It would figure out how to write the parallel algorithms automatically (see the sketch after this list)
  2. You may want to scale to a cluster in the future
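
For example (an illustrative sketch; the file pattern and the key 'data' are placeholders), a pandas-style reduction over many HDF files:

import dask.dataframe as dd

# one lazy, partitioned dataframe covering every .hdf file in the folder
ddf = dd.read_hdf('data_path/*.hdf', key='data')

# written like pandas, but Dask generates the parallel algorithm for you:
# per-partition partial results combined in a final step
mean_open = ddf.Open.mean().compute()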

But if what you have works well for you then I would just stay with that.

MRocklin
  • Can you explain the first point to me? Or give me a link with a detailed explanation, or an example. Thank you. – ilpomo Oct 18 '17 at 19:40
  • Dask.dataframe provides parallel algorithms for a large subset of the Pandas API (as well as several other APIs). http://dask.pydata.org/en/latest/dataframe.html – MRocklin Oct 19 '17 at 01:35