I'm developing a model for financial purposes. I have all the S&P 500 components in a folder, stored as separate .hdf files. Each .hdf file has its own multi-index (year-week-minute).
An example of the sequential (non-parallelized) code:
    import os
    from classAsset import Asset

    def model(current_period, previous_period):
        # do stuff on the current period, based on stats derived from previous_period
        results = None  # placeholder for the model's output
        return results

    if __name__ == '__main__':
        for hdf_file in os.listdir('data_path'):
            asset = Asset(hdf_file)
            # level 0 of the multi-index is the year
            for year in asset.data.index.get_level_values(0).unique().values:
                # after selecting a year, level 0 of what remains is the week
                for week in asset.data.loc[year].index.get_level_values(0).unique().values:
                    # start and end are defined in another function
                    previous_period = asset.data.loc[(start):(end)].Open.values
                    current_period = asset.data.loc[year, week].Open.values
                    model(current_period, previous_period)
To speed things up, I'm using multiprocessing.Pool to run the same algorithm on multiple .hdf files at the same time, and I'm quite satisfied with the processing speed (I have a 4-core/8-thread CPU). But I have now discovered Dask.
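A minimal sketch of what such a per-file pool might look like. The names `process_file` and `run_pool` and the file names are hypothetical; in the real code `process_file` would open one .hdf file and run the year/week loop on it, whereas here it only returns a dummy value so the sketch stays self-contained:

```python
from multiprocessing import Pool


def process_file(hdf_file):
    # Hypothetical stand-in for the per-file work: the real version would
    # load the asset and call model() for every (year, week) pair.
    return hdf_file, len(hdf_file)


def run_pool(hdf_files, processes=8):
    # One task per file; the pool hands files to worker processes
    # and returns the results in the same order as the inputs.
    with Pool(processes=processes) as pool:
        return pool.map(process_file, hdf_files)


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    print(run_pool(["AAPL.hdf", "MSFT.hdf"]))
```

The parallelism here is across files only; inside each worker everything is still plain, sequential pandas.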
In the Dask documentation, the 'DataFrame Overview' page says:
Trivially parallelizable operations (fast):
- Elementwise operations: df.x + df.y, df * df
- Row-wise selections: df[df.x > 0]
- Loc: df.loc[4.0:10.5] (this is what interests me the most)
Also, the 'Use Cases' page of the Dask documentation says:
A programmer has a function that they want to run many times on different inputs. Their function and inputs might use arrays or dataframes internally, but conceptually their problem isn’t a single large array or dataframe.
They want to run these functions in parallel on their laptop while they prototype but they also intend to eventually use an in-house cluster. They wrap their function in dask.delayed and let the appropriate dask scheduler parallelize and load balance the work.
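As far as I understand it, applying that pattern to my case would look roughly like the sketch below. `process_file` and `run_all` are hypothetical names, and `process_file` is again a dummy stand-in for the per-file work; `dask.delayed`, `dask.compute` and the `scheduler="processes"` option are the only Dask pieces used:

```python
import dask
from dask import delayed


def process_file(hdf_file):
    # Hypothetical stand-in for "open one .hdf file and run model()
    # on every week"; returns a dummy value for illustration.
    return hdf_file, len(hdf_file)


def run_all(hdf_files, scheduler="processes"):
    # Build one lazy task per file; nothing runs until dask.compute is called.
    tasks = [delayed(process_file)(f) for f in hdf_files]
    # dask.compute returns a tuple with one result per task.
    return dask.compute(*tasks, scheduler=scheduler)


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    print(run_all(["AAPL.hdf", "MSFT.hdf"]))
```

Written this way it has the same shape as the pool version: one independent task per file, with the scheduler deciding where each task runs.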
So I'm sure I'm missing something, or probably more than just something. What's the difference between processing many single pandas dataframes with multiprocessing.pool and dask.multiprocessing?
Do you think I should use Dask for my specific case? Thank you guys.