I have a folder containing 497 pandas DataFrames stored as .parquet files; the folder's total size is 7.6 GB.
I'm trying to develop a simple trading system, so I created two classes: the main one is `Portfolio`, which creates an `Asset` object for every single DataFrame in the data folder.
```python
import os

import pandas as pd
from dask.delayed import delayed


class Asset:
    def __init__(self, file):
        self.data_path = 'path\\to\\data\\folder\\'
        self.data = pd.read_parquet(self.data_path + file, engine='auto')


class Portfolio:
    def __init__(self):
        self.data_path = 'path\\to\\data\\folder\\'
        self.files_list = [file for file in os.listdir(self.data_path) if file.endswith('.parquet')]
        self.assets_list = []
        self.results = None
        self.shared_data = '???'

    def assets_loading(self):
        for file in self.files_list:
            tmp = Asset(file)
            self.assets_list.append(tmp)

    def dask_delayed(self):
        for asset in self.assets_list:
            backtest = delayed(self.model)(asset)

    def dask_compute(self):
        self.results = delayed(self.dask_delayed())
        self.results.compute()

    def model(self, asset):
        # do stuff
        pass


if __name__ == '__main__':
    portfolio = Portfolio()
    portfolio.dask_compute()
```
I'm doing something wrong, because it looks like the results are never processed. If I check `portfolio.results`, the console prints:

```
Out[5]: Delayed('NoneType-7512ffcc-3b10-445f-928a-f01c01bae29c')
```
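For comparison, a minimal standalone `delayed` example does return computed results for me, so I suspect the problem is in how I wire the tasks together in the class (toy code, not my real model):

```python
import dask
from dask import delayed


def model(x):
    # toy stand-in for the real per-asset model
    return x * 2


# build a list of lazy tasks, then compute them all at once
tasks = [delayed(model)(i) for i in range(4)]
results = dask.compute(*tasks)
print(results)  # (0, 2, 4, 6)
```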
So here are my questions:
- Can you explain what's wrong?
- When I run `assets_loading()` I'm basically loading the entire data folder into memory for faster processing, but it saturates my RAM (16 GB available). I didn't think a 7.6 GB folder could saturate 16 GB of RAM, which is why I want to use Dask. Is there a solution compatible with my script's workflow?
- There is another problem, and probably the bigger one. With Dask I'm trying to parallelize the `model` function over multiple assets at the same time, but I need shared memory (`self.shared_data` in the script) to get values that live inside each Dask process (for example, a single asset's yearly performance) back into the `Portfolio` object. Can you explain how I can share data between Dask delayed tasks and how to store that data in a `Portfolio` attribute?
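On the RAM point, the direction I'm considering is deferring each read into a `delayed` task so nothing is loaded until compute time. Here is a toy version of what I mean, with a fake reader standing in for `pd.read_parquet` just to show *when* loading happens:

```python
from dask import delayed

loaded = []


def fake_read(path):
    # stand-in for pd.read_parquet; records when it is actually called
    loaded.append(path)
    return path


# building the task graph loads nothing
lazy_assets = [delayed(fake_read)(f) for f in ['a.parquet', 'b.parquet']]
print(loaded)  # []

# data is only read when the tasks are computed
results = [task.compute() for task in lazy_assets]
print(loaded)  # ['a.parquet', 'b.parquet']
```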
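To make the last question concrete, here's a toy of what I mean by "sharing": each task returns its asset's yearly performance, and I'd like `Portfolio` to end up holding something like this dict (names and numbers are made up):

```python
import dask
from dask import delayed


def year_perf(name, prices):
    # toy per-asset metric computed inside a Dask task
    return name, prices[-1] / prices[0] - 1


tasks = [
    delayed(year_perf)('A', [100, 125]),
    delayed(year_perf)('B', [100, 50]),
]

# gather all task results back into the main process
perf = dict(dask.compute(*tasks))
print(perf)  # {'A': 0.25, 'B': -0.5}
```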
Thanks a lot.