
I have a folder containing 497 pandas DataFrames stored as .parquet files. The folder's total size is 7.6 GB.

I'm trying to develop a simple trading system, so I created two different classes. The main one is Portfolio; this class then creates an Asset object for every single dataframe in the data folder.

import os
import pandas as pd
from dask.delayed import delayed


class Asset:
    def __init__(self, file):
        self.data_path = 'path\\to\\data\\folder\\'
        self.data = pd.read_parquet(self.data_path + file, engine='auto')

class Portfolio:
    def __init__(self):
        self.data_path = 'path\\to\\data\\folder\\'
        self.files_list = [file for file in os.listdir(self.data_path) if file.endswith('.parquet')]
        self.assets_list = []
        self.results = None
        self.shared_data = '???'

    def assets_loading(self):
        for file in self.files_list:
            tmp = Asset(file)
            self.assets_list.append(tmp)

    def dask_delayed(self):
        for asset in self.assets_list:
            backtest = delayed(self.model)(asset)

    def dask_compute(self):
        self.results = delayed(dask_delayed)
        self.results.compute()

    def model(self, asset):
        # model logic goes here
        pass

if __name__ == '__main__':
    portfolio = Portfolio()
    portfolio.dask_compute()

I'm doing something wrong, because it looks like the results are not processed. If I check portfolio.results, the console prints:

Out[5]: Delayed('NoneType-7512ffcc-3b10-445f-928a-f01c01bae29c')

So here are my questions:

  1. Can you explain what's wrong?
  2. When I run the assets_loading() function I'm basically loading the entire data folder into memory for faster processing, but it saturates my RAM (16 GB available). I didn't think that a 7.6 GB folder could saturate 16 GB of RAM; that's why I want to use Dask. Is there any solution compatible with my script's workflow?
  3. There is another problem, and probably the bigger one. With Dask I'm trying to parallelize the model function over multiple assets at the same time, but I need shared memory (self.shared_data in the script) to bring some variable values that live inside each Dask process back into the Portfolio object (for example, a single asset's yearly performance). Can you explain how I can share data between Dask delayed processes and how to store this data in a Portfolio variable?

Thanks a lot.

ilpomo

1 Answer


There are a few things wrong with the line self.results = delayed(dask_delayed):

  • Here you are creating a delayed function, not a delayed result; you need to call the delayed function.
  • dask_delayed is not defined here; you probably mean self.dask_delayed.
  • The method dask_delayed does not return anything.
  • You call .compute() (which doesn't exist for a delayed function, only a delayed result), but you don't store the output; computing doesn't happen in-place, as you seem to assume.

You probably wanted

self.results = delayed(self.dask_delayed)().compute()
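
To make the function/result distinction concrete, here is a tiny standalone sketch (the add function is just a made-up example, not from your code):

from dask import delayed

def add(a, b):
    return a + b

lazy_add = delayed(add)     # a delayed *function*: nothing is scheduled yet
lazy_sum = lazy_add(1, 2)   # calling it produces a delayed *result* (a Delayed object)
total = lazy_sum.compute()  # compute() returns 3; it does not fill anything in in-place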

Now you need to fix dask_delayed() so that it returns something. It should not be calling more delayed functions, since it is itself already going to be delayed.
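
Putting those two points together, a minimal sketch of what the two methods could look like inside Portfolio (assuming model() returns some per-asset value, e.g. a yearly performance number):

    def dask_delayed(self):
        # plain calls to model(), no further delayed(), because this whole method gets delayed
        return [self.model(asset) for asset in self.assets_list]

    def dask_compute(self):
        # calling the delayed function yields a delayed result; compute() returns the actual values
        self.results = delayed(self.dask_delayed)().compute()

If you later want each asset to run as its own parallel task, the usual pattern is the opposite: keep delayed(self.model)(asset) inside dask_delayed, return that list, and call dask.compute(*that_list) instead of delaying dask_delayed itself.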

Finally, regarding filling up memory with pd.read_parquet: it does not surprise me that the in-memory version of the data is bigger, since compression/encoding is one of the aims of the parquet format. You could try using dask.dataframe.read_parquet, which is lazy/on-demand.
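
For example, a sketch (the glob path is an assumption matching your folder layout):

import dask.dataframe as dd

# Only metadata is read here; the actual data stays on disk
ddf = dd.read_parquet('path\\to\\data\\folder\\*.parquet')

# Work stays lazy until you explicitly materialize something
print(ddf.head())  # loads just the first partition
print(len(ddf))    # triggers a computation over all the files

You could also keep one lazy frame per Asset with dd.read_parquet(self.data_path + file), so each file is only loaded when a task actually needs it.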

mdurant