A few things to note:

  • I've got a total of 48 GB of RAM

  • Here are the versions of the libraries I'm using:

    • Python 3.7.7
    • dask==2.18.0
    • fbprophet==0.6
    • pandas==1.0.3

The only reason I import pandas is for this one line:
pd.options.mode.chained_assignment = None
This helps stop dask from erroring when I'm using dask.distributed.

So, I have a 21 GB CSV file that I am reading with dask in a Jupyter notebook... I've also tried to read it from my MySQL database table, but the kernel eventually crashes.

I've tried multiple combinations of local workers, threads, available memory, and available storage memory, and I have even tried not using distributed at all. I have also tried chunking with pandas (not with the line mentioned above related to pandas; roughly what I mean by chunking is sketched below), however, even with chunking the kernel still crashes...
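
(For clarity, by chunking I just mean reading the file in pieces rather than all at once; the chunk size and the final concat below are only illustrative, not my exact code.)

import pandas as pd

pieces = []
for chunk in pd.read_csv('provide_your_own_csv_file_here.csv',
                         parse_dates=['Time (UTC)'],
                         chunksize=1_000_000):
    # work on one piece at a time
    chunk['y'] = chunk[['a', 'b']].mean(axis=1)
    pieces.append(chunk)

# concatenating everything back still needs the full dataset in memory
df = pd.concat(pieces)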

I can now load the csv with dask and apply a few transformations, such as setting the index and adding the column names that fbprophet requires... but I am still not able to materialize the dataframe with df.compute(), and I think that is why I am receiving the error I do from fbprophet. After I have added the columns y and ds with the appropriate dtypes, I receive the error Truth of Delayed objects is not supported, and I think this is because fbprophet expects the dataframe not to be lazy, which is why I'm trying to run compute beforehand. I have also bumped up the RAM on the client to allow it to use the full 48 GB, as I suspected that it might be trying to load the data twice; however, this still failed, so most likely this wasn't the case / isn't causing the problem.

Alongside this, fbprophet is also mentioned in the dask documentation for applying machine learning to dataframes, yet I really don't understand why this isn't working... I've also tried modin with ray and with dask, with basically the same result.

Another question, regarding memory usage: I am getting this warning

distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 32.35 GB -- Worker memory limit: 25.00 GB

when assigning the client, reading the csv file, and applying operations/transformations to the dataframe. However, the allotted size is larger than the csv file itself, so this confuses me...
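
(For reference, the limit in that warning is the per-worker limit set when the client is created; the worker counts and limits below are just an example, not my exact setup.)

from dask.distributed import Client

# memory_limit applies per worker, so 2 workers x 24 GB stays within 48 GB total
client = Client(n_workers=2, threads_per_worker=4, memory_limit='24GB')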

What I have done to try and solve this myself:

  • Googling, of course, which did not turn up anything :-/
  • Asking a Discord help channel, on multiple occasions
  • Asking an IRC help channel, on multiple occasions

Anyways, I would really appreciate any help on this problem!!! Thank you in advance :)

MCVE

from dask.distributed import Client
import dask.dataframe as dd
import pandas as pd
from fbprophet import Prophet

pd.options.mode.chained_assignment = None
# small local cluster for this reproducible example
client = Client(n_workers=2, threads_per_worker=4, processes=False, memory_limit='4GB')
csv_file = 'provide_your_own_csv_file_here.csv'
df = dd.read_csv(csv_file, parse_dates=['Time (UTC)'])
df = df.set_index('Time (UTC)')
df['y'] = df[['a', 'b']].mean(axis=1)  # the series fbprophet should forecast
m = Prophet(daily_seasonality=True)
m.fit(df)  # ERROR: Truth of Delayed objects is not supported

2 Answers


Unfortunately Prophet doesn't support Dask dataframes today.

The example that you refer to shows using Dask to accelerate Prophet's fitting on Pandas dataframes. Dask Dataframe is only one way that people use Dask.
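
A rough sketch of that idea (hedged; per_series_frames is a placeholder for a list of ordinary pandas DataFrames, and the documented example may differ in detail): fit Prophet on pandas data and let dask.delayed run the fits in parallel.

from dask import compute, delayed
from fbprophet import Prophet

@delayed
def fit_and_forecast(pdf, horizon=30):
    # pdf is a plain pandas DataFrame with 'ds' and 'y' columns
    m = Prophet(daily_seasonality=True)
    m.fit(pdf)
    future = m.make_future_dataframe(periods=horizon)
    return m.predict(future)

# one delayed fit per series, executed in parallel by Dask
forecasts = compute(*[fit_and_forecast(pdf) for pdf in per_series_frames])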


As already suggested, one approach is to use dask.delayed with a pandas DataFrame, and skip dask.dataframe.

You could use a simplified version of the load-clean-analyze pipeline shown for custom computations using Dask.

Here is one possible approach based on this type of custom pipeline, using a small dataset (to create an MCVE). Every step in the pipeline will be delayed.

Imports

import numpy as np
import pandas as pd
from dask import delayed
from dask.distributed import Client
from fbprophet import Prophet

Generate some data in a .csv, with column names Time (UTC), a and b

def generate_csv(nrows, fname):
    df = pd.DataFrame(np.random.rand(nrows, 2), columns=["a", "b"])
    df["Time (UTC)"] = pd.date_range(start="1850-01-01", periods=nrows)
    df.to_csv(fname, index=False)

First write the load function from the pipeline, to load the .csv with Pandas, and delay its execution using the dask.delayed decorator

  • it might be good to use read_csv with nrows, to see how the pipeline performs on a subset of the data rather than loading it all
  • this will return a dask.delayed object and not a pandas.DataFrame
@delayed
def load_data(fname, nrows=None):
    return pd.read_csv(fname, nrows=nrows)

Now create the process function, to process data using pandas, again delayed since its input is a dask.delayed object and not a pandas.DataFrame

@delayed
def process_data(df):
    df = df.rename(columns={"Time (UTC)": "ds"})
    df["y"] = df[["a", "b"]].mean(axis=1)
    return df

Last function - this one will train fbprophet on the data (loaded from the .csv and processed, but delayed) to make a forecast. This analyze function is also delayed, since one of its inputs is a dask.delayed object

@delayed
def analyze(df, horizon):
    m = Prophet(daily_seasonality=True)
    m.fit(df)
    future = m.make_future_dataframe(periods=horizon)
    forecast = m.predict(future)
    return forecast

Run the pipeline (if running from a Python script, requires __name__ == "__main__")

  • the output of the pipeline (a forecast by fbprophet) is stored in a variable result, which is delayed
    • when this output is computed, this will generate a pandas.DataFrame (corresponding to the output of a forecast by fbprophet), so it can be evaluated using result.compute()
if __name__ == "__main__":
    horizon = 8
    num_rows_data = 40
    num_rows_to_load = 35
    csv_fname = "my_file.csv"

    generate_csv(num_rows_data, csv_fname)

    client = Client()  # modify this as required

    df = load_data(csv_fname, nrows=num_rows_to_load)
    df = process_data(df)
    result = analyze(df, horizon)
    forecast = result.compute()

    client.close()

    assert len(forecast) == num_rows_to_load + horizon
    print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].head())

Output

          ds      yhat  yhat_lower  yhat_upper
0 1850-01-01  0.330649    0.095788    0.573378
1 1850-01-02  0.493025    0.266692    0.724632
2 1850-01-03  0.573344    0.348953    0.822692
3 1850-01-04  0.491388    0.246458    0.712400
4 1850-01-05  0.307939    0.066030    0.548981