
I'm trying to parallelize time series forecasting in Python using Dask. The data is formatted so that each time series is a column and they all share a common index of monthly dates. I have a custom forecasting function that returns a time series object with the fitted and forecasted values. I want to apply this function across all columns of a dataframe (all time series) and return a new dataframe with all of these series, to be uploaded to a DB. I've gotten the code to work by running:

import dask.multiprocessing
import dask.dataframe as dd

data = pandas_df.copy()
ddata = dd.from_pandas(data, npartitions=1)
res = ddata.map_partitions(lambda df: df.apply(forecast_func, axis=0)) \
           .compute(get=dask.multiprocessing.get)
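
(Note: newer Dask versions have removed the get= keyword; on a current release the equivalent call would presumably be:)

res = ddata.map_partitions(lambda df: df.apply(forecast_func, axis=0)) \
           .compute(scheduler='processes')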

My question is: is there a way in Dask to partition by column instead of by row? In this use case I need to keep the ordered time index as-is for the forecasting function to work correctly.

If not, how would I re-format the data so that efficient large-scale forecasting is possible, while still returning the data in the format I need to push to a DB?

Example of the data format:
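
(The exact frame isn't reproduced here; the following is a hypothetical reconstruction of the layout described above, with made-up series names.)

import numpy as np
import pandas as pd

# hypothetical wide-format frame: a common monthly DatetimeIndex,
# one column per time series
dates = pd.date_range('2015-01-01', periods=36, freq='MS')
pandas_df = pd.DataFrame(np.random.rand(len(dates), 4),
                         index=dates,
                         columns=['series_a', 'series_b', 'series_c', 'series_d'])
print(pandas_df.head())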

Davis
  • Hi Davis, do you mind sharing the performance improvement you're seeing? Which algorithm are you using? I did something similar, but my time series were stacked vertically rather than horizontally. – rpanai Mar 22 '18 at 17:07
  • Sure, right now I'm using https://facebook.github.io/prophet/ Generally, fitting & predicting 1 time series took about 3 seconds. When I was just using apply on 1 core, it would take about an hour for 1000 time series. The time taken is basically reduced by a factor of the number of cores or workers you have (I haven't tested much with different cores vs workers), so with 3 cores/workers it took around 15 minutes to fit and predict 1000 time series. – Davis Mar 22 '18 at 17:44
  • I should note that probably half the compute time of the forecasting function is spent just manipulating the index to get the data to and from Prophet in the format I wanted, so you could probably make it faster if you didn't need to do that. But I wanted to create a general distributed framework for any forecasting function. All this method requires is that your dataframe is in the format shown, and that your function takes as input, and returns, a single time series object (although I'm not sure whether series of different lengths would break the dataframe combine functions). – Davis Mar 22 '18 at 18:23
  • Thank you, I was using the same algorithm, but my performance was pretty bad. I ended up using multiprocessing. I will try your approach to see whether things improve. About cores/workers, keep in mind that fbprophet uses pystan, so you can't use more workers than cores (see the GIL). I'm looking forward to seeing your framework (if it's on GitHub, please share the link). One note: have you thought about how to eventually add [regressors](https://facebook.github.io/prophet/docs/seasonality_and_holiday_effects.html), having the time series as columns? – rpanai Mar 22 '18 at 18:31
  • No problem, cool, yeah let me know. It's not on GitHub yet and probably won't be for a while, as it's all internal right now and I just started working on it. For x regressors, I've considered it, and it's not a problem if you only care about the predictions of y; but if you also want to see what the model predicted for the x's to understand your model, then I agree that will complicate things. I'm still trying to conceptualize what a general framework would even look like right now, and I'm not convinced it's ideal in its current state. – Davis Mar 22 '18 at 19:27
  • @user32185 I've released a very early pre-alpha version of the package, called magi. I could use some help adding x regressor and cross-validation support, as well as additional wrappers for the pyflux and statsmodels libraries. Check it out: https://github.com/DavisTownsend/magi – Davis May 14 '18 at 22:05
  • I was wondering why you had the time series in columns. The fact that you're working on the M4 datasets explains everything :) Thank you – rpanai May 17 '18 at 16:36

2 Answers


Thanks for the help, I really appreciate it. I've used the dask.delayed solution and it's working really well; it takes about 1/3 of the time just using a local cluster.

For anyone interested, here is the solution I've implemented:

from dask.distributed import Client, LocalCluster
import pandas as pd
import dask

# start a local cluster (newer dask.distributed versions use threads_per_worker instead of ncores)
cluster = LocalCluster(n_workers=3, ncores=3)
client = Client(cluster)

# build a list of delayed forecasts, one per time series (column)
output = []
for i in small_df:
    forecasted_series = dask.delayed(custom_forecast_func)(small_df[i])
    output.append(forecasted_series)

# compute all forecasts in parallel on the local cluster
total = dask.delayed(output).compute()

# combine the list of series into 1 long-format dataframe
full_df = pd.concat(total, ignore_index=False, keys=small_df.columns,
                    names=['time_series_names', 'Date'])
final_df = full_df.to_frame().reset_index()
final_df.columns = ['time_series_names', 'Date', 'value_variable']
final_df.head()

This gives you the melted (long-format) dataframe structure, so if you want the series to be the columns you can transform it with:

pivoted_df = final_df.pivot(index='Date', columns='time_series_names', values='value_variable')

small_df is a pandas DataFrame in the format described in the question, with Date as the index and one column per time series.
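
For reference, a hypothetical stand-in for small_df and custom_forecast_func (the real function in this thread wraps Prophet; this one just appends a naive mean forecast so the snippet above runs end to end):

import numpy as np
import pandas as pd

# hypothetical small_df: monthly Date index, one column per series
dates = pd.date_range('2015-01-01', periods=36, freq='MS')
small_df = pd.DataFrame(np.random.rand(len(dates), 5),
                        index=dates,
                        columns=['series_%d' % i for i in range(5)])
small_df.index.name = 'Date'

def custom_forecast_func(series, horizon=12):
    # stand-in forecaster: return the fitted history plus a naive mean
    # forecast for the next `horizon` months as one time series
    future_idx = pd.date_range(series.index[-1] + pd.offsets.MonthBegin(1),
                               periods=horizon, freq='MS')
    return pd.concat([series, pd.Series(series.mean(), index=future_idx)])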

Davis
  • I tried this method using `dask==2.15.0` and `fbprophet==0.6` with `small_df` containing 25 time series and 1000 observations per series. It worked, and the computation time with `LocalCluster` was about 2/3 of the time without using `dask` (i.e. the `dask.delayed` approach was approx. 33% faster). Next, I increased 25 to 50 time series and, again with `LocalCluster` running on my laptop (4 cores and 8GB), it ran out of memory very quickly. Do you have experience with trying to extend your `dask.delayed` **+** `fbprophet` approach to a `large_df` comprising 1000s of time series? – edesz May 02 '20 at 18:55

Dask dataframe only partitions data by rows. See the Dask dataframe documentation.

Dask array, however, can partition along any dimension. You have to use NumPy semantics though, rather than Pandas semantics.
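
For instance, chunking column-by-column with dask array might look roughly like this (forecast_block is a hypothetical per-series wrapper, not an existing helper):

import numpy as np
import dask.array as da

values = np.random.rand(120, 1000)   # rows = dates, columns = series

# chunk along the columns only, so each block holds one full-length series
arr = da.from_array(values, chunks=(values.shape[0], 1))

def forecast_block(block):
    # hypothetical wrapper: fit/forecast the single series in this block and
    # return an array of the same shape so map_blocks can reassemble the result
    return block

result = arr.map_blocks(forecast_block).compute(scheduler='processes')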

You can do anything you want with dask delayed or futures. This parallel computing example, from a more generic tutorial, might give you some ideas.
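
A minimal sketch of the futures route, assuming a running Client and the per-series forecast_func from the question:

import pandas as pd
from dask.distributed import Client

client = Client()   # starts a local cluster by default

# submit one task per column; futures begin computing as soon as they are submitted
futures = [client.submit(forecast_func, pandas_df[col]) for col in pandas_df.columns]
results = client.gather(futures)

# reassemble into a wide dataframe, one column per forecasted series
forecasts = pd.concat(results, axis=1, keys=pandas_df.columns)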

MRocklin
  • I was thinking that since all the series have the same length, I could just melt the dataframe and then partition (using chunksize=len(series)) so that each partition holds one time series object, then apply the function to that dask dataframe and rename the columns afterwards. Would there be any problem with having thousands of partitions by doing this, or issues with the ordering of the series relative to the column names? – Davis Mar 22 '18 at 15:27
  • I don't fully understand the comment (I'm not familiar with melt, or much of the pandas API). Generally I don't know of a clever way to get column-wise parallelism out of dask.dataframe. I recommend using dask array or dask delayed. Thousands of partitions is fine, but adds overhead. – MRocklin Mar 22 '18 at 16:26