
I have multiple time series that I would like to forecast with GluonTS, then concatenate so that the result is a pandas DataFrame with the column headers date, y (the target), and series (the series number).

The problem is that GluonTS produces a generator. I can look at each series with next(iter(forecast_it)), but I would like to stack all of the forecasts together to make them easier to export as a CSV.

How can I stack the forecasts from all series into one pandas dataframe?

import pandas as pd
import numpy as np
from gluonts.model.baseline import SeasonalNaivePredictor
from gluonts.evaluation.backtest import make_evaluation_predictions

N = 10  # number of time series
T = 100  # number of timesteps
prediction_length = 24
custom_dataset = np.random.normal(size=(N, T))
start = pd.Timestamp("01-01-2019", freq='1H') 

# train dataset: cut the last window of length "prediction_length", add "target" and "start" fields
train_ds = [{'target': x, 'start': start} for x in custom_dataset[:, :-prediction_length]]
# test dataset: use the whole dataset, add "target" and "start" fields
test_ds = [{'target': x, 'start': start} for x in custom_dataset]

predictor = SeasonalNaivePredictor(
    prediction_length=prediction_length,
    season_length=24,
    freq='1H'
)

forecast_it, ts_it = make_evaluation_predictions(
    dataset=test_ds,  # test dataset
    predictor=predictor,  # predictor
    num_samples=100,  # number of sample paths we want for evaluation
)

test_entry = next(iter(forecast_it))
print(test_entry)
> gluonts.model.forecast.SampleForecast(freq="1H", info=None, item_id=None, samples=numpy.array([[-1.078548550605774, 0.3002452254295349, 0.1025903970003128, -1.6613410711288452, -0.2776057720184326, -0.020864564925432205, -1.9355241060256958, 1.0598571300506592, 0.16316552460193634, -0.9441472887992859, 2.7307169437408447, -0.35861697793006897, 0.22022956609725952, 0.8052476048469543, -1.1194337606430054, 0.05703512206673622, -1.1357367038726807, -2.544445037841797, 1.2661969661712646, 0.17130693793296814, 0.8647393584251404, -1.9620181322097778, -0.5465423464775085, 0.26572829484939575]], numpy.dtype("float32")), start_date=pandas.Timestamp("2019-01-04 04:00:00", freq="H"))
Alex

1 Answer


You can unpack an entry like so:

def sample_df(forecast):
    samples = forecast.samples
    ns, h = samples.shape
    dates = pd.date_range(forecast.start_date, freq=forecast.freq, periods=h)
    return pd.DataFrame(samples.T, index=dates)

This is just grabbing various properties from the SampleForecast.

It starts with the forecast samples, an ndarray with a row per sample and a column per time period. The number of columns gives the forecasting horizon h, which, together with the start_date and freq properties, can be passed to pd.date_range to construct the forecast dates.

Then the samples are transposed, giving a row per time period and a column per sample. Indexing that with the reconstructed dates produces a DataFrame for one forecast entry.

sample_df(test_entry)
#                             0
# 2019-01-04 04:00:00  0.748107
# 2019-01-04 05:00:00  1.620660
# 2019-01-04 06:00:00 -0.648520
# 2019-01-04 07:00:00  0.277669
# 2019-01-04 08:00:00 -1.010820
# ...

To process all of your results, you can run this function over each forecast entry and put the outputs together with pd.concat.

parts = [sample_df(entry).assign(entry=i)
         for i, entry in enumerate(forecast_it)]
pd.concat(parts)
#                             0  entry
# 2019-01-04 04:00:00  0.748107      0
# 2019-01-04 05:00:00  1.620660      0
# 2019-01-04 06:00:00 -0.648520      0
# 2019-01-04 07:00:00  0.277669      0
# 2019-01-04 08:00:00 -1.010820      0
# ...                       ...    ...
# 2019-01-04 23:00:00  0.999718      9
# 2019-01-05 00:00:00  0.027250      9
# 2019-01-05 01:00:00  2.030961      9
# 2019-01-05 02:00:00 -1.414711      9
# 2019-01-05 03:00:00  0.737124      9

This also tags each DataFrame with an entry column to mark which of the forecast results it came from.

You can also use pd.DataFrame.melt to convert from one column per sample to a long form with a column that identifies the sample. A few renames at the end make everything tidy for later analysis.

long_form = pd.concat(parts).reset_index().melt(['index', 'entry'])
long_form.rename(columns={
    'index': 'ts',
    'variable': 'sample',
    'value': 'forecast',
})
#                      ts  entry sample  forecast
# 0   2019-01-04 04:00:00      0      0  0.748107
# 1   2019-01-04 05:00:00      0      0  1.620660
# 2   2019-01-04 06:00:00      0      0 -0.648520
# 3   2019-01-04 07:00:00      0      0  0.277669
# 4   2019-01-04 08:00:00      0      0 -1.010820
# ..                  ...    ...    ...       ...
# 235 2019-01-04 23:00:00      9      0  0.999718
# 236 2019-01-05 00:00:00      9      0  0.027250
# 237 2019-01-05 01:00:00      9      0  2.030961
# 238 2019-01-05 02:00:00      9      0 -1.414711
# 239 2019-01-05 03:00:00      9      0  0.737124
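To finish with the exact headers from the question (date, y, series) and a CSV on disk, one more pass over the long form will do. This sketch assumes you want the mean over sample paths as the single y value; the tiny stand-in DataFrame below just mimics the long-form columns (index, entry, variable, value) so the snippet runs on its own:

```python
import pandas as pd

# Stand-in for the long_form DataFrame built above, before renaming:
# two timesteps, one series, two sample paths.
long_form = pd.DataFrame({
    'index': pd.to_datetime(['2019-01-04 04:00:00'] * 2
                            + ['2019-01-04 05:00:00'] * 2),
    'entry': [0, 0, 0, 0],
    'variable': [0, 1, 0, 1],
    'value': [0.5, 1.5, -1.0, 1.0],
})

# Average over samples to get one point forecast per (timestep, series),
# then rename to the question's headers: date, y, series.
out = (
    long_form
    .groupby(['index', 'entry'], as_index=False)['value'].mean()
    .rename(columns={'index': 'date', 'value': 'y', 'entry': 'series'})
)
out.to_csv('forecasts.csv', index=False)
```

If you want to keep every sample path instead of a point forecast, skip the groupby and just rename and export the long form directly.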

Note: This code should work for any number of samples, giving a column for each (or rows in the long form). But the results here only have one sample. What gives?

I read through the relevant code, and the RepresentablePredictor base class will only ever generate one sample, no matter what you ask for from make_evaluation_predictions; the num_samples parameter is simply never passed along. This base class is used for non-Gluon forecasting methods, so I guess they are expected to be deterministic and only suitable for producing a single sample path. Or it's a bug.
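Either way, a quick self-contained check confirms that sample_df itself handles any number of samples. This uses a plain namedtuple as a stand-in for SampleForecast (so Fake and its field values here are assumptions for illustration, not GluonTS API), exposing only the three attributes the function reads:

```python
from collections import namedtuple

import numpy as np
import pandas as pd

# Minimal stand-in with the three attributes sample_df uses.
Fake = namedtuple('Fake', ['samples', 'start_date', 'freq'])

def sample_df(forecast):
    samples = forecast.samples
    ns, h = samples.shape
    dates = pd.date_range(forecast.start_date, freq=forecast.freq, periods=h)
    return pd.DataFrame(samples.T, index=dates)

# 100 sample paths over a 24-step horizon.
fc = Fake(samples=np.random.normal(size=(100, 24)),
          start_date=pd.Timestamp('2019-01-04 04:00:00'),
          freq='1H')
df = sample_df(fc)
print(df.shape)  # (24, 100): a row per timestep, a column per sample path
```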

mcskinner