
I'm new to KNIME and am trying to use ARIMA to extrapolate my time series data, but I can't get the ARIMA Predictor to do its work.

The input data has the following format:

year,cv_diff
2011,-4799.099999999977
2012,60653.5
2013,64547.5
2014,60420.79999999993

I would like to predict values for, say, 2015 and 2016.

I'm using the String to Date/Time node to convert year to a date. In the ARIMA Learner I can choose only the cv_diff field. This is my first question: for the option 'Column containing univariate time series', should I set the year column or the variable that I'm going to predict? In my case there is only one option anyway, the cv_diff variable. After that I connect the Learner's output to the ARIMA Predictor's input and execute. Execution fails with 'ERROR ARIMA Predictor 2:3 Execute failed: The column with the defined time series was not found. Please configure the node anew.'

Please help me understand which variable I should set for the Learner and the Predictor. Should it be the non-time-series variable? And how will the ARIMA nodes then know which column to use as the time series?

Deil

2 Answers


You should set cv_diff as the time series variable and connect the input table to the Predictor too. (And do not set the parameters too large: with so few data points, learning will not work.)

Here is an example:

Predictor configuration with visualization

Gábor Bakos
  • @Gábor Bakos Thank you! I just noticed your comment. Yes, large parameters will not work for such a small dataset. Could you tell me about the grey area around the predicted line? Is it the area where the next prediction is likely to fall? – Deil Jun 26 '17 at 18:48
  • Yes, the grey area is where the actual data points are with `.95` probability. (That confidence interval can be adjusted in the visualization and also in the view.) In visualization multiple models can be shown (though not with the KNIME ARIMA Learner). – Gábor Bakos Jun 26 '17 at 19:00
  • Thank you one more time. – Deil Jun 26 '17 at 19:05

Finally, I've figured it out. The option 'Column containing univariate time series' in the ARIMA Learner node is a little confusing, especially for those unfamiliar with time series analysis. I shouldn't have provided any time column explicitly, because ARIMA treats the variable it is going to predict as collected at equal time intervals, and it doesn't matter what kind of intervals they are.

I've found a good explanation of what 'univariate time series' means:

The term "univariate time series" refers to a time series that consists of single (scalar) observations recorded sequentially over equal time increments. Some examples are monthly CO2 concentrations and southern oscillations to predict el nino effects. Although a univariate time series data set is usually given as a single column of numbers, time is in fact an implicit variable in the time series. If the data are equi-spaced, the time variable, or index, does not need to be explicitly given. The time variable may sometimes be explicitly used for plotting the series. However, it is not used in the time series model itself.

So I should choose the cv_diff variable for both the Learner and the Predictor and not provide any timestamps or other time-related columns.
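
To make the "time is implicit" point concrete, here is a minimal Python sketch. It is not what the KNIME nodes do internally; it assumes the simplest ARIMA variant, ARIMA(0,1,0) with a constant (a random walk with drift), and the function names `fit_drift` and `forecast` are made up for illustration. Note that the year column never enters the model.

```python
from statistics import mean


def fit_drift(series):
    """Fit ARIMA(0,1,0) with a constant: y[t] = y[t-1] + drift + noise.

    The estimate of the drift is the mean of the first differences.
    """
    diffs = [b - a for a, b in zip(series, series[1:])]
    return mean(diffs)


def forecast(series, drift, steps):
    """Continue the series `steps` points past its last observation."""
    last = series[-1]
    return [last + drift * h for h in range(1, steps + 1)]


# Only the values go into the model; the years are never passed in.
years = [2011, 2012, 2013, 2014]
cv_diff = [-4799.099999999977, 60653.5, 64547.5, 60420.79999999993]

drift = fit_drift(cv_diff)
predicted = forecast(cv_diff, drift, steps=2)

# The time axis is reattached afterwards, purely for labelling.
future_years = [years[-1] + h for h in range(1, len(predicted) + 1)]
for year, value in zip(future_years, predicted):
    print(year, round(value, 1))
```

The forecast step needs the series itself, not just the fitted parameter, because it continues from the last observed value. With only four points, anything beyond such a trivial model would be fitted to noise.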

One more thing that I didn't understand at first: I train on some series of data and then have to provide another SERIES for which I want predictions. That is a little different from other machine learning workflows, where you only need to provide new data and there is no notion of a series at all.

Deil