
Scikit-learn uses a very convenient approach based on fit and predict methods. I have time-series data in a format suited for fit and predict.

For example I have the following Xs:

[[1.0, 2.3, 4.5], [6.7, 2.7, 1.2], ..., [3.2, 4.7, 1.1]]

and the corresponding ys:

[[1.0], [2.3], ..., [7.7]]

These data have the following meaning. The values stored in ys form a time series. The values in Xs are corresponding time dependent "factors" that are known to have some influence on the values in ys (for example: temperature, humidity and atmospheric pressure).

Now, of course, I can use fit(Xs,ys). But then I get a model in which future values in ys depend only on the factors and do not depend (at least directly) on the previous y values, and this is a limitation of the model. I would like to have a model in which Y_n depends also on Y_{n-1}, Y_{n-2}, and so on. For example, I might want to use an exponential moving average as a model. What is the most elegant way to do it in scikit-learn?

ADDED

As has been mentioned in the comments, I can extend Xs by adding ys. But this way has some limitations. For example, if I add the last 5 values of y as 5 new columns to X, the information about the time ordering of ys is lost. For example, there is no indication in X that the values in the 5th column follow the values in the 4th column, and so on. As a model, I might want to have a linear fit of the last five ys and use the found linear function to make a prediction. But if I have 5 values in 5 columns, it is not so trivial.
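(One way to keep the ordering information is to fit a small linear trend across the lag columns themselves. A minimal sketch of that idea; the helper names `lagged_matrix` and `linear_extrapolation` are mine, not from scikit-learn:)

```python
import numpy as np

def lagged_matrix(y, n_lags=5):
    """Build a matrix whose row for time t holds y[t-n_lags], ..., y[t-1],
    preserving time order within the row."""
    y = np.asarray(y, dtype=float)
    rows = [y[t - n_lags:t] for t in range(n_lags, len(y))]
    return np.array(rows)

def linear_extrapolation(lags):
    """Fit a straight line through the lagged values and extrapolate
    one step ahead."""
    t = np.arange(len(lags))
    slope, intercept = np.polyfit(t, lags, 1)
    return intercept + slope * len(lags)

y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
L = lagged_matrix(y, n_lags=5)
# each row keeps time order, so a per-row linear fit is straightforward
preds = [linear_extrapolation(row) for row in L]
```

(The extrapolated values could then be appended to Xs as an extra feature, so a plain linear model can still make use of the trend in the recent ys.)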

ADDED 2

To make my problem even more clear, I would like to give one concrete example. I would like to have a "linear" model in which y_n = c + k1*x1 + k2*x2 + k3*x3 + k4*EMOV_n, where EMOV_n is just an exponential moving average. How can I implement this simple model in scikit-learn?
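(A minimal sketch of exactly this model on synthetic data, using the modern pandas `ewm` API. The one-step shift of the EMA, so that EMOV_n only uses values up to y_{n-1}, and all variable names are my additions, not from the question:)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))                          # three exogenous factors
y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.05, size=n)

# exponential moving average of y, shifted one step so that EMOV_n
# only depends on y_{n-1} and earlier (no leakage of the target)
emov = pd.Series(y).ewm(com=2).mean().shift(1).to_numpy()

# drop the first row, where the shifted EMA is undefined
features = np.column_stack((X[1:], emov[1:]))
model = LinearRegression().fit(features, y[1:])
# model.coef_ now holds estimates of k1..k4, model.intercept_ of c
```

(This is the "extend Xs with a derived column" idea: the EMA is just one more feature, so the linear model learns its weight k4 alongside the factor weights.)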

sfjac
Roman
  • you can simply add `[np.nan] + ys[:-1]` as one of the factors – behzad.nouri Dec 30 '13 at 14:11
  • behzad.nouri, I have extended my question as a reply to your comment. – Roman Dec 30 '13 at 14:17
  • I think the best thing to do is to try several different simple models, including what @behzad.nouri suggested, and use cross-validation to see if any one model performs better than the others and gain an intuition. It might be that the exact time information is not as important as you think. I would use the extended Xs vector idea in a neural network, and see if that worked. In that approach you could sort of keep your time information, even though the n.n. would be kind of a black box. – cjohnson318 Dec 30 '13 at 14:28
  • Is your data hosted anywhere? Could I look at it? – cjohnson318 Dec 30 '13 at 14:29
  • Unfortunately the data are not hosted, and the above example is just an example (the real data have a more complicated structure). – Roman Dec 30 '13 at 15:33
  • It looks like you're trying to do an autoregressive model. You can use Statsmodels ARMA or ARIMA packages or add some points to your data set related to previous points in time. – Ivan Apr 29 '15 at 11:35
  • Hi, i would appreciate if anyone of you could help me with this. http://stackoverflow.com/questions/40350460/python-time-series-prediction-of-future-events-using-svr-module – Mahsolid Oct 31 '16 at 21:33

2 Answers


According to Wikipedia, EWMA works well with stationary data, but it does not work as expected in the presence of trends or seasonality. In those cases you should use a second- or third-order EWMA method, respectively. I decided to look at the pandas ewma function to see how it handled trends, and this is what I came up with:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# pandas.stats.moments.ewma was removed in later pandas versions;
# Series.ewm(...).mean() is the modern equivalent
def ewma(arr, span):
    return pd.Series(arr).ewm(span=span).mean().to_numpy()

# make a hat function, and add noise
x = np.linspace(0, 1, 100)
x = np.hstack((x, x[::-1]))
x += np.random.normal(loc=0, scale=0.1, size=200)
plt.plot(x, alpha=0.4, label='Raw')

# take EWMA in both directions with a smaller span term
fwd = ewma(x, span=15)             # take EWMA in fwd direction
bwd = ewma(x[::-1], span=15)       # take EWMA in bwd direction
c = np.vstack((fwd, bwd[::-1]))    # lump fwd and bwd together
c = np.mean(c, axis=0)             # average

# regular EWMA, with bias against trend
plt.plot(ewma(x, span=20), 'b', label='EWMA, span=20')

# "corrected" (?) EWMA
plt.plot(c, 'r', label='Reversed-Recombined')

plt.legend(loc=8)
plt.savefig('ewma_correction.png', format='png', dpi=100)

[Figure: the raw noisy hat function, the regular EWMA (span=20) lagging behind the trend, and the reversed-recombined EWMA tracking it closely]

As you can see, the EWMA bucks the trend uphill and downhill. We can correct for this (without having to implement a second-order scheme ourselves) by taking the EWMA in both directions and then averaging. I hope your data was stationary!
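(As a small numeric check of this lag-cancellation idea, which is my addition using the modern pandas `ewm` API: on a pure linear trend the forward EWMA lags and the backward EWMA leads by the same amount, so their average tracks the trend much more closely away from the edges.)

```python
import numpy as np
import pandas as pd

t = np.linspace(0, 1, 200)          # pure linear trend, no noise

def ewma(arr, span):
    return pd.Series(arr).ewm(span=span).mean().to_numpy()

fwd = ewma(t, span=15)              # lags behind the rising trend
bwd = ewma(t[::-1], span=15)[::-1]  # leads ahead by a similar amount
avg = (fwd + bwd) / 2.0             # the two biases largely cancel

# compare worst-case errors away from the edges
mid = slice(30, 170)
err_fwd = np.abs(fwd[mid] - t[mid]).max()
err_avg = np.abs(avg[mid] - t[mid]).max()
```

(Here err_avg comes out far smaller than err_fwd, which is the whole point of the reversed-recombined scheme.)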

cjohnson318
  • Thanks for this EWMA example Connor! But I'm curious why you don't call this a second order scheme? By offsetting and averaging two equally spaced first order schemes, you are creating a second order scheme. I know this is semantics, but I'm just pointing out that you don't get something for nothing :-) – user3654387 Nov 24 '14 at 19:18
  • I'm not sure if I called it the right thing or not. The second and third order Holt-Winters schemes that handle trends and seasonality have more complex mathematics than simply taking the average of both directions. In the future, I'd recommend using the HoltWinters() R function as it handles 2nd and 3rd order stuff as well as forecasting. Examples [here](http://www.statmethods.net/advstats/timeseries.html) – cjohnson318 Nov 24 '14 at 22:43
  • Yeah, I agree that Holt-Winters has some sophisticated capability for addressing some of the issues inherent in smoothing. Thanks for the link! I was simply calling attention to the fact that you elevated your scheme to second order by averaging the forward and backward versions. I actually see better results for this averaging than I do [using your Holt-Winters scheme](http://connor-johnson.com/2014/02/01/smoothing-with-exponentially-weighted-moving-averages/). Great blog posts, by the way! – user3654387 Nov 25 '14 at 00:45
  • It does not look like you predict anything here. You just try to fit a line into the data that you already know. – Salvador Dali Dec 26 '15 at 06:19

This might be what you're looking for, with regard to the exponentially weighted moving average:

import pandas, numpy
# pandas.stats.moments.ewma was removed in later pandas versions;
# Series.ewm(...).mean() is the modern equivalent
EMOV_n = pandas.Series(numpy.ravel(ys)).ewm(com=2).mean().to_numpy()

Here, com is a parameter that you can read about here. Then you can combine EMOV_n with Xs, appending it as an extra column, using something like:

Xs = numpy.column_stack((Xs, EMOV_n))

Then you can look at scikit-learn's various linear models, here, and do something like:

from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(Xs, ys)
print(clf.coef_)

Best of luck!

Charles
cjohnson318