
I have been using the scikits.statsmodels OLS predict function to forecast fitted data but would now like to shift to using Pandas.

The documentation refers to OLS as well as to a function called y_predict but I can't find any documentation on how to use it correctly.

By way of example:

exogenous = {
    "1998": "4760","1999": "5904","2000": "4504","2001": "9808","2002": "4241","2003": "4086","2004": "4687","2005": "7686","2006": "3740","2007": "3075","2008": "3753","2009": "4679","2010": "5468","2011": "7154","2012": "4292","2013": "4283","2014": "4595","2015": "9194","2016": "4221","2017": "4520"}
endogenous = {
    "1998": "691", "1999": "1580", "2000": "80", "2001": "1450", "2002": "555", "2003": "956", "2004": "877", "2005": "614", "2006": "468", "2007": "191"}

import numpy as np
from pandas import *

ols_test = ols(y=Series(endogenous), x=Series(exogenous))

However, while I can produce a fit:

>>> ols_test.y_fitted
1998     675.268299
1999     841.176837
2000     638.141913
2001    1407.354228
2002     600.000352
2003     577.521485
2004     664.681478
2005    1099.611292
2006     527.342854
2007     430.901264

Prediction produces nothing different:

>>> ols_test.y_predict
1998     675.268299
1999     841.176837
2000     638.141913
2001    1407.354228
2002     600.000352
2003     577.521485
2004     664.681478
2005    1099.611292
2006     527.342854
2007     430.901264

In scikits.statsmodels one would do the following:

import scikits.statsmodels.api as sm
...
ols_model = sm.OLS(endogenous, np.column_stack(exogenous))
ols_results = ols_model.fit()
ols_pred = ols_model.predict(np.column_stack(exog_prediction_values))

How do I do this in Pandas to forecast the endogenous data out to the limits of the exogenous?

UPDATE: Thanks to Chang, the new version of Pandas (0.7.3) now has this functionality as standard.

Turukawa
  • Hi, would you mind giving an example of how to use ols.predict? Say you have three independent variables, and thus three betas [b1, b2, b3]; now you want to use [x1, x2, x3] to predict a y. – tesla1060 Mar 24 '13 at 12:33

1 Answer


Is your issue how to get the predicted y values of your regression, or how to use the regression coefficients to get predicted y values for a different set of samples of the exogenous variables? In pandas, y_predict and y_fitted should give you the same in-sample values, and both should match what the predict method in scikits.statsmodels returns for the data used in the fit.

If you're looking for the regression coefficients, use ols_test.beta.
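To make the second case concrete, here is a minimal sketch in plain Python of fitting on the years that have endogenous data and then applying the coefficients to the remaining exogenous values. It is not the pandas or statsmodels API; the helper fit_simple_ols and the names fitted and forecast are hypothetical, and the data is the question's, written as numbers rather than strings.

```python
# Data copied from the question, written as numbers rather than strings.
exogenous = {
    "1998": 4760, "1999": 5904, "2000": 4504, "2001": 9808, "2002": 4241,
    "2003": 4086, "2004": 4687, "2005": 7686, "2006": 3740, "2007": 3075,
    "2008": 3753, "2009": 4679, "2010": 5468, "2011": 7154, "2012": 4292,
    "2013": 4283, "2014": 4595, "2015": 9194, "2016": 4221, "2017": 4520,
}
endogenous = {
    "1998": 691, "1999": 1580, "2000": 80, "2001": 1450, "2002": 555,
    "2003": 956, "2004": 877, "2005": 614, "2006": 468, "2007": 191,
}


def fit_simple_ols(xs, ys):
    """Hypothetical helper: return (slope, intercept) of the least-squares
    line y = intercept + slope * x for a single regressor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Fit only on the years for which the endogenous series is observed.
fit_years = sorted(endogenous)
slope, intercept = fit_simple_ols(
    [exogenous[year] for year in fit_years],
    [endogenous[year] for year in fit_years],
)

# In-sample fitted values (what y_fitted reports in the question).
fitted = {year: intercept + slope * exogenous[year] for year in fit_years}

# Out-of-sample forecast for the years that only have exogenous data.
forecast = {
    year: intercept + slope * exogenous[year]
    for year in sorted(exogenous)
    if year not in endogenous
}

for year, value in forecast.items():
    print(year, round(value, 2))
```

The in-sample values from this sketch should agree with the y_fitted output shown in the question, which is a quick way to check that the coefficients are being applied correctly before trusting the forecast years.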

Chang She
  • I would like predicted y values for 2008 to 2017, which I can get with scikits.statsmodels predict, but I have no idea how to get it with Pandas. – Turukawa Apr 01 '12 at 18:54
  • Gotcha. If you want to use the pandas ols function, you can do (ols_result.beta['x'] * exog_2008_2017).sum() + ols_result.beta['intercept'] for now. – Chang She Apr 07 '12 at 20:40
  • I've opened a GitHub issue about it here: https://github.com/pydata/pandas/issues/1008 to provide a function that replicates the statsmodels functionality – Chang She Apr 07 '12 at 21:03
  • Longer term we plan to move the pandas OLS code (which has NA handling and moving window capability) into statsmodels and provide a consistent interface – Wes McKinney Apr 08 '12 at 16:33
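For the several-regressor case raised in the comments, the same arithmetic is the sum of each coefficient times its regressor value, plus the intercept. A rough sketch in plain Python follows; predict_y is a hypothetical helper, not a pandas or statsmodels function, and the coefficient values are made up purely for illustration.

```python
def predict_y(betas, intercept, x):
    """Hypothetical helper: linear prediction from named coefficients.

    betas: mapping of regressor name to coefficient
    x: mapping of regressor name to the new observation's value
    """
    return sum(betas[name] * x[name] for name in betas) + intercept


# Made-up coefficients for three regressors, for illustration only.
betas = {"x1": 0.5, "x2": -1.0, "x3": 2.0}
intercept = 10.0

new_observation = {"x1": 4, "x2": 3, "x3": 1}
print(predict_y(betas, intercept, new_observation))
```

With a fitted pandas ols result, the coefficients would come from its beta attribute, as in the expression quoted in the comments above.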