Pandas Rolling OLS Bug with Version 0.12.0

Question

I have the following example data for performing a rolling OLS calculation (here I am doing it from the debugger):

(Pdb) rhs
['Yield']

(Pdb) lhs
'Returns'

(Pdb) min_periods
12

(Pdb) window
60

(Pdb) intercept
True

(Pdb) print df[rhs].to_string()
                 Yield
EndOfMonthDate        
2001-08-31      0.0561
2001-09-28      0.0360
2001-10-31      0.0500
2001-11-30      0.0500
2001-12-31      0.0500
2002-01-31      0.0191
2002-02-28      0.0563
2002-03-29      0.0557
2002-04-30      0.0600
2002-05-31      0.0569
2002-06-28      0.0571
2002-07-31      0.0450
2002-08-30      0.0416
2002-09-30      0.0360
2002-10-31      0.0395
2002-11-29      0.0422
2010-05-31      0.0323
2010-06-30      0.0311
2010-07-30      0.0300
2010-07-30      0.0300
2010-08-31      0.0251
2010-08-31      0.0251
2010-09-30      0.0250
2010-10-29      0.0271
2010-11-30      0.0287
2010-12-31      0.0347
2010-12-31      0.0347
2012-01-31      0.0201
2012-02-29      0.0197
2012-03-30      0.0220
2012-04-30      0.0199
2012-07-31      0.0141

(Pdb) print df[lhs].to_string()
2001-08-31        -0.005519
2001-09-28        -0.350356
2001-10-31        10.003698
2001-11-30         3.230476
2001-12-31        -3.776050
2002-01-31         9.153807
2002-02-28        -4.175085
2002-03-29        46.890701
2002-04-30       -15.747041
2002-05-31         2.797472
2002-06-28        -1.000851
2002-07-31       -13.398200
2002-08-30        -1.707745
2002-09-30         2.054250
2002-10-31         0.000620
2002-11-29        -9.790426
2010-05-31         0.000012
2010-06-30         0.000012
2010-07-30        -1.745182
2010-07-30        -0.000006
2010-08-31       -20.779633
2010-08-31         0.000000
2010-09-30        -0.000006
2010-10-29        -0.000012
2010-11-30        -0.000006
2010-12-31        30.165554
2010-12-31        -2.549851
2012-01-31        -6.892008
2012-02-29        -1.638216
2012-03-30         4.295588
2012-04-30        -7.094216
2012-07-31        -0.041252

When I attempt a rolling OLS:

(Pdb) pandas.ols(y=df[lhs], x=df[rhs], window=window, min_periods=min_periods, intercept=intercept)
*** TypeError: unsupported operand type(s) for +: 'slice' and 'int'

But if I just try a regular OLS over the whole data range, it seems fine:

(Pdb) pandas.ols(y=df[lhs], x=df[rhs], intercept=intercept)

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <Yield> + <intercept>

Number of Observations:         38
Number of Degrees of Freedom:   2

R-squared:         0.0226
Adj R-squared:    -0.0046

Rmse:             12.5182

F-stat (1, 36):     0.8321, p-value:     0.3677

Degrees of Freedom: model 1, resid 36

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
         Yield   146.6702   160.7874       0.91     0.3677  -168.4732   461.8135
     intercept    -4.6083     6.0652      -0.76     0.4523   -16.4961     7.2795
---------------------------------End of Summary---------------------------------

Is this a known bug with pandas.ols in the case of trying a rolling regression? The data is small and obviously has no defects that should prevent a rolling 12-to-60 observation regression from working in this case.

The full traceback I get when not looking in the debugger:

  File "properties.pyx", line 31, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:28841)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 656, in beta
    return DataFrame(self._beta_raw,
  File "properties.pyx", line 31, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:28841)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 775, in _beta_raw
    beta, indices, mask = self._rolling_ols_call
  File "properties.pyx", line 31, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:28841)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 789, in _rolling_ols_call
    return self._calc_betas(self._x_trans, self._y_trans)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 803, in _calc_betas
    cum_xx = self._cum_xx(x)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 865, in _cum_xx
    x_slice = slicer(x, date)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 856, in slicer
    return df.values[i:i + 1, :]
TypeError: unsupported operand type(s) for +: 'slice' and 'int'

Added

The offending code seems to be in this function from ols.py in pandas 0.12:

def _cum_xx(self, x):
    dates = self._index
    K = len(x.columns)
    valid = self._time_has_obs
    cum_xx = []

    slicer = lambda df, dt: df.truncate(dt, dt).values
    if not self._panel_model:
        _get_index = x.index.get_loc

        def slicer(df, dt):
            i = _get_index(dt)
            return df.values[i:i + 1, :]

    last = np.zeros((K, K))

    for i, date in enumerate(dates):
        if not valid[i]:
            cum_xx.append(last)
            continue

        x_slice = slicer(x, date)
        xx = last = last + np.dot(x_slice.T, x_slice)
        cum_xx.append(xx)

    return cum_xx

_get_index is a proxy for x.index.get_loc, whose documentation says it can return a slice object. But the nested slicer function above assumes that the value i obtained this way is an integer, so that i + 1 makes sense.
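To illustrate the mismatch in isolation, here is a minimal sketch in plain Python with no pandas involved. It first reproduces the same TypeError by adding an int to a slice, then shows one possible way a slicer could accept either kind of position. The names as_row_slice and the list-based slicer are hypothetical and only stand in for the NumPy indexing in ols.py; this is not how pandas itself handles the case, and other possible return types (such as a boolean mask) are deliberately left unhandled.

```python
import operator

# Reproduce the failure in isolation: a slice cannot be added to an int.
try:
    slice(18, 20) + 1
except TypeError as exc:
    message = str(exc)


def as_row_slice(loc):
    """Hypothetical helper: turn an integer or slice position into a slice."""
    if isinstance(loc, slice):
        # Already a range of rows (e.g. a repeated label); use it unchanged.
        return loc
    # operator.index raises TypeError for anything that is not integer-like.
    i = operator.index(loc)
    return slice(i, i + 1)


def slicer(rows, loc):
    """Stand-in for df.values[i:i + 1, :] using a list of rows."""
    return rows[as_row_slice(loc)]


# Three illustrative rows; the first two mimic a repeated date.
rows = [[0.0300], [0.0300], [0.0251]]
single = slicer(rows, 2)               # integer position -> one row
repeated = slicer(rows, slice(0, 2))   # slice position -> two rows
```

With a helper like this, an integer position still yields a single row, while a slice yields every row that shares the label. Whether summing over all of those rows is the right behaviour for the regression is a separate question.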

I found the source for get_loc. It turns out that x.index.get_loc is a proxy for x.index._engine.get_loc. In my case, the _engine_type of the relevant index at the time of the error is just ObjectEngine, which is defined in this source location, and get_loc is defined there as well:

cpdef get_loc(self, object val):
    if is_definitely_invalid_key(val):
        raise TypeError

    if self.over_size_threshold and self.is_monotonic:
        if not self.is_unique:
            return self._get_loc_duplicates(val)
        values = self._get_index_values()
        loc = _bin_search(values, val) # .searchsorted(val, side='left')
        if util.get_value_at(values, loc) != val:
            raise KeyError(val)
        return loc

    self._ensure_mapping_populated()
    if not self.unique:
        return self._get_loc_duplicates(val)

    self._check_type(val)

    try:
        return self.mapping.get_item(val)
    except TypeError:
        raise KeyError(val)

I'm looking into when and why get_loc returns a slice for me (there are definitely no duplicates in the index, which is what the docs suggest is the only way this can happen). In the meantime, any advice along these lines would be helpful.
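Since duplicates are the documented trigger, it may be worth double-checking the index itself. The sketch below uses only the standard library and a few of the dates copied from the printed output above; duplicated_labels is a hypothetical helper name. On a real DataFrame, inspecting df.index.is_unique would presumably be the more direct check.

```python
from collections import Counter


def duplicated_labels(labels):
    """Hypothetical helper: return the sorted labels that occur more than once."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# A subset of the EndOfMonthDate values as they appear in the printed data.
dates = [
    "2010-05-31", "2010-06-30", "2010-07-30", "2010-07-30",
    "2010-08-31", "2010-08-31", "2010-09-30", "2010-10-29",
    "2010-11-30", "2010-12-31", "2010-12-31",
]

repeats = duplicated_labels(dates)
print(repeats)
```

If a check like this reports repeated dates, that would be consistent with get_loc returning a slice rather than an integer.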

Could it be that your index is not numeric or so? – PlagTag, Oct 30 '17 at 20:13
