Issue with Augmented Dickey-Fuller test in Python with small number of observations

Question

I want to test for stationarity on a time series (nobs = 23) and implemented the adfuller test from statsmodels.tsa.stattools.

Here are the original data:

1995-01-01      3126.0
1996-01-01      3321.0
1997-01-01      3514.0
1998-01-01      3690.0
1999-01-01      3906.0
2000-01-01      4065.0
2001-01-01      4287.0
2002-01-01      4409.0
2003-01-01      4641.0
2004-01-01      4812.0
2005-01-01      4901.0
2006-01-01      5028.0
2007-01-01      5035.0
2008-01-01      5083.0
2009-01-01      5183.0
2010-01-01      5377.0
2011-01-01      5428.0
2012-01-01      5601.0
2013-01-01      5705.0
2014-01-01      5895.0
2015-01-01      6234.0
2016-01-01      6542.0
2017-01-01      6839.0

Here’s the custom ADF function I’m using (credit goes to this blog):

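import pandas as pd
from statsmodels.tsa.stattools import adfuller

def test_stationarity(timeseries):
    """Run the ADF test and print the statistic, p-value, lags, and critical values."""
    print('Results of Dickey-Fuller Test:')
    dftest = adfuller(timeseries, autolag='AIC', maxlag=None)
    dfoutput = pd.Series(dftest[0:4], index=['ADF Statistic', 'p-value', '#Lags Used', 'Number of Obs Used'])
    # dftest[4] maps significance levels ('1%', '5%', '10%') to critical values
    for key, value in dftest[4].items():
        dfoutput['Critical Value (%s)' % key] = value
    print(dfoutput)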

Here are the results of the ADF test on the original data:

ADF Statistic           -0.126550
p-value                  0.946729
#Lags Used               8.000000
Number of Obs Used      14.000000
Critical Value (1%)     -4.012034
Critical Value (5%)     -3.104184
Critical Value (10%)    -2.690987

The ADF statistic is larger (less negative) than all of the critical values and the p-value exceeds alpha = 0.05, so the unit-root null cannot be rejected and the series appears non-stationary. I therefore take a first difference of the data. Here are the differencing function and the results of the ADF test on the differenced series:

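def difference(dataset):
    """Return the first difference of dataset as a pandas Series."""
    diff = []
    for i in range(1, len(dataset)):
        # change from the previous observation to the current one
        value = dataset[i] - dataset[i - 1]
        diff.append(value)
    return pd.Series(diff)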


ADF Statistic           -1.169799
p-value                  0.686451
#Lags Used               9.000000
Number of Obs Used      12.000000
Critical Value (1%)     -4.137829
Critical Value (5%)     -3.154972
Critical Value (10%)    -2.714477

The ADF statistic and p-value both improve, but the unit-root null is still not rejected, so I difference a second time. Here are the results:

ADF Statistic           -0.000000
p-value                  0.958532
#Lags Used               9.000000
Number of Obs Used      11.000000
Critical Value (1%)     -4.223238
Critical Value (5%)     -3.189369
Critical Value (10%)    -2.729839

After the second differencing, the ADF test statistic becomes -0.0000 and the p-value is now worse than it was in the beginning. The statistic is puzzling: a print() of the unrounded value returns -0.0, yet the negative sign suggests there is a non-zero digit somewhere. I also receive this warning:

RuntimeWarning: divide by zero encountered in double_scalars
  return np.dot(wresid, wresid) / self.df_resid.

A grid search over the p, d, q values returns an ARIMA(1, 1, 0) model, but I assumed a second differencing would still be necessary since the first differencing did not achieve stationarity.

I suspect the strange test statistic and p-value are due to the small sample size and the high number of lags used by the ADF test’s default setting (maxlag = None). I understand that when maxlag is None, the maximum lag is computed as int(np.ceil(12. * np.power(nobs/100., 1/4.))).

Is this appropriate? If not, is there a workaround for data sets with a small number of observations, or a rule of thumb for manually setting the maxlag value in the ADF function, to avoid what appears to be an erroneous test statistic? I searched here, here, and here but couldn’t find a solution.

I’m using statsmodels version 0.8.0.

Hi DummieCoder - i'll try to help you later, but i'd recommend also posting on quant stackexchange: https://quant.stackexchange.com/ — rafaelc, Jul 11 '18 at 20:08
Thanks Rafael! I assume you're recommending to post on quant.stackexchange since the question is about time series, which is an important part of financial forecasting? I just want to make sure it's an acceptable practice to post the same question on multiple forums. How will it work if someone answer the question on one site and not the other? — DummieCoder, Jul 12 '18 at 21:02

score 0 · Answer 1 · answered May 14 '21 at 15:02

The issue you are seeing is that the maximum lag length is too high. First, your data has a strong trend, so you should initially include trend="ct". This improves the test statistic, but not enough to reject the unit-root null. When you difference, the differenced data has a non-zero mean, so the trend should be "c". This still does not reject, so a double difference is needed. The double difference is probably needed because the series is persistent, but also because ADF tests have low power.

You should set the maximum lags to be no larger than the square root of the sample size. What is happening here is that too many lags are being used, which reduces the effective sample size so much that the model fit is near perfect. This produces a spuriously high number of lags being chosen.
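The near-perfect fit can be checked with some simple counting. The sketch below is only an illustration: it evaluates the default cap formula quoted in the question and counts rows and parameters in an ADF regression with a constant. The helper names `default_maxlag` and `residual_dof` are hypothetical and not part of any library, and other statsmodels versions may choose the default differently.

```python
import math


def default_maxlag(nobs):
    """Default lag cap quoted in the question: ceil(12 * (nobs / 100) ** (1 / 4))."""
    return int(math.ceil(12.0 * (nobs / 100.0) ** 0.25))


def residual_dof(nobs, lags, n_deterministic=1):
    """Residual degrees of freedom of an ADF regression (hypothetical helper).

    The regression explains diff(y)_t with y_{t-1}, `lags` lagged differences,
    and `n_deterministic` deterministic terms (1 for a constant only).
    """
    rows = nobs - lags - 1              # one row lost to differencing, `lags` to lagging
    params = 1 + lags + n_deterministic
    return rows - params


for label, nobs in [("levels", 23), ("first difference", 22), ("second difference", 21)]:
    cap = default_maxlag(nobs)
    print(f"{label}: nobs={nobs}, default cap={cap}, "
          f"residual dof at cap={residual_dof(nobs, cap)}")
```

For the twice-differenced series (21 observations), 9 lags plus a constant leave 11 usable rows for 11 parameters, so no residual degrees of freedom remain. That would be consistent with the 11 observations reported in the question and with the divide-by-zero warning raised when the residual variance is computed.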

from arch.unitroot import ADF
import pandas as pd
import numpy as np

y = [3126.0, 3321.0, 3514.0, 3690.0, 3906.0, 4065.0, 4287.0, 
     4409.0, 4641.0, 4812.0, 4901.0, 5028.0, 5035.0, 5083.0,
     5183.0, 5377.0, 5428.0, 5601.0, 5705.0, 5895.0, 6234.0,
     6542.0, 6839.0]
y = pd.Series(y)

max_lags = int(np.sqrt(y.shape[0]))
print(f"max_lags: {max_lags}")
ADF(y, trend="ct", max_lags=max_lags).summary()

This outputs

max_lags: 4

   Augmented Dickey-Fuller Results
=====================================
Test Statistic                 -2.009
P-value                         0.596
Lags                                2
-------------------------------------

Trend: Constant and Linear Time Trend
Critical Values: -4.50 (1%), -3.66 (5%), -3.27 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.

Next, the first difference,

ADF(y.diff().dropna(), trend="c", max_lags=max_lags).summary()

which returns

   Augmented Dickey-Fuller Results
=====================================
Test Statistic                 -2.224
P-value                         0.198
Lags                                0
-------------------------------------

Trend: Constant
Critical Values: -3.79 (1%), -3.01 (5%), -2.65 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.

The null is not rejected. Differencing one more time, this time with trend="n", finally rejects the unit-root null decisively.

ADF(y.diff().diff().dropna(), trend="n", max_lags=max_lags).summary()

   Augmented Dickey-Fuller Results
=====================================
Test Statistic                 -7.346
P-value                         0.000
Lags                                0
-------------------------------------

Trend: No Trend
Critical Values: -2.69 (1%), -1.96 (5%), -1.61 (10%)
Null Hypothesis: The process contains a unit root.
Alternative Hypothesis: The process is weakly stationary.

The challenge is that one cannot rely completely on an ADF test when the time series is short. The first difference, for example, does not look especially non-stationary, even though the test fails to reject the unit root.
