In R, a multiple linear regression with a lagged regressor can be run like this:

temp = lm(log(volume_1[11:62])~log(price_1[11:62])+log(volume_1[10:61]))

In Python, statsmodels accepts R-style formulas for multiple linear regression, so I expected the code below to work the same way:

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

rando = lambda x: np.random.randint(low=1, high=100, size=x)

df = pd.DataFrame(data={'volume_1': rando(62), 'price_1': rando(62)})

temp = smf.ols(formula='np.log(volume_1)[11:62] ~ np.log(price_1)[11:62] + np.log(volume_1)[10:61]', 
               data=df) 
# np.log(volume_1)[10:61] is meant to express the lagged volume

but I get the error

PatsyError: Number of rows mismatch between data argument and volume_1[11:62] (62 versus 51)
volume_1[11:62] ~ price_1[11:62] + volume_1[10:61]

I guess it is not possible to regress on only part of the rows of each column this way, because data=df has 62 rows while the sliced terms have 51 rows.
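The row counts in the error can be checked directly. The sketch below only measures lengths, using a seeded random frame that stands in for the real data (the seed and the variable names `n_data`, `n_term`, `n_lag` are made up for illustration). Note also that Python slices exclude the end point, so `[11:62]` holds 51 rows, whereas R's `11:62` is inclusive and holds 52.

```python
import numpy as np
import pandas as pd

# Stand-in data; the seed is only there to make the sketch repeatable
rng = np.random.default_rng(0)
df = pd.DataFrame({'volume_1': rng.integers(1, 100, size=62),
                   'price_1': rng.integers(1, 100, size=62)})

n_data = len(df)                                   # rows in the data argument
n_term = len(np.log(df['volume_1']).iloc[11:62])   # rows in the sliced response
n_lag = len(np.log(df['volume_1']).iloc[10:61])    # rows in the sliced lag term

print(n_data, n_term, n_lag)  # 62 51 51
```

Each formula term ends up with 51 rows while the frame itself has 62, which is the mismatch the error message reports.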

Is there a convenient way to run this kind of regression the way R does?

df is a pandas DataFrame with the columns volume_1 and price_1.

jtweeder
  • It appears that the error is coming from [Patsy](https://patsy.readthedocs.io/en/latest/R-comparison.html), which provides the R-like formula syntax in Python. If you were using the same subset of rows for each term, it would be easy to just pass the same slice of df, but that is not the case in your example. – jtweeder Oct 25 '18 at 15:26

1 Answer

Based on an example from a GitHub issue in the patsy repository, a small lag helper that is called inside the formula gets the lag column to work correctly.

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

rando = lambda x: np.random.randint(low=1, high=100, size=x)

df = pd.DataFrame(data={'volume_1': rando(62), 'price_1': rando(62)})

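def lag(x, n):
    """Shift x forward by n rows; the first n entries become NaN."""
    if n == 0:
        return x
    if isinstance(x, pd.Series):
        return x.shift(n)

    # Non-pandas input (list, ndarray): work on a float copy so NaN can be stored
    x = np.array(x, dtype=float)
    x[n:] = x[0:-n].copy()
    x[:n] = np.nan
    return x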

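# lag() is applied to the rows passed in `data`, so the first row of the slice
# has a NaN lag and is dropped by default when the model is built.
# .fit() estimates the coefficients, as R's lm() does.
temp = smf.ols(formula='np.log(volume_1) ~ np.log(price_1) + np.log(lag(volume_1,1))',
               data=df[11:62]).fit()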
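If the lag should come from the original neighbouring row rather than from within the slice, one option is to build the lagged column on the full frame first and slice afterwards. This is only a sketch under that assumption; the column names `log_volume`, `log_price` and `log_volume_lag1`, the `work` and `subset` frames, and the seeded generator are made up for illustration and are not part of the patsy example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data; the seed is only there to make the sketch repeatable
rng = np.random.default_rng(0)
df = pd.DataFrame({'volume_1': rng.integers(1, 100, size=62),
                   'price_1': rng.integers(1, 100, size=62)})

# Build the log and lagged columns on the FULL frame, before any slicing
work = pd.DataFrame({
    'log_volume': np.log(df['volume_1']),
    'log_price': np.log(df['price_1']),
})
work['log_volume_lag1'] = work['log_volume'].shift(1)

# Slice afterwards: row 11 still carries the value of row 10 as its lag
subset = work.iloc[11:62]

fit = smf.ols('log_volume ~ log_price + log_volume_lag1', data=subset).fit()
print(fit.params)
```

Because the lag is computed before slicing, no row of `subset` has a missing lag, and all 51 rows of the slice enter the fit.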
jtweeder
  • Why not use the .shift() method on the dataframe? – Evgeny Oct 25 '18 at 15:50
  • `temp = smf.ols(formula='np.log(volume_1) ~ np.log(price_1) + np.log(volume_1.shift())', data=df[11:62])` also works, but it causes issues when `data` is not provided as a dataframe. Data can be any dict-like object holding the variables in the formula, so `{'volume_1': [values], 'price_1': [values]}` could be used as the data argument and would fail if only `.shift()` was used. – jtweeder Oct 25 '18 at 16:44
  • ... and the reason you are trying to handle arbitrary types is the behaviour of the R code? – Evgeny Oct 25 '18 at 17:44
  • Yes... the R [lm function](https://www.rdocumentation.org/packages/stats/versions/3.5.1/topics/lm) allows multiple different datatypes to be passed to the `data` argument. So using the lag function as above allows `smf.ols` to be used just like in R. – jtweeder Oct 25 '18 at 18:11
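To illustrate the dict-like case discussed in the comments, here is a minimal sketch that passes plain Python lists as `data`. The `lag` variant below is a NaN-padding rewrite written for this illustration (an assumption, not the exact patsy example), and it assumes `0 <= n <= len(x)`. The seeded generator and the names `data` and `fit` are likewise made up.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def lag(x, n=1):
    """Shift x forward by n positions, padding the start with NaN."""
    if isinstance(x, pd.Series):
        return x.shift(n)
    arr = np.asarray(x, dtype=float)      # accepts lists and ndarrays
    out = np.full(arr.shape, np.nan)      # fresh array, so the input is not modified
    out[n:] = arr[:len(arr) - n]
    return out

# Stand-in data as a dict of plain lists: lists have no .shift() method
rng = np.random.default_rng(0)
data = {'volume_1': rng.integers(1, 100, size=62).tolist(),
        'price_1': rng.integers(1, 100, size=62).tolist()}

# The first row has a NaN lag and is dropped by default when the model is built
fit = smf.ols('np.log(volume_1) ~ np.log(price_1) + np.log(lag(volume_1, 1))',
              data=data).fit()
print(fit.params)
```

With list input, `volume_1.shift()` inside the formula would raise an `AttributeError`, while the helper falls back to the NumPy branch.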