I have the following lm function in R:

in_data <- c(0.5, 0.6, 0.7)
minutes <- c(30, 60, 90)
foobar <- lm(log(in_data) ~ 0 + minutes)

Questions

  • I understand that the ~ operator separates the left- and right-hand sides of a model formula. So in this case, does it mean that log(in_data) depends on 0 and minutes? I'm totally lost here, especially on how the log of a vector can depend on 0 and another vector.
  • If I were to port this to Pandas, what would be the most straightforward way? I tried something along the lines of:

import statsmodels.formula.api as sm
import numpy as np
result = sm.ols(formula="np.log(in_data) ~ 0 + minutes", data=model_data).fit()

But that threw an error:

patsy.PatsyError: Number of rows mismatch between data argument and np.log(in_data) (1 versus 4)
    np.log(in_data) ~ 0 + minutes
    ^^^^^^^^^^^^^^^^^
Craig
  • Including 0 in the formula suppresses the intercept. Read more here: https://stats.stackexchange.com/questions/174298/what-does-the-formula-y-x-0-in-r-actually-calculate – dmi3kno Sep 06 '17 at 20:22
  • Thanks @dmi3kno - but even then how would a log of a vector depend on another vector? Isn't the stuff on the left side of the `~` independent? – Craig Sep 06 '17 at 20:24

1 Answer

A multiple linear regression equation has the form y = b_0 + b_1*x_1 + b_2*x_2 + ... + b_k*x_k, where b_0 is the intercept (the constant term). You can exclude this constant from the model by using 0 + in R. Another way of doing that is to use - 1, which works in both R and patsy. So you can write your call as:

result = sm.ols(formula="np.log(in_data) ~ minutes - 1", data=model_data).fit()
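
To make concrete what a fit without an intercept computes, here is a minimal NumPy sketch using the data from the question. It assumes the model log(in_data) = b * minutes, for which the least-squares slope has the closed form sum(x*y) / sum(x*x). The variable names below are illustrative and not part of the original post.

```python
import numpy as np

# Data from the question.
in_data = np.array([0.5, 0.6, 0.7])
minutes = np.array([30.0, 60.0, 90.0])

y = np.log(in_data)  # left-hand side of the formula (the response)
x = minutes          # right-hand side (the single predictor)

# Closed-form least-squares slope for a line forced through the origin.
slope_closed_form = np.sum(x * y) / np.sum(x * x)

# The same fit via least squares on a one-column design matrix.
# There is no column of ones, which is what "0 +" or "- 1" expresses.
slope_lstsq = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)[0][0]

print(slope_closed_form, slope_lstsq)
```

Both numbers should agree up to floating-point error. In this picture the left-hand side of `~` is the response and the right-hand side holds the predictors.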
ayhan
  • Thanks @ayhan but I'm still getting the same error: `sm.ols(formula="np.log(in_data) ~ minutes - 1", data=model_data).fit()`, resulting in ` patsy.PatsyError: Number of rows mismatch between data argument and np.log(in_data) (1 versus 4) np.log(in_data) ~ minutes - 1 ^^^^^^^^^^^^^^^^^ ` – Craig Sep 06 '17 at 20:27
  • @Craig Could you tell us the respective dimensions? – Marvin Taschenberger Sep 06 '17 at 20:30
  • @Craig Sorry I thought the error came from the use of `+ 0`. Can you include a reproducible sample for the Python example as well? Because with your R example I am able to produce correct results (I used `model_data = pd.DataFrame({'in_data': in_data, 'minutes': minutes})`) where `in_data = (0.5, 0.6, 0.7)` and `minutes = (30, 60, 90)`. – ayhan Sep 06 '17 at 20:30
  • Thanks @ayhan - the problem was in the way I had constructed the dataframe. Looks like it's working now. One more dumb question please - I see the correct value of the coefficient when I print result.summary(), how would I extract that value and assign it to a variable? – Craig Sep 06 '17 at 20:38
  • @MarvinTaschenberger it was a dumb user error - I had constructed the frame wrongly – Craig Sep 06 '17 at 20:39
  • @Craig You need to access the `params` attribute. `result.params` returns a Series where the keys are the names of the variables and the values are the coefficients. Since you only have one variable, this will be a one-element Series. You can then take the value with `result.params['minutes']`. You might want to check `dir(result)` to see what other attributes that object has. – ayhan Sep 06 '17 at 20:41
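
Putting the thread together, the following is a sketch of an end-to-end Python version, assuming the DataFrame is built the way ayhan describes in the comments (one row per observation, one column per variable). The alias `smf` and the name `coef` are choices made here for illustration. The asker's actual `model_data` was never shown, so this is not a reconstruction of it.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# One row per observation; the column names must match the names in the formula.
model_data = pd.DataFrame({
    "in_data": [0.5, 0.6, 0.7],
    "minutes": [30, 60, 90],
})

# "0 +" (or equivalently "- 1") drops the intercept from the model.
result = smf.ols(formula="np.log(in_data) ~ 0 + minutes", data=model_data).fit()

# result.params is a pandas Series indexed by term name.
coef = result.params["minutes"]
print(coef)
```

If patsy reports a row-count mismatch like the one in the question, comparing `model_data.shape` with the expected number of observations is a quick first check.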