statsmodels linear regression - patsy formula to include all predictors in model

Question

Say I have a dataframe (let's call it DF) where y is the dependent variable and x1, x2, x3 are my independent variables. In R I can fit a linear model using the following code, and the . will include all of my independent variables in the model:

# R code for fitting linear model
result = lm(y ~ ., data=DF)

I can't figure out how to do this with statsmodels using patsy formulas without explicitly adding all of my independent variables to the formula. Does patsy have an equivalent to R's .? I haven't had any luck finding it in the patsy documentation.

score 36 · Answer 1 · answered Mar 13 '14 at 19:15

I haven't found an equivalent of . in the patsy documentation either. But what patsy lacks in conciseness, Python makes up for with strong string manipulation. So you can build a formula involving all predictor columns in DF using Index.difference (the older DF.columns - ["y"] spelling no longer works in current pandas):

all_columns = "+".join(DF.columns.difference(["y"]))

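This gives x1+x2+x3 in your case. Finally, you can create a formula string using y and pass it to any fitting procedure that accepts a formula, for example statsmodels' ols:

import statsmodels.formula.api as smf

my_formula = "y~" + all_columns
result = smf.ols(formula=my_formula, data=DF).fit()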
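
For readers who want to try this end to end, here is a minimal sketch on synthetic data. The data, the coefficients and the choice of a list comprehension are illustrative assumptions, not part of the original answer. The comprehension keeps the DataFrame's column order, whereas Index.difference returns the names sorted.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data purely for illustration
rng = np.random.default_rng(0)
DF = pd.DataFrame(rng.normal(size=(50, 3)), columns=["x1", "x2", "x3"])
DF["y"] = 1.0 + 2.0 * DF["x1"] - DF["x3"] + rng.normal(scale=0.1, size=50)

# Every column except the response, in the DataFrame's own order
predictors = [c for c in DF.columns if c != "y"]
my_formula = "y ~ " + " + ".join(predictors)

# The formula interface adds the intercept term automatically
result = smf.ols(formula=my_formula, data=DF).fit()
print(my_formula)
print(result.params)
```

Note that this only works directly when the column names are valid Python identifiers, since patsy parses them as part of the formula.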

Sudeep Juvekar


I'm currently using this approach - it is certainly nice to be able to do string manipulation in Python! I just wanted to make sure I wasn't overlooking something in patsy. – Greg Mar 13 '14 at 19:21
Yeah, and I guessed `.` is not available precisely because of string manipulation. Another example of dynamically constructing a formula: pick all variables starting with `x` and not starting with `z`. Messy to do in R, but simple in Python. But again, if anyone knows patsy better, I'd love to find alternatives. – Sudeep Juvekar Mar 13 '14 at 19:24
@SudeepJuvekar I don't think that explains the missing dot. R also has strong and easy string manipulation (using paste, paste0) and still provides a smart formula notation. So it is just a feature that is missing from patsy or not yet implemented. – agstudy Mar 13 '14 at 19:45
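
The prefix-based selection mentioned in the comments above could look roughly like the sketch below. The helper name formula_from_prefix, its parameters and the column names are all hypothetical, chosen only for illustration; in practice the list would come from list(DF.columns).

```python
# Hypothetical column names standing in for list(DF.columns)
columns = ["y", "x1", "x2", "x_age", "z1", "z_score", "id"]


def formula_from_prefix(columns, response, include="", exclude=None):
    """Build 'response ~ a + b + ...' from columns selected by name prefix.

    include: keep only columns starting with this prefix ("" keeps all).
    exclude: drop columns starting with this prefix (None drops nothing).
    """
    predictors = [
        c
        for c in columns
        if c != response
        and c.startswith(include)
        and not (exclude is not None and c.startswith(exclude))
    ]
    return response + " ~ " + " + ".join(predictors)


print(formula_from_prefix(columns, "y", include="x"))  # y ~ x1 + x2 + x_age
print(formula_from_prefix(columns, "y", exclude="z"))  # y ~ x1 + x2 + x_age + id
```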
The only reason it's not implemented in patsy is that neither I nor anyone else has found the time to do it yet :-) There was one partial attempt here, from which the discussion is probably useful if someone else wants to have a go: https://github.com/pydata/patsy/pull/28 – Nathaniel J. Smith Mar 14 '14 at 02:20
That's all peachy, but what do you do when you get a `Maximum Recursion Depth Error` for your constructed formula? I have far more dimensions than samples (p>>N). Also, when I try `lm(formula="Y~.", data=df)`, the error is that the dot isn't understood. Any inspiration on those? – bmc Apr 27 '18 at 13:53
`DF.columns - ["y"]` gave me an error. The syntax which worked for me was `DF.columns.difference(["y"])` – Jean Paul Jun 25 '18 at 15:47
@JeanPaul's reply worked for me instead of `DF.columns - ["y"]` – AwsAnurag Oct 08 '22 at 10:34

jseabold · Accepted Answer · 2014-03-14T15:19:48.770

11

No, this doesn't exist in patsy yet, unfortunately. See this issue.

edited Mar 14 '14 at 15:19

answered Mar 13 '14 at 22:20


score 7 · Answer 3 · answered Jul 01 '17 at 23:09

As this is still not included in patsy, I wrote a small function that I call when I need to run statsmodels models with all columns (optionally with exceptions):

def ols_formula(df, dependent_var, *excluded_cols):
    '''
    Generates the R style formula for statsmodels (patsy) given
    the dataframe, dependent variable and optional excluded columns
    as strings
    '''
    df_columns = list(df.columns.values)
    df_columns.remove(dependent_var)
    for col in excluded_cols:
        df_columns.remove(col)
    return dependent_var + ' ~ ' + ' + '.join(df_columns)

For example, for a dataframe called df with columns y, x1, x2, x3, running ols_formula(df, 'y', 'x3') returns 'y ~ x1 + x2'

Could you please provide an example with several columns to exclude? — NuValue, Feb 13 '19 at 13:12
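
Since excluded_cols is a variadic argument, several columns can be excluded by passing each one as an extra positional string. A minimal sketch follows, repeating the function from the answer so the block runs on its own. The DataFrame and its column names are made up for illustration.

```python
import pandas as pd


def ols_formula(df, dependent_var, *excluded_cols):
    """Same function as in the answer above, repeated for self-containment."""
    df_columns = list(df.columns.values)
    df_columns.remove(dependent_var)
    for col in excluded_cols:
        df_columns.remove(col)
    return dependent_var + ' ~ ' + ' + '.join(df_columns)


# Hypothetical dataframe; only the column names matter here
df = pd.DataFrame(columns=['y', 'x1', 'x2', 'x3', 'x4'])

# Exclude two columns by listing them after the dependent variable
print(ols_formula(df, 'y', 'x2', 'x3'))  # y ~ x1 + x4

# A list of names can be unpacked with *
to_drop = ['x1', 'x2', 'x3']
print(ols_formula(df, 'y', *to_drop))  # y ~ x4
```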
