32

How can I perform stepwise regression in Python? There are methods for OLS in SciPy, but I am not able to do stepwise selection. Any help in this regard would be greatly appreciated. Thanks.

Edit: I am trying to build a linear regression model. I have 5 independent variables, and using forward stepwise regression I aim to select variables such that my model has the lowest p-value. The following link explains the objective:

https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk

Thanks again.

user2174063
  • 4
    `scikits.learn` has LARS/lasso, if that's of any use: http://scikit-learn.org/dev/modules/linear_model.html#lars-lasso – NPE Mar 15 '13 at 13:13
  • Can you elaborate on what sort of criteria you want to use for choice of predictive variables? And if you want an example, can you post or link to some sample data? – BKay Apr 03 '13 at 13:29
  • 1
    It's not advisable to base a model on p-values. They are more of a sanity check, and other criteria, such as AIC or BIC, are more suitable. – Max Candocia Jul 23 '15 at 22:10
  • 1
    Link seems to be broken: `We're sorry, the page you've requested could not be located. You can return to the Mihaylo Home Page or report an error to the Webmaster.` – vestland Aug 07 '18 at 06:37

7 Answers

13

You may try mlxtend, which has various feature selection methods.

from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform SFS (floating=True would make it SFFS)
sfs1 = sfs1.fit(X_train, y_train)
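
After fitting, the chosen column indices and the cross-validated score of the chosen subset are available on the fitted selector (attribute names as documented by mlxtend):

print(sfs1.k_feature_idx_)  # indices of the selected features
print(sfs1.k_score_)        # cross-validated score of the selected subset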
Regi Mathew
12

Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or select based on beta p-values with just a little more work.
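
Roughly, the approach in that post is forward selection by adjusted R-squared using statsmodels' formula API; a sketch is below (see the link for the authors' exact function, and substitute a p-value criterion if that is what you need):

import statsmodels.formula.api as smf

def forward_selected(data, response):
    # Forward-select predictors by adjusted R-squared; `data` is a pandas
    # DataFrame containing the response column and all candidate predictors.
    remaining = set(data.columns) - {response}
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response, " + ".join(selected + [candidate]))
            scores_with_candidates.append((smf.ols(formula, data).fit().rsquared_adj, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response, " + ".join(selected))
    return smf.ols(formula, data).fit()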

Aaron Schumacher
3

You can do forward-backward selection based on a statsmodels.api.OLS model, as shown in this answer.

However, this answer describes why you should not use stepwise selection for econometric models in the first place.
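
For reference, the selection loop in the linked answer looks roughly like this (a sketch, not a verbatim copy; threshold_in and threshold_out are the entry/exit p-value cutoffs, and threshold_in should not exceed threshold_out or the loop can cycle):

import pandas as pd
import statsmodels.api as sm

def stepwise_selection(X, y, threshold_in=0.01, threshold_out=0.05):
    # X is a pandas DataFrame of candidate predictors, y the response
    included = []
    while True:
        changed = False
        # forward step: add the candidate with the smallest p-value below threshold_in
        excluded = [c for c in X.columns if c not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        if not new_pvals.empty and new_pvals.min() < threshold_in:
            included.append(new_pvals.idxmin())
            changed = True
        # backward step: drop the included variable with the largest p-value above threshold_out
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # skip the intercept
        if not pvals.empty and pvals.max() > threshold_out:
            included.remove(pvals.idxmax())
            changed = True
        if not changed:
            break
    return included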

David Dale
  • I would just like to point out that data partitioning is supposed to address the problems of overfitting/data dredging raised in the article linked by David. One of the answers posted there is about data partitioning: https://stats.stackexchange.com/a/20860/48197. Having said that, the text Data Mining for Business Analytics (Wiley) discusses methods for data partitioning. In other words, stepwise should be okay as long as you don't use the results of the training model directly in production; you need to do something like k-fold testing against validation data to end up with a workable variable list. – thistleknot Oct 23 '18 at 18:29
2

I developed this repository: https://github.com/xinhe97/StepwiseSelectionOLS

My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn, so you can use them in a Pipeline and with GridSearchCV.

The essential part of my code is as follows:

################### Criteria ###################
# These are methods of the stepwise-selection class in the repository;
# they assume `import statsmodels.api as sm` and `import pandas as pd`.
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y,X[:,feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model":regr, "rsq_adj":rsq_adj, "bic":bic, "rsq":rsq, "predictors_index":feature_index}

################### Forward Stepwise ###################
def forward(self,predictors_index,X,y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                            if p not in predictors_index]

    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index+[p]
        new_predictors_index.sort()
        results.append(self.processSubset(X,y,new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq (use the commented line to select by lowest BIC instead)
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self,X_est,y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []

    M = min(fK,X_est.shape[1])

    for i in range(1,M+1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index,X_est,y_est)
        predictors_index = models_fwd.loc[i,'predictors_index']

    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(),'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(),'predictors_index']
    return best_model_fwd, best_predictors
HE Xin
  • While I appreciate your contribution, I cannot resist noting that model selection based solely on R² (as is done here?) is not a good idea. – Moritz Mar 28 '21 at 17:22
1

Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you to implement stepwise regression.

Matti Pastell
1
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm

"""X_opt variable has all the columns of independent variables of matrix X 
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]

"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

Using the summary method, you can check the p-values of your variables in the output, listed in the column 'P>|t|'. Find the variable with the highest p-value; suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.

X_opt = X[:, [0, 1, 3, 4]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

Repeat these steps until you have removed all the columns whose p-value is higher than the significance level (e.g. 0.05). In the end, X_opt will contain only the variables with p-values below the significance level.
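
If you want to automate the remove-and-refit loop instead of doing it by hand, a minimal sketch (assuming X is a NumPy array that already contains the column of ones for the intercept) could look like this:

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, significance_level=0.05):
    # Repeatedly drop the column with the largest p-value until every
    # remaining column is below the significance level, then return both
    # the reduced matrix and the final fitted model.
    X_opt = X.copy()
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
        pvalues = regressor_OLS.pvalues
        if pvalues.max() <= significance_level:
            return X_opt, regressor_OLS
        worst = int(np.argmax(pvalues))          # position of the least significant column
        X_opt = np.delete(X_opt, worst, axis=1)  # drop it and refit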

Adalcar
Varun-08
0

Here's a method I just wrote that uses "mixed selection" as described in Introduction to Statistical Learning. As input, it takes:

  • lm, a fitted statsmodels OLS model, sm.OLS(Y, X).fit(), where X is an array of n ones (n being the number of data points) and Y is the response in the training data

  • curr_preds, a list containing just ['const']

  • potential_preds, a list of all potential predictors. There also needs to be a pandas DataFrame X_mix that holds all of the data, including the 'const' column and the columns corresponding to the potential predictors

  • tol, optional: the maximum p-value, 0.05 if not specified

# assumes `import statsmodels.api as sm`, with `y` and `X_mix` defined as described above
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
  while (len(potential_preds) > 0):
    index_best = -1 # this will record the index of the best predictor
    curr = -1 # this will record current index
    best_r_squared = lm.rsquared_adj # record the adjusted r-squared of the current model
    # loop to determine if any of the predictors can improve the adjusted r-squared
    for pred in potential_preds:
      curr += 1 # increment current
      preds = curr_preds.copy() # grab the current predictors
      preds.append(pred)
      lm_new = sm.OLS(y, X_mix[preds]).fit() # create a model with the current predictors plus an additional potential predictor
      new_r_sq = lm_new.rsquared_adj # record r squared for new model
      if new_r_sq > best_r_squared:
        best_r_squared = new_r_sq
        index_best = curr

    if index_best != -1: # a potential predictor improved the adjusted r-squared; remove it from potential_preds and add it to curr_preds
      curr_preds.append(potential_preds.pop(index_best))
    else: # none of the remaining potential predictors improved the adjusted r-squared; exit loop
      break

    # fit a new lm using the new predictors, look at the p-values
    pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
    pval_too_big = []
    # make a list of all the p-values that are greater than the tolerance 
    for feat in pvals.index:
      if(pvals[feat] > tol and feat != 'const'): # if the pvalue is too large, add it to the list of big pvalues
        pval_too_big.append(feat)

    # now remove all the features from curr_preds that have a p-value that is too large
    for feat in pval_too_big:
      pop_index = curr_preds.index(feat)
      curr_preds.pop(pop_index)

    # refit on the updated predictors so the next iteration compares against the current model
    lm = sm.OLS(y, X_mix[curr_preds]).fit()
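
A hypothetical usage sketch (df and y_train are placeholder names for your predictor DataFrame and training response; the function updates curr_preds in place):

import statsmodels.api as sm

X_mix = sm.add_constant(df)                 # adds the 'const' column
y = y_train                                 # response from the training data
lm = sm.OLS(y, X_mix[['const']]).fit()      # intercept-only starting model

curr_preds = ['const']
potential_preds = [c for c in X_mix.columns if c != 'const']

mixed_selection(lm, curr_preds, potential_preds, tol=0.05)
print(curr_preds)                           # the selected predictors, including 'const'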
Jacob Helwig