
Can someone show me how to perform best subset GLM Poisson regression using Pipeline and GridSearchCV? Specifically, I do not know which scikit-learn function does best subset selection, or how to embed it into a Pipeline and GridSearchCV.

In addition, how do I include interaction terms among the features, to be selected by the best subset algorithm? And how do I embed this in the Pipeline and GridSearchCV?

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import PoissonRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

### continuous_col / discrete_col are the lists of continuous / discrete column names
continuous_transformer = Pipeline(steps=[('std_scaler', StandardScaler())])
discrete_transformer = Pipeline(steps=[('encoder', OneHotEncoder(drop='first'))])
preprocessor = ColumnTransformer(transformers=[('continuous', continuous_transformer, continuous_col),
                                               ('discrete', discrete_transformer, discrete_col)],
                                 remainder='passthrough')

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('glm_model', PoissonRegressor(alpha=0, fit_intercept=True))])

param_grid = {  ??? different combinations of features ????  }

gs_en_cv = GridSearchCV(pipeline,
                        param_grid=param_grid,
                        cv=KFold(n_splits=10, shuffle=True, random_state=123),
                        scoring='neg_root_mean_squared_error',
                        n_jobs=-1,
                        return_train_score=True)

1 Answer


Currently, as far as I understand, sklearn has no "brute force" / exhaustive feature search for best subset. However, there are various feature selection classes (see https://scikit-learn.org/stable/modules/feature_selection.html), e.g. RFE, SelectKBest, and SelectFromModel; these are used below.

Now, pipelining for this can be tricky. When you stack classes/methods in a pipeline and call .fit(), every step before the final one has to expose .transform(): each step's .transform() output becomes the input to the next step. The last step can be any valid model, but all previous steps must expose .transform() in order to chain one to another. So depending on which feature selection approach you pick, your code will differ. See below.
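
A minimal sketch of that rule, using classes that appear later in this answer:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import PoissonRegressor

pipe = Pipeline(steps=[('scale', StandardScaler()),    ### exposes .transform()
                       ('select', SelectKBest(k=5)),   ### exposes .transform()
                       ('model', PoissonRegressor())]) ### final step only needs .fit()/.predict()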


Pablo Picasso is widely quoted as having said that "good artists borrow, great artists steal"... So, following this great answer https://stackoverflow.com/a/42271829/4471672, let's borrow, fix, and expand a bit further.


Imports

### get imports
import itertools
from itertools import combinations
import pandas as pd
from tqdm import tqdm ### displays progress bar in your loop


from sklearn.pipeline import Pipeline
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFE, SelectKBest
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import PoissonRegressor

### if working in Jupyter notebooks allows multiple prints per cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Data

X, y = load_diabetes(as_frame=True, return_X_y=True)
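
A quick sanity check on the data; note the targets are non-negative, which PoissonRegressor requires:

X.shape   ### (442, 10): age, sex, bmi, bp, s1..s6
y.min()   ### 25.0 -- all targets are >= 0, as PoissonRegressor requires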

Supplemental function



### make parameter grid for your GridSearchCV 
### code borrowed and adjusted to work with Python 3.++ from answer mentioned above

def make_param_grids(steps, param_grids):

    final_params=[]

    # itertools.product builds the cartesian product, so
    # (pca OR svd) AND (svm OR rf) becomes ->
    # (pca, svm), (pca, rf), (svd, svm), (svd, rf)
    for estimator_names in itertools.product(*steps.values()):
        current_grid = {}

        # step_name and estimator_name correspond one-to-one,
        # i.e. each estimator name must be listed under its own step
        for step_name, estimator_name in zip(steps.keys(), estimator_names):
            for param, value in param_grids.get(estimator_name).items():
                if param == 'object':
                    # Set actual estimator in pipeline
                    current_grid[step_name]=[value]
                else:
                    # Set parameters corresponding to above estimator
                    current_grid[step_name+'__'+param]=value
        #Append this dictionary to final params            
        final_params.append(current_grid)

    return final_params

#1 Example using RFE feature selection class

(an example where the feature selection class is a wrapper around an estimator rather than a plain transformer)

### pipelines chain from one step to the next as long as the previous step exposes .transform()

### adjust the next steps (below in all_param_grids) to fit your problem space

### RFE is a wrapper that wraps your model; another feature selection algorithm of the wrapper type is
### https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html

pipeline_steps = {'transform':['ss'], ### to try several alternatives per step, list them here, e.g. ['ss', 'xx'], and add 'xx' to all_param_grids as well (e.g. the preprocessor from your question)
                  'classifier':['rf']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(), ### your feature preprocessing could go here instead; StandardScaler is just an example
                          'with_mean':[True,False]
                         },

                   'rf':{'object':RFE(estimator=PoissonRegressor(),
                                        step=1,
                                        verbose=0),
                         'n_features_to_select':[1,2,3,4,5,6,7,8,9,10], ### set this to 1, for example, to see how it influences the grid search score
                         'estimator__fit_intercept':[True,False], ### tuning your model's hyperparams
                         'estimator__alpha':[0.1,0.5,0.7,1] ### tuning your model's hyperparams
                            }
                  }

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list

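For reference, the printed grid should look roughly like this:

[{'transform': [StandardScaler()],
  'transform__with_mean': [True, False],
  'classifier': [RFE(estimator=PoissonRegressor())],
  'classifier__n_features_to_select': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
  'classifier__estimator__fit_intercept': [True, False],
  'classifier__estimator__alpha': [0.1, 0.5, 0.7, 1]}]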

### put your pipeline together, using the classes as placeholders to initialize it; at .fit() the objects and parameters from the param grid are substituted in and evaluated
### at .fit() all parameters passed from param grid will be passed and evaluated
pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',RFE(estimator=PoissonRegressor()))])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"

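A side note on SelectFromModel (linked above): unlike RFE it exposes no .predict(), so it cannot sit in the final step the way RFE does here. A minimal sketch would place it as an intermediate 'select' step instead (as in example #2 below):

from sklearn.feature_selection import SelectFromModel

pipe_sfm = Pipeline(steps=[('transform', StandardScaler()),
                           ('select', SelectFromModel(estimator=PoissonRegressor(),
                                                      threshold='mean')), ### keeps features whose |coef_| beats the mean
                           ('classifier', PoissonRegressor())])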


#2 Example using SelectKBest feature selection class

(an example where the feature selection class exposes .transform())


pipeline_steps = {'transform':['ss'],
                  'select':['kbest'], ### another feature selector that exposes .transform() could be added to this list (and to all_param_grids); that would produce grids for transform -> select[1] -> classifier and transform -> select[2] -> classifier
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   
                   'kbest': {'object': SelectKBest(),
                             'k' : [1,2,3,4,5,6,7,8,9,10] ### set this to 1 to see how it influences the grid search score and to confirm it affects the next step
                             },

                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your model's hyperparams
                         'fit_intercept':[True,False], ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('select', SelectKBest()), ### placeholder initializer for the pipeline; the param grid supplies the actual object(s)
                       ('classifier',PoissonRegressor())])
pipe


### run it
gs_en_cv = GridSearchCV(pipe, 
                        param_grid=param_grids_list,
                        cv=KFold(n_splits=3,
                                 shuffle = True,
                                 random_state=123),
                       scoring = 'neg_root_mean_squared_error',
                       return_train_score=True,
                        
                        ### change verbose to higher number for more print outs
                        ### about fitting info which can also verify that 
                        ### all parameters you specify are getting fit 
                       verbose = 1)

gs_en_cv.fit(X,y)

f"``````````````````````````````````````````````````````````````````````````````````````"
f"best score is {gs_en_cv.best_score_}"
f"``````````````````````````````````````````````````````````````````````````````````````"
f"best params are"
gs_en_cv.best_params_
f"good luck"



#3 Brute force / looping over all possible combinations with pipeline

With the 10 diabetes features this evaluates 2^10 - 1 = 1023 subsets, each with its own grid search, so expect a few minutes of runtime.

pipeline_steps = {'transform':['ss'],
                  'classifier':['pr']}

# fill parameters to be searched in this dict
all_param_grids = {'ss':{'object':StandardScaler(),
                          'with_mean':[True,False]
                         }, 
                   'pr':{'object':PoissonRegressor(verbose=2),
                         'alpha':[0.1,0.25,0.5,0.75,1], ### tuning your model's hyperparams
                         'fit_intercept':[True,False], ### tuning your model's hyperparams
                            }
                  }  

# Call the method on the above declared variables
param_grids_list = make_param_grids(pipeline_steps, all_param_grids)
param_grids_list


pipe = Pipeline(steps=[('transform',StandardScaler()),
                       ('classifier',PoissonRegressor())])
pipe


feature_combo = []  ### record feature combinations
score = [] ### record GridSearchCV best scores
params = [] ### record params of best score

stuff = list(X.columns)
for L in tqdm(range(1, len(stuff)+1)): ### tqdm displays an overall progress bar here
    for subset in itertools.combinations(stuff, L): ### create all possible combinations of L features
        ### run it
        gs_en_cv = GridSearchCV(pipe, 
                                param_grid=param_grids_list,
                                cv=KFold(n_splits=3,
                                         shuffle = True,
                                         random_state=123),
                               scoring = 'neg_root_mean_squared_error',
                               return_train_score=True,

                                ### change verbose to higher number for more print outs
                                ### about fitting info which can also verify that 
                                ### all parameters you specify are getting fit 
                               verbose = 0)

        fitted = gs_en_cv.fit(X[list(subset)],y)
    
        score.append(fitted.best_score_) ### append results
        params.append(fitted.best_params_) ### append results
        feature_combo.append(list(subset)) ### append results


### assemble your dataframe, sort and print out top feature combo and model params results
df = pd.DataFrame({'feature_combo':feature_combo,
                   'score':score,
                   'params':params})

df.sort_values(by='score', ascending=False,inplace=True)
df.head(1)
df.head(1).params.iloc[0]
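
If you want a single final model out of this, a minimal sketch (a hypothetical follow-up using the names above) is to refit the pipeline on the winning subset with the winning params:

best_features = df.head(1).feature_combo.iloc[0]  ### winning feature subset
best_params = df.head(1).params.iloc[0]           ### winning param dict
final_model = pipe.set_params(**best_params).fit(X[best_features], y)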



PS
For interactions (I assume you mean creating new features by combining the originals?), I would create those interaction features up front and include them at your .fit(). Otherwise, how would you know you had found the best interaction features, given that you would be building interactions only AFTER selecting a subset? Why not create them from the start and let the feature selection portion of the grid search tell you what's best?
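
A minimal sketch of that idea, using sklearn's PolynomialFeatures with interaction_only=True to generate all pairwise interaction terms up front (get_feature_names_out assumes sklearn >= 1.0):

from sklearn.preprocessing import PolynomialFeatures

### expand X with all pairwise interaction columns, keeping readable column names
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_inter = pd.DataFrame(interactions.fit_transform(X),
                       columns=interactions.get_feature_names_out(X.columns))

### then feed X_inter into any of the approaches above, e.g. gs_en_cv.fit(X_inter, y)

Be aware that the feature count grows quickly (10 originals become 55 columns), so the brute-force loop in #3 becomes much more expensive.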

Yev Guyduy
  • Thank you very much for your answer. There is one bit that I don't understand: why, within `all_param_grids`, does `PoissonRegressor()` appear twice, once in the RFE estimator and once in 'estimator'? And why is `DecisionTreeRegressor()` there after `PoissonRegressor()` as well? What does it mean to have `DecisionTreeRegressor` and `PoissonRegressor` within the estimator? `'rfe':{'object':RFE(estimator=PoissonRegressor()), 'estimator': [PoissonRegressor(), DecisionTreeRegressor()],` – user1769197 Jun 19 '22 at 07:15
  • fyi this is somewhat of a starter code, I didn't include ALL the https://scikit-learn.org/stable/modules/feature_selection.html classes, but it's easy to expand my code to all of them. `PoissonRegressor()` appears twice because I use it once for the MODEL and another time WITHIN RFE, which needs an estimator as one of its parameters. Including `PoissonRegressor` and `DecisionTreeRegressor` in RFE means CV will treat the estimator as a hyperparameter to try. You can add other models within RFE as well and they will be tried in your cross-validation; I just put 2 there so that you can see how to utilize it. – Yev Guyduy Jun 19 '22 at 12:00
  • fyi I slightly edited my answer with more feature selection classes and more comments – Yev Guyduy Jun 19 '22 at 12:54
  • fyi #2: my original answer didn't sit too well with me, so I went back to experiment and as a result rewrote it, and in that process also discovered a bunch of SO answers that have not properly used these methods in other similar questions... Take a look at the updates and check the comments for where I changed things to make sure the outputs are affected. Note that method #3 runs for about 3 mins on my end, even with 10 features. – Yev Guyduy Jun 22 '22 at 21:28
  • For #3, I don't understand the reason behind re-writing #2. What is missing from `SelectKBest` that led you to write #3 using a for loop to create all the possible combinations? Isn't `SelectKBest` doing an exhaustive search over all possible combinations already? – user1769197 Jun 29 '22 at 19:28
  • I want to say this is already a superb answer. It would be perfect if you could also include in the pipeline how to transform the target feature `y` by standard scaling `with_mean = ['True','False']`, assuming `y` is a continuous feature. This would be much appreciated. – user1769197 Jun 29 '22 at 19:38
  • For #3: SelectKBest uses a scoring function to select the K best features by score. So, for example, you can pass chi2 or f_regression as your scoring function, plus a cut-off point, to select your features based on the score returned. It's not really doing an exhaustive search over ALL possible combinations. I'm a bit swamped right now, but I will include SS for y once I dig myself out a bit. – Yev Guyduy Jul 01 '22 at 15:27
  • Sorry, I thought `itertools.combinations(stuff, L)` meant you were doing an exhaustive search over all combinations – user1769197 Jul 01 '22 at 16:48
  • Yes, my #3 does exactly that, because neither #1 nor #2 does. – Yev Guyduy Jul 01 '22 at 17:01