Optimizing number of optimum features

Question

I am training a neural network using Keras. Each time I train the model, I use a slightly different set of features, selected with tree-based feature selection via ExtraTreesClassifier(). After each training run I compute the AUROC on my validation set, then loop back and train the model again with a different set of features. This process is very inefficient, and I want to select the optimum number of features using an optimization technique available in some Python library.

The function to be optimized is the cross-validation AUROC, which can only be calculated after training the model on the selected features. The features are selected via the following call:

ExtraTreesClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto')

Here the objective function does not depend directly on the parameters to be optimized. The objective, AUROC, comes from training the neural network, and the network takes as input the features that were selected on the basis of their importance from ExtraTreesClassifier. So, in a way, the parameters over which I optimize AUROC are n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', or some other variables in ExtraTreesClassifier. These are not directly related to AUROC.

Maybe I need to use `sklearn pipeline`. – Stupid420, Feb 08 '18 at 00:39

score 1 · Accepted Answer · answered Feb 08 '18 at 05:42

You should combine GridSearchCV and Pipeline. Use Pipeline when you need to run a set of instructions in sequence, and GridSearchCV to search for the optimal config of those instructions.

For example, you have these steps to run:
1. Select KBest feature(s)
2. Use a classifier, DecisionTree or NaiveBayes

By combining GridSearchCV and Pipeline, you can select which features are best for a particular classifier, the best config for that classifier, and so on, based on the scoring criteria.
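The keys of a parameter grid for a Pipeline follow the `<step name>__<parameter name>` convention. A minimal sketch, assuming the step names `kbest` and `classify` used in this answer, lists the names that GridSearchCV would accept:

```python
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Step names ("kbest", "classify") are the prefixes used in the grid keys.
pipe = Pipeline(steps=[("kbest", SelectKBest()), ("classify", DecisionTreeClassifier())])

# get_params() exposes every tunable setting as "<step name>__<parameter name>".
param_names = sorted(pipe.get_params().keys())

print([name for name in param_names if name.startswith("kbest__")])
print("classify__criterion" in param_names)
```

Checking a key against `pipe.get_params()` before starting a long search is a cheap way to catch a misspelled parameter name.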

Example:

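from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Set your configuration options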
param_grid = [{
    'classify': [DecisionTreeClassifier()], #first option use DT
    'kbest__k': range(1, 22), #range of n in SelectKBest(n)

    #classifier's specific configs
    'classify__criterion': ('gini', 'entropy'), 
    'classify__min_samples_split': range(2,10),
    'classify__min_samples_leaf': range(1,10)
},
{
    'classify': [GaussianNB()], #second option use NB
    'kbest__k': range(1, 22), #range of n in SelectKBest(n)
}]

# DT is the default classifier here, but GridSearchCV replaces it with each 'classify' option from param_grid.
pipe = Pipeline(steps=[("kbest", SelectKBest()), ("classify", DecisionTreeClassifier())])

# GridSearchCV does the heavy lifting here; this may take time, especially if more than one classifier is evaluated
grid = GridSearchCV(pipe, param_grid=param_grid, cv=10, scoring='f1')
grid.fit(features, labels)

# Print the best params so you can reuse the optimal settings later without running the grid search again (by commenting out the grid search lines above)
print(grid.best_params_)

# You can now use Pipeline again to wrap the steps with their best configs to build your model
pipe = Pipeline(steps=[("kbest", SelectKBest(k=12)), ("classify", DecisionTreeClassifier(criterion="entropy", min_samples_leaf=2, min_samples_split=9))])
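As an alternative to rebuilding the pipeline by hand, GridSearchCV refits the best configuration on the full data by default (`refit=True`) and exposes it as `best_estimator_`. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for `features` and `labels`, and a much smaller grid than the one above:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with your own features/labels).
features, labels = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

pipe = Pipeline(steps=[
    ("kbest", SelectKBest()),
    ("classify", DecisionTreeClassifier(random_state=0)),
])
param_grid = {"kbest__k": [2, 4, 6], "classify__criterion": ["gini", "entropy"]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring="f1")
grid.fit(features, labels)

# The winning pipeline, already refitted on all of features/labels.
best_pipe = grid.best_estimator_
print(grid.best_params_)
print(best_pipe.predict(features[:5]))
```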

Hope this helps

I implemented it here with details, but I am still confused about applying the pipeline with Grid Search: https://stackoverflow.com/questions/48730921/optimizing-two-estimators-dependent-on-each-other-using-sklearn-grid-search – Stupid420, Feb 11 '18 at 15:17

score 0 · Answer 2 · answered Feb 11 '18 at 15:23

The flow of my program is in two stages.

In the first stage, I am using the sklearn ExtraTreesClassifier along with the SelectFromModel method to select the most important features. Note that ExtraTreesClassifier takes many parameters as input, such as n_estimators, and that different values of n_estimators eventually give different sets of important features via SelectFromModel. This means that I can optimize n_estimators to get the best features.
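The first stage can be sketched in isolation. This is only an illustration on synthetic data from `make_classification` (an assumption, not the data set from the question); it fits SelectFromModel for several values of n_estimators and records which features survive each time:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (an assumption; replace with your own training data).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

selected = {}
for n_estimators in (10, 20, 30):
    # With a tree ensemble, SelectFromModel keeps the features whose
    # importance is at least the mean importance (its default threshold).
    selector = SelectFromModel(
        ExtraTreesClassifier(n_estimators=n_estimators, random_state=0)
    )
    selector.fit(X, y)
    selected[n_estimators] = selector.get_support()  # boolean mask over columns
    print(n_estimators, int(selected[n_estimators].sum()), "features kept")
```

Comparing the masks in `selected` shows whether the choice of n_estimators changes the selected feature set for a given data set.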

In the second stage, I am training my Keras NN model on the features selected in the first stage. I am using AUROC as the score for the grid search, but this AUROC is calculated using the Keras-based neural network. I want to use Grid Search over n_estimators in my ExtraTreesClassifier to optimize the AUROC of the Keras neural network. I know I have to use Pipeline, but I am confused about implementing both together.

I don't know where to put Pipeline in my code. I am getting an error which says: TypeError: estimator should be an estimator implementing 'fit' method, <function fs at 0x0000023A12974598> was passed

#################################################################################
# I concatenate the CV set and the train set so that I can select the most
# important features in both CV and train together.
#################################################################################

frames11 = [train_x_upsampled, cross_val_x_upsampled]
train_cv_x = pd.concat(frames11)
frames22 = [train_y_upsampled, cross_val_y_upsampled]
train_cv_y = pd.concat(frames22)


def fs(n_estimators):
  m = ExtraTreesClassifier(n_estimators=n_estimators)
  m.fit(train_cv_x,train_cv_y)
  sel = SelectFromModel(m, prefit=True)


  ##################################################
  # The code below gets the names of the selected important features
  ##################################################

  feature_idx = sel.get_support()
  feature_name = train_cv_x.columns[feature_idx]
  feature_name =pd.DataFrame(feature_name)

  X_new = sel.transform(train_cv_x)
  X_new =pd.DataFrame(X_new)

  ######################################################################
  # The selected important features are now in the data frame X_new.
  # Below, I split the data into train and CV again, but this time
  # with only the selected important features.
  ######################################################################

  train_selected_x = X_new.iloc[0:train_x_upsampled.shape[0], :]
  cv_selected_x = X_new.iloc[train_x_upsampled.shape[0]:train_x_upsampled.shape[0]+cross_val_x_upsampled.shape[0], :]

  train_selected_y = train_cv_y.iloc[0:train_x_upsampled.shape[0], :]
  cv_selected_y = train_cv_y.iloc[train_x_upsampled.shape[0]:train_x_upsampled.shape[0]+cross_val_x_upsampled.shape[0], :]

  train_selected_x=train_selected_x.values
  cv_selected_x=cv_selected_x.values
  train_selected_y=train_selected_y.values
  cv_selected_y=cv_selected_y.values

  ##############################################################
  # With this new data, which contains only the important features,
  # I train a neural network as below.
  ##############################################################
  def create_model():
     n_x_new=train_selected_x.shape[1]

     model = Sequential()
     model.add(Dense(n_x_new, input_dim=n_x_new, kernel_initializer='glorot_normal', activation='relu'))
     model.add(Dense(10, kernel_initializer='glorot_normal', activation='relu'))
     model.add(Dropout(0.8))

     model.add(Dense(1, kernel_initializer='glorot_normal', activation='sigmoid'))
     optimizer = keras.optimizers.Adam(lr=0.001)


     model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
     return model  # KerasClassifier's build_fn must return the compiled model

  seed = 7
  np.random.seed(seed)

model = KerasClassifier(build_fn=create_model, epochs=20, batch_size=400, verbose=0)

n_estimators=[10,20,30]
param_grid = dict(n_estimators=n_estimators)

grid = GridSearchCV(estimator=fs, param_grid=param_grid,scoring='roc_auc',cv = PredefinedSplit(test_fold=my_test_fold), n_jobs=1)
grid_result = grid.fit(np.concatenate((train_selected_x, cv_selected_x), axis=0), np.concatenate((train_selected_y, cv_selected_y), axis=0))
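The TypeError arises because GridSearchCV expects an estimator object with a fit method, while fs is a plain function. One way to express the two stages is to put them into a single Pipeline and tune the nested parameter `select__estimator__n_estimators`. The sketch below is not the code from the question: it assumes synthetic data, uses sklearn's MLPClassifier as a stand-in for the Keras network, and the step names `select` and `net` are hypothetical. A scikit-learn-compatible Keras wrapper such as the KerasClassifier above would take the place of the `net` step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the concatenated train + CV data (an assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)

# -1 marks rows that are always used for training, 0 marks the validation fold.
test_fold = np.array([-1] * 300 + [0] * 100)

pipe = Pipeline(steps=[
    # Stage 1: feature selection driven by ExtraTreesClassifier importances.
    ("select", SelectFromModel(ExtraTreesClassifier(random_state=0))),
    # Stage 2: the network (MLPClassifier here, standing in for Keras).
    ("net", MLPClassifier(hidden_layer_sizes=(10,), max_iter=300, random_state=0)),
])

# Nested name: step "select" -> its "estimator" -> that estimator's n_estimators.
param_grid = {"select__estimator__n_estimators": [10, 20, 30]}

grid = GridSearchCV(
    pipe,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=PredefinedSplit(test_fold=test_fold),
    n_jobs=1,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

With this layout the feature selector is fitted on the training rows only for each candidate, so the validation rows do not influence which features are kept. The network's input size then has to adapt to the number of selected features, which a wrapper-based Keras model would need to handle when it builds the model.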
