
I have written the following algorithm to implement Ridge regression and tune its regularization parameter via cross-validation. In particular, I want to achieve the following:

  1. For the purpose of cross-validation, the train set is divided into 10 folds. The first time, the model is estimated on fold 1 and validated on fold 2; the second time it is estimated on folds 1-2 and validated on fold 3, ..., the 9th time it is estimated on folds 1-9 and validated on fold 10.
  2. For each of the 9 fits above, I want the training features to be z-scored, and the validation features to be z-scored using the mean and standard deviation computed on the fold(s) used in training.
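
As a side note, unless I am mistaken this expanding-window scheme is exactly what scikit-learn's `TimeSeriesSplit` produces: with `n_splits=9` on data that divides into 10 equal folds, the first split trains on fold 1 and tests on fold 2, the last trains on folds 1-9 and tests on fold 10. A minimal sketch on 100 synthetic rows:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # 100 rows -> 10 folds of 10 rows each

# n_splits=9 gives 9 expanding train sets and a fixed-size test fold
tscv = TimeSeriesSplit(n_splits=9)
for train_idx, test_idx in tscv.split(X):
    print(len(train_idx), len(test_idx))
# prints 10 10, then 20 10, ..., up to 90 10
```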

I am doing something wrong in the pipeline to implement point 2 but I can't figure out what. Could I have your opinion on the implementation below?

# Create two lists with the train and validation indices as per point 1

import numpy as np
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

n_splits = 10
kf = KFold(n_splits=n_splits, shuffle=False)

folds = [idx for _, idx in kf.split(df_train)]
indexes_train = [folds[0]]
indexes_test = [folds[1]]

for i in range(1, n_splits - 1):
    # Expand the training window by one fold; validate on the next fold
    indexes_train.append(np.concatenate((indexes_train[i - 1], folds[i])))
    indexes_test.append(folds[i + 1])

# Tune the model as per point 2

pipe = Pipeline(steps=[('scaler', StandardScaler()), ('model', Ridge(fit_intercept=True))])
alpha_tune = {'model__alpha': self.alpha_values}
cross_validation = list(zip(indexes_train, indexes_test))
model = GridSearchCV(estimator=pipe, param_grid=alpha_tune, cv=cross_validation,
                     scoring='neg_mean_squared_error', n_jobs=-1).fit(features_train, labels_train)

best_alpha = model.best_params_['model__alpha']
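
One sanity check I tried for point 2 (a sketch on made-up synthetic data, with hypothetical index ranges standing in for the folds): fit the same pipeline on a single train/validation split by hand and confirm the scaler's statistics come from the training rows only. `GridSearchCV` should be doing the same thing internally for each split in `cv`:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.normal(size=100)

train_idx = np.arange(0, 60)   # say, folds 1-6
valid_idx = np.arange(60, 70)  # fold 7

pipe = Pipeline([('scaler', StandardScaler()),
                 ('model', Ridge(alpha=1.0, fit_intercept=True))])
pipe.fit(X[train_idx], y[train_idx])

# The scaler's statistics were computed from the training rows only
scaler = pipe.named_steps['scaler']
assert np.allclose(scaler.mean_, X[train_idx].mean(axis=0))
assert np.allclose(scaler.scale_, X[train_idx].std(axis=0))

# predict() on the validation fold reuses those same training statistics
pipe.predict(X[valid_idx])
```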
