
I am very new to Python, and my code is a slight modification of this code. It is currently throwing the following error, and I don't understand why:

File "LassoNested.py", line 51, in <module>
lasso_regressor.fit(X_test_inner, y_test_inner), line 1071, in fit
alphas = np.tile(np.sort(alphas)[::-1], (n_l1_ratio, 1))
IndexError: too many indices for array
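
The failing line in the traceback slices alphas with [::-1]. As far as I can tell, a single scalar becomes a 0-d array at that point, and slicing a 0-d array raises exactly this IndexError; a standalone reproduction, separate from my script:

import numpy as np

a = np.asarray(0.1)   # a single scalar becomes a 0-d array with shape ()
print a.shape         # ()

a[::-1]               # IndexError: too many indices for array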

Here is the code. It fits a Lasso to my dataset, performing model evaluation in the outer loop and model selection in the inner loop (where it searches for the optimal alpha value for the Lasso):

import numpy as np
import operator
import csv
from sklearn import linear_model
from sklearn import cross_validation


# Load the training data
X = np.loadtxt("train.csv", delimiter=',', usecols=range(1, 15))
y = np.loadtxt("train.csv", delimiter=',', usecols=range(15, 16))
print "Number of training samples: " + str(X.shape[0])

outer_scores = []

# outer cross-validation
outer = cross_validation.KFold(len(y), n_folds=3, shuffle=True)
for fold, (train_index_outer, test_index_outer) in enumerate(outer):
    X_train_outer, X_test_outer = X[train_index_outer], X[test_index_outer]
    y_train_outer, y_test_outer = y[train_index_outer], y[test_index_outer]

    inner_mean_scores = []

    # define the explored parameter space;
    # the procedure below should be equivalent to GridSearchCV
    #tuned_parameter = np.logspace(-10, 2, 15)
    alphas = [0.1, 2, 3]
    for param in alphas:

        inner_scores = []

        # inner cross-validation
        inner = cross_validation.KFold(len(X_train_outer), n_folds=3, shuffle=True)
        for train_index_inner, test_index_inner in inner:
            # split the training data of outer CV
            X_train_inner, X_test_inner = X_train_outer[train_index_inner], X_train_outer[test_index_inner]
            y_train_inner, y_test_inner = y_train_outer[train_index_inner], y_train_outer[test_index_inner]

            lasso_regressor = linear_model.LassoCV(alphas=param, cv=10, normalize=True, fit_intercept=True)
            lasso_regressor.fit(X_test_inner, y_test_inner)
            inner_scores.append(lasso_regressor.score(X_test_inner, y_test_inner))

        # calculate mean score for inner folds
        inner_mean_scores.append(np.mean(inner_scores))

    # get maximum score index
    index, value = max(enumerate(inner_mean_scores), key=operator.itemgetter(1))

    print 'Best parameter of fold %i: %s' % (fold + 1, alphas[index])

    # fit the selected model to the training set of outer CV
    # for prediction error estimation
    lasso_regressor2 = linear_model.LassoCV(alphas=param, cv=10, normalize=True, fit_intercept=True)
    lasso_regressor2.fit(X_train_outer, y_train_outer)
    outer_scores.append(lasso_regressor2.score(X_test_outer, y_test_outer))

# show the prediction error estimate produced by nested CV
print 'Unbiased prediction error: %.4f' % (np.mean(outer_scores))

# finally, fit the selected model to the whole dataset
lasso_regressor3 = linear_model.LassoCV(alphas=param, cv=10, normalize=True, fit_intercept=True)
lasso_regressor3.fit(X, y)
  • I don't understand how this is helpful to me... – TestGuest Nov 01 '15 at 11:57
  • Can you show us the output of `X_test_inner.shape` and `y_test_inner.shape`? – cel Nov 01 '15 at 15:36
  • The `alphas` argument to `LassoCV` needs to be an array containing multiple alphas, but you are currently only passing it a scalar on each iteration. – ali_m Nov 01 '15 at 17:28
  • `LassoCV` already does cross-validation internally in order to pick the best alpha value, so you can skip the inner `for` loop and just pass in your list/array of alphas once. – ali_m Nov 01 '15 at 17:34
  • Thank you @ali_m. I have quickly changed LassoCV to Lasso, but you are right, of course. Regarding the data structure: what is the best way to pass the alphas in Python? An array of arrays? – TestGuest Nov 01 '15 at 17:40
  • I don't quite understand what you're trying to do. The only parameter you seem to be tuning here is alpha, and since `LassoCV` already does cross-validation to find the optimal alpha from a list/array of alphas there's no need for the inner pair of loops. You could replace them with a single call to `LassoCV` and `LassoCV.fit`. – ali_m Nov 01 '15 at 17:52
  • Nested cross-validation is always necessary: there has to be an inner loop for finding the optimal parameter, and an outer loop with a left-out fold for model evaluation. This is quite standard, no matter whether there is one parameter or more. I understand that I could write the inner loop more concisely with LassoCV, but if I write it out explicitly like this, I would like to know which structure I need to use for the alphas. – TestGuest Nov 01 '15 at 17:56
  • 1
    What I mean is that you can get rid of `for param in alphas:...` and `for train_index_inner, test_index_inner in inner:...`. You would still be doing nested cross-validation, but the inner part (where you are finding the optimal alpha) can be done internally by `LassoCV`. You can pass your alphas to `LassoCV` as a list or an array (it doesn't really matter). – ali_m Nov 01 '15 at 18:31
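
Following ali_m's suggestion, here is a minimal sketch of what the restructured code might look like (assuming X and y are loaded as in the question, and the same pre-0.18 sklearn.cross_validation API). LassoCV performs the inner model selection internally, so only the outer evaluation loop remains, and the alphas are passed in once as a list:

import numpy as np
from sklearn import linear_model
from sklearn import cross_validation

alphas = [0.1, 2, 3]
outer_scores = []

# outer cross-validation for model evaluation
outer = cross_validation.KFold(len(y), n_folds=3, shuffle=True)
for fold, (train_index_outer, test_index_outer) in enumerate(outer):
    X_train_outer, X_test_outer = X[train_index_outer], X[test_index_outer]
    y_train_outer, y_test_outer = y[train_index_outer], y[test_index_outer]

    # inner cross-validation for model selection: LassoCV searches the
    # alphas list internally over 3 inner folds of the outer training set
    lasso_cv = linear_model.LassoCV(alphas=alphas, cv=3, normalize=True, fit_intercept=True)
    lasso_cv.fit(X_train_outer, y_train_outer)
    print 'Best parameter of fold %i: %s' % (fold + 1, lasso_cv.alpha_)

    # LassoCV refits itself on the whole outer training fold with the best
    # alpha, so scoring on the held-out outer fold gives the nested-CV estimate
    outer_scores.append(lasso_cv.score(X_test_outer, y_test_outer))

print 'Unbiased prediction error: %.4f' % np.mean(outer_scores)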

0 Answers