
I want to understand k-fold cross-validation more clearly, and how to choose the best model after it has been applied as a cross-validation method.

According to this source: https://machinelearningmastery.com/k-fold-cross-validation/ the steps to carry out k-fold cross-validation are:

  1. Shuffle the dataset randomly
  2. Split the dataset into k groups
  3. For each unique group:

    • Take the group as a hold out or test data set

    • Take the remaining groups as a training data set

    • Fit a model on the training set and evaluate it on the test set

    • Retain the evaluation score and discard the model

  4. Summarize the skill of the model using the sample of model evaluation scores
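
If I follow those steps with an sklearn estimator, I picture the loop roughly like this (my own sketch on a toy dataset, with LogisticRegression standing in for any model):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=100, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)    # steps 1 and 2
scores = []
for train_index, test_index in kf.split(X):             # step 3
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model = LogisticRegression().fit(X_train, y_train)  # fit on the training folds
    scores.append(model.score(X_test, y_test))          # retain the score, discard the model

print("mean=%0.2f, std=%0.2f" % (np.mean(scores), np.std(scores)))  # step 4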

However, I have a question in relation to this process.

What is "Retain the evaluation score and discard the model" supposed to mean? How do you do it?

After my research, I believe it may have to do with the sklearn function cross_val_score(), but when I try to use it by passing my model to it, it throws the following error:

Traceback (most recent call last):
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\joblib\parallel.py", line 797, in dispatch_one_batch
    tasks = self._ready_batches.get(block=False)
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\temporary.py", line 187, in <module>
    scores = cross_val_score(model, X_test, y_test, cv=kf,scoring="accuracy")
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\sklearn\model_selection\_validation.py", line 390, in cross_val_score
    error_score=error_score)
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\sklearn\model_selection\_validation.py", line 236, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\joblib\parallel.py", line 1004, in __call__
    if self.dispatch_one_batch(iterator):
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\joblib\parallel.py", line 808, in dispatch_one_batch
    islice = list(itertools.islice(iterator, big_batch_size))
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\sklearn\model_selection\_validation.py", line 236, in <genexpr>
    for train, test in cv.split(X, y, groups))
  File "D:\ProgramData\Miniconda3\envs\Env_DLexp1\lib\site-packages\sklearn\base.py", line 67, in clone
    % (repr(estimator), type(estimator)))
TypeError: Cannot clone object '<keras.engine.sequential.Sequential object at 0x00000267F9C851C8>' (type <class 'keras.engine.sequential.Sequential'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.

According to the documentation, https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html, the first argument for cross_val_score() must be an estimator, which they define as an "estimator object implementing 'fit'. The object to use to fit the data."

Therefore, I can't understand the exception.

This is the relevant part of my code:

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
model.add(BatchNormalization(weights=None, epsilon=1e-06, momentum=0.9))
model.add(MaxPooling1D(pool_size=2))
model.add(Flatten())
model.add(Dense(10, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.summary()

from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

kf = KFold(n_splits=4, random_state=None, shuffle=True)
print(kf)

for train_index, test_index in kf.split(data):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = labels[train_index], labels[test_index]


Adam=keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(optimizer=Adam,
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
history = model.fit(X_train, y_train,
                    epochs=15,
                    batch_size=32,
                    verbose=1,
                    callbacks=callbacks_list,
                    validation_data=(X_test, y_test))


scores = cross_val_score(model, X_test, y_test, cv=kf, scoring="accuracy")
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

I would appreciate any help you can give me. Please take into consideration that I am not a data scientist or a developer.

1 Answer


What is "Retain the evaluation score and discard the model" supposed to mean?

Retaining the evaluation score means saving the score of the model tested in the current CV iteration, for example by appending it to a list in memory, so you can compare it with the scores from the other folds. The fitted model itself is then thrown away, because each fold trains a fresh one.

How do you do it?

You could use cross_val_score() from sklearn when working with sklearn algorithms, but you are working with Keras, so you will need to drive the loop yourself with the KFold class. Have a look at this Kaggle kernel, it shows the implementation you need; there are a lot of examples like it on the internet, just pick the one you understand the most. Below is a minimal sketch of the manual loop, assuming data, labels, max_words, embedding_dim and maxlen are defined as in your script (the build_model() helper and its hyperparameters are illustrative, not taken from your code):
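
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Embedding, Conv1D, MaxPooling1D, Flatten
from sklearn.model_selection import KFold

def build_model():
    # Build and compile a *fresh* model for every fold, so no weights
    # leak from one fold into the next.
    model = Sequential()
    model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(Dense(10, activation='relu'))
    model.add(Dense(4, activation='softmax'))
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['sparse_categorical_accuracy'])
    return model

kf = KFold(n_splits=4, shuffle=True)
scores = []
for train_index, test_index in kf.split(data):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = labels[train_index], labels[test_index]
    model = build_model()                          # new model per fold
    model.fit(X_train, y_train, epochs=15, batch_size=32, verbose=0)
    _, acc = model.evaluate(X_test, y_test, verbose=0)
    scores.append(acc)                             # retain the evaluation score...
    del model                                      # ...and discard the model

print("Accuracy: %0.2f (+/- %0.2f)" % (np.mean(scores), np.std(scores) * 2))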

Therefore, I can't understand the exception.

cross_val_score() accepts an estimator as its first parameter. What is an estimator? According to the documentation, an estimator is a class that implements a defined interface (fit(), get_params(), set_params(), and so on), following this documentation.

As you can see, your Keras model does not implement part of that interface, namely get_params(), so sklearn cannot clone it and you get the error: it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods. If you still want to feed the model to cross_val_score(), Keras ships a wrapper that adds the missing interface; a sketch follows, reusing the hypothetical build_model() helper and the data and labels arrays from above:
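
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import KFold, cross_val_score

# The wrapper implements get_params()/set_params(), so sklearn can clone it.
estimator = KerasClassifier(build_fn=build_model, epochs=15, batch_size=32, verbose=0)

kf = KFold(n_splits=4, shuffle=True)
scores = cross_val_score(estimator, data, labels, cv=kf, scoring="accuracy")
print(scores.mean(), scores.std())

Note that in either variant you hand cross_val_score() the whole dataset (data, labels), not X_test and y_test as in your script: cross-validation does the train/test splitting itself.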
