My code:

X = data['text_with_tokeniz_lemmatiz']
y = data['toxic']
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, train_size=0.8, test_size=0.2, shuffle=False, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, shuffle=False, random_state=12345)

The inspector wrote to me: "You use both validation sampling and cross-validation at the same time. It would be better to transfer the entire project to cross-validation and increase the amount of data in training."

How can I fix this? I don't know how to do it.

Kirill

1 Answer

When using a validation dataset, we usually train a model on the training data and evaluate its performance on the validation data.

Cross-validation is essentially the same thing, but done multiple times, with different splits.
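To make "done multiple times, with different splits" concrete, here is a minimal sketch in plain Python. The helper k_fold_indices is hypothetical (it is not part of scikit-learn); it only illustrates the idea of unshuffled K-fold splitting, where each fold takes one turn as the validation set while the rest is used for training.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds.

    Hypothetical helper for illustration only.
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, valid_idx
        start = stop


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```

Every sample ends up in a validation fold exactly once, which is why a separate hand-made validation set becomes redundant.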

As your inspector suggests, it is not necessary to split the validation data yourself, as this is already done during cross-validation.

It is hard to say how you can fix it when we don't see how you use the validation data in the code. From what I see, you first need to get rid of the validation data entirely, so the code would look like:

X = data['text_with_tokeniz_lemmatiz']
y = data['toxic']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, test_size=0.1, shuffle=False, random_state=12345)

If later in the code you use the validation data to measure the performance of your learning algorithm, you can replace that step with cross-validation, for instance using scikit-learn's cross_val_score.
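As a rough sketch of what that replacement could look like: the toy texts, the TfidfVectorizer + LogisticRegression pipeline, cv=5 and the F1 metric below are all assumptions for illustration, since the question does not show the model or the metric. Only train_test_split and cross_val_score come from the thread.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for data['text_with_tokeniz_lemmatiz'] and data['toxic']
# (hypothetical contents; substitute the real columns).
X = ["you be stupid idiot", "thank you for help"] * 20
y = [1, 0] * 20

# Keep only a held-out test set; no separate validation split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=False
)

# Assumed model: the vectorizer lives inside the pipeline, so it is
# re-fitted on the training part of each fold.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Cross-validation takes over the role of the old validation set.
scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print("mean CV F1:", scores.mean())

# After choosing a model, fit on the full training data and evaluate
# once on the untouched test set (score() reports accuracy here).
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

Putting the vectorizer in the pipeline is a design choice that avoids fitting it on data the fold later validates on.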

ErwanF
  • Could you show how to do it in one version with the validation set, and in another version using cross-validation? The reviewer's comment was specifically about train_test_split – Kirill Jan 09 '23 at 21:59
  • Well, I can't know without having the rest of the code; that will depend on your use of the validation data. Cross-validation can be used like this: from sklearn.model_selection import cross_val_score; model = ...; train_x, train_y = ...; scores = cross_val_score(model, train_x, train_y) – ErwanF Jan 09 '23 at 22:07