I tuned a RandomForest with GroupKFold (to prevent data leakage because some rows came from the same group).

I get a best-fit model, but when I try to make a prediction on the test data, it says that it needs the group feature.

Does that make sense? It's odd that the group feature is also coming up as one of the most important features.

I'm just wondering if there is something I could be doing wrong.

Thanks

  • Can you please provide a minimal, reproducible example of your code? (https://stackoverflow.com/help/minimal-reproducible-example) – Kim Tang Sep 03 '20 at 07:59
  • I don't think that is necessary. This is a question on theory @KimTang – TheCuriouslyCodingFoxah Sep 03 '20 at 15:21
  • Okay, for me your current question is too vague to understand the problem. You trained a RandomForestClassifier with one of the folds created by GroupKFold, and then when you predict, you get an error asking for a "group feature"? What is this "group feature"? I could not find anything about it in the documentation for either RandomForestClassifier or GroupKFold. – Kim Tang Sep 03 '20 at 15:31
  • I agree with @KimTang: we are lacking details about what you are actually doing. A code example would serve as a good basis for discussion (no wording problems) and would eliminate any doubt about simple coding mistakes! – Bruce Swain Sep 04 '20 at 13:50

2 Answers

A search of the scikit-learn GitHub repo does not reveal a single instance of the string "group feature" or "group_feature" or anything similar, so I will assume that your data set contains a feature called "group" that the prediction model requires as input in order to produce an output.

Remember that a prediction model is basically a function that takes an input (the "predictor" variables) and returns an output (the "predicted" variable). If a variable called "group" was among the inputs the model was trained on, then it makes sense that scikit-learn would request it at prediction time.
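To illustrate the point, here is a minimal sketch using made-up data (the array names `features`, `group`, and `X_with_group` are hypothetical, not taken from the question). It assumes the group ID was stacked into the training matrix as an extra column, in which case the fitted forest expects that column again when predicting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 3 real features plus a group ID for each row.
features = rng.normal(size=(60, 3))
group = np.repeat(np.arange(6), 10)
y = rng.integers(0, 2, size=60)

# Assumption: the group ID was included as a fourth input column.
X_with_group = np.column_stack([features, group])
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_with_group, y)

# Predicting without the group column fails, because the model was
# fitted on four columns and now receives only three.
try:
    model.predict(features[:5])
    raised = False
except ValueError:
    raised = True
```

If `raised` ends up `True`, the model is asking for the group column only because it was part of the training input.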

josephmure
  • It shows it right here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html. – TheCuriouslyCodingFoxah Sep 07 '20 at 12:27
  • You pass in group as part of the .split() method: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html#sklearn.model_selection.GroupKFold.split – TheCuriouslyCodingFoxah Sep 07 '20 at 12:28
  • You don't need to call split yourself when you want to do cross-validation; you simply pass your instantiated cross-validation iterator object, say 'GroupKFold(n_splits=5)', to your 'sklearn.model_selection.GridSearchCV'. Help will be more effective with a minimal example. – SoufianeK Sep 08 '20 at 14:56

Does the group appear as a column in the training set? If so, remove it and retrain. It looks like you are only using it to generate the splits. If it isn't part of the input data you will have when you need to predict, it shouldn't be in the training set.
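A minimal sketch of that setup, on made-up data (the arrays `X`, `y`, `groups`, and `X_test`, and the small parameter grid, are hypothetical). It assumes the tuning is done with `GridSearchCV`, keeps the group labels in a separate array, and passes them only through the `groups` argument of `fit`, so they are used to build the folds but never reach the forest as a feature.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)

# Hypothetical data: 120 rows, 4 real features, 12 groups of 10 rows.
X = rng.normal(size=(120, 4))
y = rng.integers(0, 2, size=120)
groups = np.repeat(np.arange(12), 10)  # kept separate, NOT a column of X

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [2, 4]},
    cv=GroupKFold(n_splits=3),
)

# groups is only used to build the folds; the forest never sees it.
search.fit(X, y, groups=groups)

# Test rows need only the 4 real features, no group column.
X_test = rng.normal(size=(5, 4))
predictions = search.predict(X_test)
```

With this arrangement the group ID cannot show up in the feature importances, since it was never one of the inputs.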