I am creating a program that uses past datasets to predict an employee's salary for any job. I receive the warning "Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5."

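import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

p_train, p_test, t_train, t_test = train_test_split(predictors, target, test_size=0.25, random_state=1)
model = KNeighborsClassifier()
param_grid = {'n_neighbors': np.arange(1, 25)}
modelGSCV = GridSearchCV(model, param_grid, cv=5)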

Here is where I tried splitting and received the warning. I am pretty new to machine learning, so I would appreciate it if anyone could guide me on how to fix this.

Aditya S

1 Answer

From the GridSearchCV documentation:

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

You must have a multiclass classification problem, where each distinct value in the target counts as its own class. Since StratifiedKFold is used with cv=5, you need at least 5 examples of each class in your data. If any class has fewer than 5 examples, this warning is raised.
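To see where the message comes from, here is a minimal sketch that triggers it directly with StratifiedKFold. The toy arrays `X` and `y` are made up for illustration; one class has a single member, as in the question. The exact wording of the message may differ between scikit-learn versions.

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: class 0 has 10 samples, class 1 has only 1.
X = np.zeros((11, 1))
y = np.array([0] * 10 + [1])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # The check runs when the folds are actually generated.
    folds = list(StratifiedKFold(n_splits=5).split(X, y))

# Collect the text of every warning raised while splitting.
messages = [str(w.message) for w in caught]
print(messages)
```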

A simple solution would be to drop classes with fewer than 5 examples or to reduce the number of folds.
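Both options can be sketched roughly as follows. The data here is synthetic and the variable names (`X`, `y`, `keep`, `reduced_splits`) are placeholders, not taken from the question; in the original code they would correspond to the training predictors and target.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: classes 0 and 1 have 20 samples each, class 2 has 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(42, 3))
y = np.array([0] * 20 + [1] * 20 + [2] * 2)

# Count the samples in each class and find the smallest class.
classes, counts = np.unique(y, return_counts=True)
min_count = int(counts.min())

n_splits = 5

# Option 1: drop every class with fewer than n_splits samples.
keep = np.isin(y, classes[counts >= n_splits])
X_kept, y_kept = X[keep], y[keep]

# Option 2: reduce the number of folds to the smallest class size.
# Cross-validation needs at least 2 folds, so this cannot help
# when a class has a single member.
reduced_splits = max(2, min(n_splits, min_count))

# Grid search on the filtered data (option 1) with the original 5 folds.
param_grid = {"n_neighbors": np.arange(1, 10)}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=n_splits)
search.fit(X_kept, y_kept)
print(search.best_params_)
```

Note that in the question the smallest class has only 1 member, so reducing the number of folds alone would not be enough there; that class would have to be dropped or merged with another.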

Sesquipedalism
  • How would I do that? – Aditya S Jul 11 '19 at 22:51
  • Are you using pandas? Assuming yes, as a starting point, you could group by the class column and count how many rows there are for each class. For example: df["CLASS_COUNT"] = df.groupby("TARGET_COL")["TARGET_COL"].transform("count"). Then filter your DataFrame to the rows where "CLASS_COUNT" >= 5. – Sesquipedalism Jul 12 '19 at 13:57
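A minimal sketch of the pandas approach from the comment, assuming the data lives in a DataFrame. The column names "TARGET_COL" and "CLASS_COUNT" are the comment's placeholders, and the toy data is invented. The target column is selected before transform so that the result is a single Series that can be assigned to one column.

```python
import pandas as pd

# Hypothetical data: class "a" has 6 rows, "b" has 5, "c" has only 1.
df = pd.DataFrame({
    "feature": range(12),
    "TARGET_COL": ["a"] * 6 + ["b"] * 5 + ["c"],
})

# For each row, count how many rows share that row's class.
df["CLASS_COUNT"] = df.groupby("TARGET_COL")["TARGET_COL"].transform("count")

# Keep only rows whose class has at least 5 examples, then drop the helper column.
filtered = df[df["CLASS_COUNT"] >= 5].drop(columns="CLASS_COUNT")
print(filtered["TARGET_COL"].value_counts())
```

Filtering before the train/test split keeps the rare classes out of both sets; whether dropping those rows is acceptable depends on the data.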