ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

This is the error I got from the following code:

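# Imports needed by the code below
import pandas as pd
from sklearn.linear_model import (LogisticRegression, RidgeClassifier,
                                  SGDClassifier, PassiveAggressiveClassifier)
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, BaggingClassifier,
                              ExtraTreesClassifier)
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from xgboost import XGBClassifier

# List of machine learning algorithms that will be used for predictions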
estimator = [('Logistic Regression', LogisticRegression), ('Ridge Classifier', RidgeClassifier), 
             ('SGD Classifier', SGDClassifier), ('Passive Aggressive Classifier', PassiveAggressiveClassifier), 
             ('SVC', SVC), ('Linear SVC', LinearSVC), ('Nu SVC', NuSVC), 
             ('K-Neighbors Classifier', KNeighborsClassifier),
             ('Gaussian Naive Bayes', GaussianNB), ('Multinomial Naive Bayes', MultinomialNB), 
             ('Bernoulli Naive Bayes', BernoulliNB), ('Complement Naive Bayes', ComplementNB), 
             ('Decision Tree Classifier', DecisionTreeClassifier), 
             ('Random Forest Classifier', RandomForestClassifier), ('AdaBoost Classifier', AdaBoostClassifier), 
             ('Gradient Boosting Classifier', GradientBoostingClassifier), ('Bagging Classifier', BaggingClassifier), 
             ('Extra Trees Classifier', ExtraTreesClassifier), ('XGBoost', XGBClassifier)]

# Separating independent features and dependent feature from the dataset
#X_train = titanic.drop(columns='Survived')
#y_train = titanic['Survived']

# Creating a dataframe to compare the performance of the machine learning models
comparison_cols = ['Algorithm', 'Training Time (Avg)', 'Accuracy (Avg)', 'Accuracy (3xSTD)']
comparison_df = pd.DataFrame(columns=comparison_cols)

# Generating training/validation dataset splits for cross validation
cv_split = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

# Performing cross-validation to estimate the performance of the models
for idx, est in enumerate(estimator):

    cv_results = cross_validate(est[1](), X, y, cv=cv_split)

    comparison_df.loc[idx, 'Algorithm'] = est[0]
    comparison_df.loc[idx, 'Training Time (Avg)'] = cv_results['fit_time'].mean()
    comparison_df.loc[idx, 'Accuracy (Avg)'] = cv_results['test_score'].mean()
    comparison_df.loc[idx, 'Accuracy (3xSTD)'] = cv_results['test_score'].std() * 3

comparison_df.set_index(keys='Algorithm', inplace=True)
comparison_df.sort_values(by='Accuracy (Avg)', ascending=False, inplace=True)

I suspect the cv_split part is causing the problem.
I found a suggestion to use train_test_split instead, but it does not return multiple splits the way cv_split does.

The strange thing is that this same code worked fine on another Kaggle problem,
so I compared the shapes of X and y for both Kaggle problems.

Kaggle kernel with no problem:
print(X.shape)
print(y.shape)
(891, 9)
(891,)
array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1.....])

=============================================================

Kaggle kernel with the problem (error):
print(X.shape)
print(y.shape)
(15035, 24)
(15035,)
array([221900., 180000., 510000., ..., 360000., 400000., 325000.])

The shapes in both kernels look the same to me.
I cannot see the difference between the X and y of those two kernels.

Does anyone have an idea where this error is coming from?
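
Shape alone does not show what a stratified splitter sees, because it treats every distinct value in y as its own class. A minimal sketch of a check that goes beyond shape is below; the two arrays are short hypothetical stand-ins for the targets printed above, and `smallest_class_size` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

# Hypothetical stand-ins for the two targets shown above.
y_labels = np.array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1])   # class labels
y_prices = np.array([221900., 180000., 510000.,
                     360000., 400000., 325000.])      # continuous values


def smallest_class_size(y):
    """Return the number of samples in the least populated 'class' of y."""
    _, counts = np.unique(y, return_counts=True)
    return int(counts.min())


print(smallest_class_size(y_labels))  # every label occurs several times
print(smallest_class_size(y_prices))  # every price occurs exactly once
```

If the second number is 1, the target has values that occur only once, which matches the wording of the error message.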

2 Answers

Is your y picking up index values? I am not sure, but you can try StratifiedKFold instead. The following worked for me:

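# shuffle=True is required for random_state to take effect in recent scikit-learn versions
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X_train, y_train, cv=kfold)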
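
For reference, a self-contained sketch of the same call is below. It assumes a classification target; the synthetic data from `make_classification` and the choice of `LogisticRegression` are illustrative assumptions, not taken from the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (assumption, for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000)

# Ten stratified folds; each fold keeps roughly the same class proportions.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X_demo, y_demo, cv=kfold)

print(results.mean())  # mean accuracy across the ten folds
```

Note that stratified splitters expect class labels in y, so this sketch may not carry over unchanged to a continuous target.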

Shivangi

I had a similar error while using train_test_split. It was because I had passed stratify=data (the feature matrix) instead of stratify=target (the labels).
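
A minimal sketch of the corrected call, using small made-up arrays named `data` and `target` to mirror the parameter names above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up example: 20 samples, 2 features, two balanced classes.
data = np.arange(40).reshape(20, 2)
target = np.array([0] * 10 + [1] * 10)

# Stratify on the labels, not on the feature matrix.
X_tr, X_te, y_tr, y_te = train_test_split(
    data, target, test_size=0.3, stratify=target, random_state=0
)

print(np.bincount(y_tr), np.bincount(y_te))  # class counts in each split
```

Passing `data` to `stratify` would ask the splitter to balance on the feature values themselves, which are often unique per row.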

Tambe Tabitha