sklearn cross validation : The least populated class in y has only 1 members, which is less than n_splits=10

Question

I'm working on a machine learning project and I'm stuck on a warning that appears when I use cross-validation to find how many neighbours give the best accuracy in KNN. Here's the warning:

The least populated class in y has only 1 members, which is less than n_splits=10.

The dataset I'm using is https://archive.ics.uci.edu/ml/datasets/Student+Performance

This dataset has several attributes, but I'm using only "G1", "G2", "G3", "studytime", "freetime", "health" and "famrel". All the values in those columns are integers. https://i.stack.imgur.com/sirSl.png (dataset example)

Next, here's my first chunk of code, where I create the train and test sets:

import numpy as np
import pandas as pd
from google.colab import drive
# Import the function explicitly; "import sklearn" alone does not load submodules
from sklearn.model_selection import train_test_split

drive.mount('/gdrive')

data = pd.read_excel("/gdrive/MyDrive/Colab Notebooks/student-por.xls")

# Keep only the columns used in this experiment
data = data[["G1", "G2", "G3", "studytime", "freetime", "health", "famrel"]]
print(data)
predict = "G3"  # target column (final grade)

x = np.array(data.drop([predict], axis=1))  # features
y = np.array(data[predict])                 # labels
print(y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
print(len(y))
print(len(x))

That's how I assign x and y. With len, I can see that x and y both have 649 rows, representing 649 students.

Here's the second chunk of code, where I run the cross-validation:

# CROSS-VALIDATION
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

neighbors = list(range(2, 30))  # candidate values of k
cv_scores = []

for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=11, scoring='accuracy')
    cv_scores.append(scores.mean())

# Plot mean accuracy against k so the x-axis shows the actual neighbour count
plt.plot(neighbors, cv_scores)
plt.show()

The code is pretty self-explanatory, as you can tell.

The warning:

The least populated class in y has only 1 members, which is less than n_splits=10.

happens in every iteration of the for loop.

Although this warning appears every time, plt.show() still plots a graph showing which number of neighbours gives the best accuracy. I don't know whether the plot, or the values in cv_scores, are accurate.

My question is:

How can my "class in y" have only 1 member? len(y) clearly says y has 649 instances, more than enough to be split into 59 groups of 11 members each. Does "members" refer to instances in my dataset, or to columns/labels in y?

I'm not using stratify=y when I do the train/test split. It seems to be the number one solution to this warning, but it's useless in my case.

I've tried everything I've seen on Google and Stack Overflow and nothing helped. The dataset seems to be the problem, but I can't understand what's wrong.

Class members are neither instances nor columns; it is how many instances belong to each class. Given that, the warning (not an error) is self-explanatory: one of your (multiple) classes has only one (1) instance in the whole dataset, so by definition that class cannot be present in *each* cross-validation fold, as is the normal requirement. — desertnaut, Dec 17 '20 at 00:47
Could you point out where I have that 1 instance? Is it something related to the dataset, or something regarding my code? Also, does this warning completely invalidate the results that cross_val_score gives me? — B1N4RY B1RD, Dec 17 '20 at 12:19
What does the code have to do with this? And how can I (or anyone else) tell you so without the data? Please perform an elementary exploratory data analysis (EDA) before proceeding to ML modeling - the fact that it is considered a standard stage of the analysis (and taught in all curricula) is not a joke. — desertnaut, Dec 17 '20 at 12:43
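The check suggested in the comments can be sketched as a count of instances per distinct label. This is a minimal illustration, not the asker's data: the array `y` and the value of `n_splits` below are hypothetical stand-ins for the real G3 column and fold count.

```python
import numpy as np

# Hypothetical stand-in for the G3 column; the real values come from the dataset.
y = np.array([10, 11, 11, 12, 12, 12, 13, 13, 3, 10, 11, 12])

# Count how many instances belong to each distinct label ("class members").
labels, counts = np.unique(y, return_counts=True)
class_counts = dict(zip(labels.tolist(), counts.tolist()))
print(class_counts)

# A label with fewer instances than folds cannot appear in every fold.
n_splits = 3  # assumed fold count for this illustration
rare = [label for label, count in class_counts.items() if count < n_splits]
print("labels with fewer members than n_splits:", rare)
```

On the real data the same two lines with `np.unique` (or `pd.Series(y).value_counts()`) should show which grade values occur only once.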

score 0 · Accepted Answer · edited Dec 28 '20 at 07:35

I think your main mistake is that you are using KNeighborsClassifier while the target you want to predict seems to be continuous rather than categorical (G3 - final grade (numeric: from 0 to 20, output target)).

In this case, every distinct value of "y" is treated as a separate class or label. The message is telling you that some values in "y" appear only once in your dataset. For example, the value 3 appears only once. This is not an error, but it indicates that the model won't work correctly or accurately.

I therefore strongly recommend that you use sklearn.neighbors.KNeighborsRegressor.
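To make this concrete, here is a small sketch that reproduces the warning on synthetic data. The features, labels and fold count are all made up for illustration; the only thing taken from the question is the pattern of one label occurring a single time.

```python
import warnings

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(41, 3))  # synthetic features, not the student data
# Two labels with 20 instances each, plus one label (3) that occurs once.
y = np.array([10] * 20 + [11] * 20 + [3])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # An integer cv with a classifier uses stratified folds,
    # which is where the warning is raised.
    scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=3), X, y, cv=5, scoring="accuracy"
    )

messages = [str(w.message) for w in caught]
print(messages)
print(scores)
```

The scores are still computed, but the single instance of label 3 can only ever be in one fold, so stratification cannot be honoured for that class.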

This is the KNN variant for continuous targets rather than classes. With this model, you shouldn't have this problem anymore. With the default uniform weights, the predicted value is the mean of the target values of the nearest neighbours, using the number of neighbours you defined.

With these simple changes, your problem should be solved.
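A tiny worked example of that averaging, on made-up numbers (the arrays below are hypothetical and only chosen so the arithmetic is easy to follow):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical one-feature training set with numeric targets.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y = np.array([8.0, 10.0, 12.0, 18.0, 20.0])

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# The three training points closest to 1.2 are 1.0, 2.0 and 0.0,
# whose targets are 10, 12 and 8.
pred = knn.predict([[1.2]])[0]
print(pred)  # mean of 10, 12 and 8
```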
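As a rough sketch of what the search over k could look like with the regressor: the data below is synthetic (two invented grade columns and a noisy final grade), and mean absolute error is my choice of metric here, since accuracy does not apply to regression.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(42)
# Synthetic stand-in: two earlier grades (0-20) and a final grade near their mean.
X = rng.integers(0, 21, size=(200, 2)).astype(float)
y = X.mean(axis=1) + rng.normal(scale=1.0, size=200)

mean_mae = {}
for k in range(2, 30):
    knn = KNeighborsRegressor(n_neighbors=k)
    # scikit-learn returns the negated error so that higher is better.
    scores = cross_val_score(knn, X, y, cv=10, scoring="neg_mean_absolute_error")
    mean_mae[k] = -scores.mean()

best_k = min(mean_mae, key=mean_mae.get)
print(best_k, mean_mae[best_k])
```

Because a regressor is cross-validated with plain (non-stratified) folds, the "least populated class" warning does not come up here.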

This is it. The thing was that I needed to use the classifier version of KNN for my project, so instead of using G3 (final grade) I sorted them based on Fedu (father's education). I encoded each level of education (0 - none, 5 - the highest), so I got 5 members and could do splits of 5 members each. I still don't really know whether that would be accurate, but at least I think I understood the problem. Thank you! — B1N4RY B1RD, Jan 02 '21 at 20:20
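If a classifier is required, one possible way to avoid the warning is to cap the number of folds at the size of the smallest class. The sketch below assumes a hypothetical target with five labels (0 to 4) and invented class sizes; it is not the asker's actual Fedu column.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Hypothetical target: five labels with unequal, made-up class sizes.
y = np.repeat([0, 1, 2, 3, 4], [6, 30, 40, 25, 19])
# Synthetic features loosely tied to the label so the task is learnable.
X = rng.normal(size=(len(y), 3)) + y[:, None]

# Never ask for more folds than the smallest class has members.
min_count = np.bincount(y).min()
n_splits = int(min(10, min_count))

cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
scores = cross_val_score(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=cv, scoring="accuracy"
)
print(n_splits, scores.mean())
```

With this cap every class can contribute at least one instance to each fold, so the stratified split has nothing to warn about.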
