For this code:
#x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
train = [x_train, y_train]
I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_28063/1294340868.py in <module>
1 #x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
----> 2 x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
3 train = [x_train, y_train]
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
2441 cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
2442
-> 2443 train, test = next(cv.split(X=arrays[0], y=stratify))
2444
2445 return list(
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
1598 """
1599 X, y, groups = indexable(X, y, groups)
-> 1600 for train, test in self._iter_indices(X, y, groups):
1601 yield train, test
1602
/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
1938 class_counts = np.bincount(y_indices)
1939 if np.min(class_counts) < 2:
-> 1940 raise ValueError(
1941 "The least populated class in y has only 1"
1942 " member, which is too few. The minimum"
ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
I don't get an error if I use the line below instead:
x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
However, my goal is stratified 5-fold cross-validation. How can I achieve that? I understand that some target values in my y occur only once, and stratification needs at least 2 members per class. How can I group these sparse bins together?
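One direction that might work, shown as a minimal sketch rather than a definitive answer: bin the continuous target into quantile bins so that every bin has plenty of members, then stratify `StratifiedKFold` on the bin labels instead of on the raw y. The synthetic `x` and `y`, the choice of 10 bins, and the name `y_binned` are assumptions made for illustration and are not taken from my real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real data (assumption): a skewed continuous target.
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 3))
y = rng.exponential(scale=10.0, size=200)

# Group the target into 10 quantile bins so no bin is left with a single member.
# labels=False returns integer bin codes; duplicates="drop" merges bins whose
# edges coincide (this can happen when many target values are identical).
y_binned = pd.qcut(y, q=10, labels=False, duplicates="drop")

# Stratify on the bin labels, but keep the original continuous y for training.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(x, y_binned)):
    x_train, x_val = x[train_idx], x[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    fold_sizes.append(len(val_idx))
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```

The number of bins is a trade-off: fewer bins guarantee at least 5 members per bin (one per fold), while more bins keep each fold's target distribution closer to the full one.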
Here's what the normalized histogram of my target y looks like:
Here's the non-normalized plot of y as well:
Here's a snippet of y's distribution. As you can see, many target values have only 1 item in their bin.
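To put a number on how many target values are affected, a small check like the one below could be used. It is only a sketch: the array `y` here is a made-up stand-in, not my actual target.

```python
import numpy as np

# Hypothetical stand-in for the real target (assumption).
y = np.array([0.5, 0.5, 1.2, 3.4, 3.4, 3.4, 7.8, 9.1])

# Count how often each distinct target value occurs.
values, counts = np.unique(y, return_counts=True)

# Values that occur fewer than 2 times are the ones that break stratify=y.
singletons = values[counts < 2]
print(f"{len(singletons)} of {len(values)} distinct target values occur only once")
```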
Update: I found the code below, which uses the verstack package, but I don't know how to do 5-fold cross-validation with it.
from verstack.stratified_continuous_split import scsplit
x_train, x_val, y_train, y_val = scsplit(x, y, stratify=y, test_size=0.3, random_state=42)
train = [x_train, y_train]