For this code:

#x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
train = [x_train, y_train] 

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_28063/1294340868.py in <module>
      1 #x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)
----> 2 x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42, stratify=y)
      3 train = [x_train, y_train]

/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in train_test_split(test_size, train_size, random_state, shuffle, stratify, *arrays)
   2441         cv = CVClass(test_size=n_test, train_size=n_train, random_state=random_state)
   2442 
-> 2443         train, test = next(cv.split(X=arrays[0], y=stratify))
   2444 
   2445     return list(

/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in split(self, X, y, groups)
   1598         """
   1599         X, y, groups = indexable(X, y, groups)
-> 1600         for train, test in self._iter_indices(X, y, groups):
   1601             yield train, test
   1602 

/SeaExp/mona/venv/dpcc/lib/python3.8/site-packages/sklearn/model_selection/_split.py in _iter_indices(self, X, y, groups)
   1938         class_counts = np.bincount(y_indices)
   1939         if np.min(class_counts) < 2:
-> 1940             raise ValueError(
   1941                 "The least populated class in y has only 1"
   1942                 " member, which is too few. The minimum"

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

I don't get an error if I use the line below instead:

x_train, x_val, y_train, y_val=train_test_split(x,y,test_size=0.3, random_state=42)

However, my intention is to do stratified 5-fold cross-validation. How should I achieve that? I understand that some target values in my y have only 1 item, and stratification needs more than 1 item per class. How can I group these bins together?

Here's what the normalized histogram of my target y looks like: [image: normalized histogram of y]

Here's the non-normalized plot of y: [image: non-normalized histogram of y]

Here's a snippet of y's distribution. As you can see, a lot of target values have only 1 item in their bin. [image: snippet of y's value counts]

Update: I found this code from the verstack package; however, I do not know how to do 5-fold cross-validation with it.

x_train, x_val, y_train, y_val = scsplit(x, y, stratify = y, test_size=0.3, random_state=42)
train = [x_train, y_train] 
Mona Jalal

1 Answer

You cannot perform a stratified split because some values occur only once, so they cannot be distributed across both the train and test sets.
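To see how widespread the problem is, you can count how many distinct target values occur only once. This is just a quick sketch: the synthetic `y` below is an assumption standing in for your real target, and `n_singletons` is a name made up for illustration.

```python
import numpy as np

# Stand-in for a continuous regression target (assumption: replace with your own y)
rng = np.random.default_rng(0)
y = rng.normal(size=200)

# Count how often each distinct target value occurs
values, counts = np.unique(y, return_counts=True)

# Any value with fewer than 2 occurrences makes stratify=y fail
n_singletons = int((counts < 2).sum())
print(f"{n_singletons} of {len(values)} distinct target values occur only once")
```

With a truly continuous target, nearly every value is its own "class", which is why stratifying directly on `y` raises the error.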

One solution would be to bin this continuous variable into intervals using KBinsDiscretizer and perform the stratified split on the binned values, as follows:

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

X, y = make_regression()
y_discretized = KBinsDiscretizer(n_bins=10,
                                 encode='ordinal',
                                 strategy='uniform').fit_transform(y.reshape(-1, 1))

X_train, X_val, y_train, y_val = train_test_split(X, y, 
                                                  test_size=0.3,
                                                  random_state=42,
                                                  stratify=y_discretized)
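For the 5-fold part of the question, the same binned target can probably be passed to `StratifiedKFold`. The sketch below is one possible way to do it, not a tested recipe for your data. It assumes synthetic data from `make_regression`, and it uses `strategy='quantile'` instead of `'uniform'` (my assumption) because quantile bins hold roughly equal counts, whereas uniform-width bins on a skewed target can still end up with a single member.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in data (assumption: replace with your own X and y)
X, y = make_regression(n_samples=200, random_state=42)

# Bin the continuous target; quantile bins keep the bin sizes roughly equal
binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)

# Stratify on the bin labels, but keep the original continuous y for training
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y_binned)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    folds.append((train_idx, val_idx))
    print(f"fold {fold}: {len(train_idx)} train rows, {len(val_idx)} validation rows")
```

The bin labels are only used to choose the indices; the model in each fold would still be fit on the continuous `y_train`. If some bins are still too small on your data, lowering `n_bins` is the simplest thing to try.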
Antoine Dubuis