0

I have a dataset with 100 samples, I want to split it into 75%, 25%, 25% for both Train Validate, and Test respectively, then I want to do that again with different ratios such as 80%, 10%, 10%.

For this purpose, I was using the code down, but I think that it's not splitting the data correctly on the second step, because it will split the data from 85% to (85% x 85%), and (15% x 15%).

My question is that:

Is there a nice clear way to do the splitting in the correct way for any given ratios?

from sklearn.model_selection import train_test_split
# Split Train Test Validate
X_, X_val, Y_, Y_val = train_test_split(X, Y, test_size=0.15, random_state=42)
X_train, X_test, Y_train, Y_test = train_test_split(X_, Y_, test_size=0.15, random_state=42)
Bilal
  • 3,191
  • 4
  • 21
  • 49

1 Answers1

1

You could always do it manually. A bit messy but you can create a function

def my_train_test_split(X, y, ratio_train, ratio_val, seed=42):
    idx = np.arange(X.shape[0])
    np.random.seed(seed)
    np.random.shuffle(idx)

    limit_train = int(ratio_train * X.shape[0])
    limit_val = int((ratio_train + ratio_val) * X.shape[0])

    idx_train = idx[:limit_train]
    idx_val = idx[limit_train:limit_val]
    idx_test = idx[limit_val:]

    X_train, y_train = X[idx_train], y[idx_train]
    X_val, y_val = X[idx_val], y[idx_val]
    X_test, y_test = X[idx_test], y[idx_test]

    return X_train, X_val, X_test, y_train, y_val, y_test

Ratio test is assumed to be 1-(ratio_train+ratio_val).

Bossipo
  • 178
  • 1
  • 10
  • I can just change my code: `test_size=0.15` to `ratio = 0.15`, `test_size=ratio/(1-ratio)` but I thought if there is a more elegant way to do it. – Bilal May 25 '20 at 06:50