Is there a Python function for splitting the dataset?

Question

I am trying to figure out how to split my data into a training and a test set so that the mean of one feature, for example price, is the same in both sets. How can I do this? Thanks!

score 0 · Answer 1 · answered May 24 '22 at 15:27

This is usually not necessary, because with reasonably large training and test sets sampled randomly, the means of any numerical feature are not expected to differ much between training and test set (due to the law of large numbers).
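To get a feel for how small the gap typically is, here is a minimal simulation sketch that uses only the standard library. The helper name `mean_gap`, the normal price distribution (mean 50, standard deviation 10), and the 20% test share are assumptions chosen for illustration, not part of the question.

```python
import random
import statistics


def mean_gap(n_rows, test_share=0.2, rng=None):
    """Absolute difference between train and test mean for one random split."""
    if rng is None:
        rng = random.Random()
    prices = [rng.gauss(50, 10) for _ in range(n_rows)]  # synthetic prices
    rng.shuffle(prices)  # random split: first part is test, rest is train
    n_test = int(n_rows * test_share)
    test, train = prices[:n_test], prices[n_test:]
    return abs(statistics.fmean(test) - statistics.fmean(train))


rng = random.Random(0)
for n_rows in (100, 1_000, 10_000):
    gaps = [mean_gap(n_rows, rng=rng) for _ in range(50)]
    avg = statistics.fmean(gaps)
    print(f"N = {n_rows:>6}: average gap over 50 random splits = {avg:.3f}")
```

The average gap should shrink as N grows, which is the law-of-large-numbers effect described above.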

However, you can make the two means agree even more closely (on average) by using stratified sampling to draw the test set. Here is one simple way to do this:

1. Sort the price column (including the indices) by the price values.
2. Divide these sorted indices into consecutive, equally sized groups (a.k.a. strata), one group for every row that the test set should have.
3. Sample one index uniformly at random from each group. This gives you the indices for the test set.
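The three steps above can be sketched as a small standalone function. This is only an illustration in plain Python: the name `stratified_test_indices` is hypothetical, the prices are synthetic, and the function assumes that `n_test` is not larger than the number of rows.

```python
import random
import statistics


def stratified_test_indices(values, n_test, rng):
    """Pick n_test indices, one from each consecutive group of sorted values."""
    n = len(values)
    # Step 1: sort the row indices by their value.
    order = sorted(range(n), key=lambda i: values[i])
    # Step 2: cut the sorted indices into n_test consecutive strata.
    # Integer arithmetic on the bounds also handles n not divisible by n_test.
    bounds = [k * n // n_test for k in range(n_test + 1)]
    strata = [order[lo:hi] for lo, hi in zip(bounds, bounds[1:])]
    # Step 3: draw one index uniformly at random from each stratum.
    return [rng.choice(stratum) for stratum in strata]


rng = random.Random(0)
prices = [rng.gauss(50, 10) for _ in range(1_000)]  # synthetic prices
test_ix = stratified_test_indices(prices, n_test=200, rng=rng)
test_set = set(test_ix)
train_ix = [i for i in range(len(prices)) if i not in test_set]

print(f"test mean:  {statistics.fmean([prices[i] for i in test_ix]):.3f}")
print(f"train mean: {statistics.fmean([prices[i] for i in train_ix]):.3f}")
```

Because every stratum contributes exactly one row, the test set covers the whole price range evenly, which is what keeps its mean close to the overall mean.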

Comparing the two approaches:

import numpy as np
import pandas as pd

np.random.seed(42)

N = 10_000  # number of rows overall
PROP_TEST = 0.2  # proportion of rows that should end up in the test set

n_test = int(N * PROP_TEST)  # number of rows that should end up in the test set
n_strat = int(1 / PROP_TEST)  # number of rows per stratum

df = pd.DataFrame({'price': np.random.normal(loc=50, scale=10, size=N)})

print('Random Sampling')
ix_test = np.random.choice(df.index, size=n_test, replace=False)
print(f"mean price in test set:     {df.loc[ix_test, 'price'].mean():.3f}")
ix_train = df.index.difference(ix_test)  # all rows that are not in the test set
print(f"mean price in training set: {df.loc[ix_train, 'price'].mean():.3f}")
print()

print('Stratified Sampling')
price_sorted = df['price'].sort_values()
ix_partition = np.split(price_sorted.index, n_test)
sub_ix_test = np.random.choice(range(n_strat), size=n_test, replace=True)
ix_test = [part[sub_ix] for part, sub_ix in zip(ix_partition, sub_ix_test)]
print(f"mean price in test set:     {df.loc[ix_test, 'price'].mean():.3f}")
ix_train = df.index.difference(ix_test)  # all rows that are not in the test set
print(f"mean price in training set: {df.loc[ix_train, 'price'].mean():.3f}")

Random Sampling
mean price in test set:     50.015
mean price in training set: 49.970

Stratified Sampling
mean price in test set:     49.976
mean price in training set: 49.979

This looks like a fairly typical result to me, but you may want to try different values for N, the random seed, and the price distribution, to see how those affect the comparison.
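As for a ready-made function: one option, not used in the code above, is `train_test_split` from scikit-learn. Its `stratify` argument expects class labels, so a continuous column such as price has to be binned first, for example with `pandas.qcut`. A minimal sketch, assuming scikit-learn is installed; the 10 quantile bins and the synthetic data are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.normal(loc=50, scale=10, size=10_000)})

# Bin the continuous price into 10 equally populated quantile bins;
# labels=False returns the integer bin number for each row.
price_bin = pd.qcut(df["price"], q=10, labels=False)

# stratify keeps the share of each bin the same in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=price_bin, random_state=0
)

print(f"mean price in test set:     {test_df['price'].mean():.3f}")
print(f"mean price in training set: {train_df['price'].mean():.3f}")
```

These strata are much coarser than one stratum per test row, so the means can be expected to agree less tightly than with the manual approach, though still more closely than with a purely random split.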
