
I am trying to find a reliable testing method to estimate the error of my model / training parameters, but I am seeing odd results when I vary the train/test ratio.

When I change the ratio of my train/test data, the RMSE converges towards different values, see the plot below:

[Plot: running average of the RMSE, one curve per test ratio]

You can see the test ratio in the legend, top-right corner.

Zoomed:

[Zoomed-in view of the same plot]

After 50K iterations, the curves do not converge towards the same value.

Here is the code:

import time
import sys
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

np.random.seed(int(time.time()))

def seed():
    # draw a fresh random seed for each split / model fit
    return np.random.randint(2**32 - 1)


n_scores_per_test = 50000
test_ratios = [.1, .2, .4, .6, .8]
model = Lasso(alpha=0.0005, random_state=seed(), tol=0.00001, copy_X=True)

# load our training data
train = pd.read_csv('train.csv')
X = train[['OverallCond']].values
y = np.log(train['SalePrice'].values)

# custom RMSE
def rmse(y_predicted, y_actual):
    tmp = np.power(y_actual - y_predicted, 2) / y_actual.size
    return np.sqrt(np.sum(tmp, axis=0))


for test_ratio in test_ratios:
    print('Testing test ratio:', test_ratio)

    scores = []
    avg_scores = []
    for i in range(n_scores_per_test):
        if i % 200 == 0:
            print(i, '/', n_scores_per_test)

        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_ratio, random_state=seed())

        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        scores.append(rmse(y_pred, y_test))
        avg_scores.append(np.array(scores).mean())

    plt.plot(avg_scores, label=str(test_ratio))

plt.legend(loc='upper right')
plt.show()
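
As a quick sanity check (not part of the script above, just my assumption of what an equivalent check would look like), the custom rmse agrees with the square root of sklearn's mean_squared_error on a toy example:

from sklearn.metrics import mean_squared_error

# toy vectors, only to confirm the two formulas agree
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.7])

assert np.isclose(rmse(y_hat, y_true),
                  np.sqrt(mean_squared_error(y_true, y_hat)))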

Any idea why they don't all converge nicely together?

See https://github.com/benji/rmse_convergence/

UPDATE:

  • using selection='random' for Lasso (the combined configuration is sketched below)
  • using random_state in Lasso
  • using random_state in train_test_split
  • removed the redundant shuffle()
  • setting a low tol on the Lasso model
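
The combined configuration, sketched (selection='random' is the only piece not visible in the snippet above, so take this as an assumption of what the updated model looks like):

model = Lasso(alpha=0.0005, selection='random', random_state=seed(),
              tol=0.00001, copy_X=True)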
  • The test set is different in each split, so you cannot compare them. – Dr. Snoopy Feb 16 '18 at 08:50
  • train_test_split shuffles the data by default and you can also specify a random seed inside it, e.g. random_state=n, where n is an integer. – KRKirov Feb 16 '18 at 10:46
  • @MatiasValdenegro You mean the test set size? Why would the test size influence the model's average error? – benji Feb 16 '18 at 16:24
  • @KRKirov The shuffle prior to train_test_split() should do the job. – benji Feb 16 '18 at 16:25
  • @ben, my point was that you probably don't need it. train_test_split shuffles anyway. The same applies to random_state - you can specify it inside train_test_split. – KRKirov Feb 16 '18 at 17:00
  • @ben The size also matters; I mean that each test set has different samples, so that is one reason why the test scores are different. If you want to compare, you have to use the same test set. – Dr. Snoopy Feb 16 '18 at 17:16
  • @MatiasValdenegro I'm sorry but I disagree. The same model is trying to fit the same data. Over a large enough number of iterations over randomly selected subsets it should eventually converge towards the same value. – benji Feb 16 '18 at 17:35
  • @ben No, you are assuming that they should converge to the same value, but that's not true if you change the training and test sets. You said it's the same data, but it's not. – Dr. Snoopy Feb 16 '18 at 17:36
  • @MatiasValdenegro The training set and the test set are both taken from the same dataset X. Only the train/test ratio changes. As an analogy, if you take the average of means of N numbers taken from the same array, they would eventually all converge towards the same value regardless of N, namely the average of the whole array. – benji Feb 28 '18 at 01:06
  • @ben Your analogy is invalid in this case; a learning algorithm actually *learns* from the training data. You change the training data and what is learned changes. Now you also change the test set, which adds more variability, so your results cannot be compared. If you cannot understand this then I recommend taking a look at good scientific practices in Machine Learning. – Dr. Snoopy Feb 28 '18 at 01:14
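
To make the averaging analogy from the comments concrete, here is a minimal sketch with synthetic numbers (a stand-in array, not the actual house-price data): the average of subset means converges to the same value regardless of the subset size.

import numpy as np

rng = np.random.RandomState(0)
data = rng.normal(loc=12.0, scale=0.4, size=1500)   # stand-in for log(SalePrice)

for frac in [.1, .2, .4, .6, .8]:
    k = int(len(data) * frac)
    subset_means = [rng.choice(data, size=k, replace=False).mean()
                    for _ in range(5000)]
    # every fraction ends up close to the mean of the full array
    print(frac, np.mean(subset_means), data.mean())

Whether the same intuition carries over once a model is fit on the training portion is what the discussion above is about.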
