
I am working on a classification problem whose evaluation metric is ROC AUC. So far I have tried XGBoost with different parameters. Here is the function I used to sample the data; you can find the relevant notebook here (Google Colab).

import numpy as np
import pandas as pd

def get_data(x_train, y_train, shuffle=False):

  if shuffle:
    total_train = pd.concat([x_train, y_train], axis=1)

    # generate len(total_train) random indices in range(0, len(total_train))
    n = np.random.randint(0, len(total_train), size=len(total_train))
    x_train = total_train.iloc[n]
    y_train = total_train.iloc[n]['is_pass']
    x_train.drop('is_pass', axis=1, inplace=True)

    # keep the first 1000 rows as test data
    x_test = x_train.iloc[:1000]
    # keep the 1000 to 10000 rows as validation data
    x_valid = x_train.iloc[1000:10000]
    x_train = x_train.iloc[10000:]

    y_test = y_train[:1000]
    y_valid = y_train[1000:10000]
    y_train = y_train.iloc[10000:]

    return x_train, x_valid, x_test, y_train, y_valid, y_test

  else:
    # keep the first 1000 rows as test data
    x_test = x_train.iloc[:1000]
    # keep the 1000 to 10000 rows as validation data
    x_valid = x_train.iloc[1000:10000]
    x_train = x_train.iloc[10000:]

    y_test = y_train[:1000]
    y_valid = y_train[1000:10000]
    y_train = y_train.iloc[10000:]

    return x_train, x_valid, x_test, y_train, y_valid, y_test 

Here are the two outputs I get after running on shuffled and non-shuffled data:

AUC with shuffling:  0.9021756235738453
AUC without shuffling:  0.8025162142685565

Can you find out what the issue is here?

ksai
  • Underfitting, perhaps? So the accuracy depends on random factors (such as order of evaluation in the training routine) instead of predictive parameters. – Emil Vikström Jun 13 '18 at 07:50

1 Answer


The problem is in your implementation of shuffling: np.random.randint samples integers with replacement, so the same indices can be drawn more than once, and the same rows end up in both your train set and your test/validation sets. You should use np.random.permutation instead (and consider using np.random.seed to make the outcome reproducible).
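A minimal sketch of a leakage-free version of the question's function using np.random.permutation (the column name 'is_pass' and the split points are taken from the question; the seed value is an arbitrary choice for reproducibility):

```python
import numpy as np
import pandas as pd

def shuffled_split(x_train, y_train, seed=42):
    """Shuffle rows without replacement, then split into train/valid/test."""
    total = pd.concat([x_train, y_train], axis=1)

    rng = np.random.RandomState(seed)     # reproducible shuffle
    idx = rng.permutation(len(total))     # every row appears exactly once
    total = total.iloc[idx].reset_index(drop=True)

    y = total['is_pass']
    x = total.drop('is_pass', axis=1)

    # same split points as in the question:
    # first 1000 rows -> test, rows 1000:10000 -> valid, rest -> train
    return (x.iloc[10000:], x.iloc[1000:10000], x.iloc[:1000],
            y.iloc[10000:], y.iloc[1000:10000], y.iloc[:1000])
```

Because permutation draws each index exactly once, the three splits are guaranteed to be disjoint, which removes the inflated AUC you saw with shuffling.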

Another note: you have a very large gap in performance between the training set and the validation/test sets (training shows almost perfect ROC AUC). I would guess this is due to the very high maximum tree depth (14) that you allow for a dataset of this size (~60K rows).
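For illustration, a more conservative parameter set for a dataset of that size might look like the sketch below; the exact values are hypothetical starting points, not tuned results:

```python
# Illustrative, more conservative xgboost parameters for ~60K rows.
# The specific values are assumptions, not tuned for this dataset.
params = {
    "max_depth": 6,               # down from 14 to reduce overfitting
    "eta": 0.1,                   # learning rate
    "subsample": 0.8,             # row subsampling per tree
    "colsample_bytree": 0.8,      # feature subsampling per tree
    "objective": "binary:logistic",
    "eval_metric": "auc",
}
```

Lowering max_depth (and adding subsampling) shrinks the train/validation gap at the cost of a slightly less flexible model.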

P.S. Thanks for sharing the Colaboratory link; I was not aware of it, but it is very useful.

Mischa Lisovyi