
I am working on a classification problem whose evaluation metric is ROC AUC. So far I have tried XGBoost with different parameters. Here is the function I used to sample the data; you can find the relevant notebook here (Google Colab).

import numpy as np
import pandas as pd

def get_data(x_train, y_train, shuffle=False):

  if shuffle:
    total_train = pd.concat([x_train, y_train], axis=1)

    # generate len(total_train) random indices in range(0, len(total_train))
    n = np.random.randint(0, len(total_train), size=len(total_train))
    x_train = total_train.iloc[n]
    y_train = total_train.iloc[n]['is_pass']
    x_train.drop('is_pass', axis=1, inplace=True)

    # keep the first 1000 rows as test data
    x_test = x_train.iloc[:1000]
    # keep the 1000 to 10000 rows as validation data
    x_valid = x_train.iloc[1000:10000]
    x_train = x_train.iloc[10000:]

    y_test = y_train[:1000]
    y_valid = y_train[1000:10000]
    y_train = y_train.iloc[10000:]

    return x_train, x_valid, x_test, y_train, y_valid, y_test

  else:
    # keep the first 1000 rows as test data
    x_test = x_train.iloc[:1000]
    # keep the 1000 to 10000 rows as validation data
    x_valid = x_train.iloc[1000:10000]
    x_train = x_train.iloc[10000:]

    y_test = y_train[:1000]
    y_valid = y_train[1000:10000]
    y_train = y_train.iloc[10000:]

    return x_train, x_valid, x_test, y_train, y_valid, y_test 

Here are the two outputs I get after running on shuffled and non-shuffled data:

AUC with shuffling:  0.9021756235738453
AUC without shuffling:  0.8025162142685565

Can you find out what the issue is here?

ksai
  • Underfitting, perhaps? So the accuracy depends on random factors (such as order of evaluation in the training routine) instead of predictive parameters. – Emil Vikström Jun 13 '18 at 07:50

1 Answer


The problem is in your implementation of shuffling: np.random.randint samples integers with replacement, so the same indices can be drawn more than once, and the same rows end up in both your train set and your test/validation sets. You should use np.random.permutation instead (and consider using np.random.seed to make the outcome reproducible).
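A minimal sketch of a leakage-free version of the question's function using np.random.permutation (the column name 'is_pass' and the split points are taken from the question; the seed value is an arbitrary choice for reproducibility):

```python
import numpy as np
import pandas as pd

def shuffled_split(x_train, y_train, seed=42):
    """Shuffle rows without replacement, then split into train/valid/test."""
    total = pd.concat([x_train, y_train], axis=1)

    rng = np.random.RandomState(seed)     # reproducible shuffle
    idx = rng.permutation(len(total))     # every row appears exactly once
    total = total.iloc[idx].reset_index(drop=True)

    y = total['is_pass']
    x = total.drop('is_pass', axis=1)

    # same split points as in the question:
    # first 1000 rows -> test, rows 1000:10000 -> valid, rest -> train
    return (x.iloc[10000:], x.iloc[1000:10000], x.iloc[:1000],
            y.iloc[10000:], y.iloc[1000:10000], y.iloc[:1000])
```

Because permutation draws each index exactly once, the three splits are guaranteed to be disjoint, which removes the inflated AUC you saw with shuffling.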

Another note: you have a very large gap in performance between the training set and the validation/test sets (training shows almost perfect ROC AUC). I would guess this is due to the very high maximum tree depth (14) that you allow for a dataset of this size (~60K rows).
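For illustration, a more conservative parameter set for a dataset of that size might look like the sketch below; the exact values are hypothetical starting points, not tuned results:

```python
# Illustrative, more conservative xgboost parameters for ~60K rows.
# The specific values are assumptions, not tuned for this dataset.
params = {
    "max_depth": 6,               # down from 14 to reduce overfitting
    "eta": 0.1,                   # learning rate
    "subsample": 0.8,             # row subsampling per tree
    "colsample_bytree": 0.8,      # feature subsampling per tree
    "objective": "binary:logistic",
    "eval_metric": "auc",
}
```

Lowering max_depth (and adding subsampling) shrinks the train/validation gap at the cost of a slightly less flexible model.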

P.S. Thanks for sharing the Colaboratory link; I was not aware of it, but it is very useful.

Mischa Lisovyi