
I am new to machine learning.

I have a continuous dataset, and I am trying to model the target label using several features. I use the train_test_split function to split the data into train and test sets. I am training and testing the model using the code below:

from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
model.compile(loss='mean_squared_error', optimizer='adam')  # compile is required before fit; MSE/Adam assumed for the continuous target
hist = model.fit(X_train.values, y_train.values, validation_data=(X_test.values, y_test.values), epochs=200, batch_size=64, verbose=1)

I can get good results when I use X_test and y_test for validation data:

https://drive.google.com/open?id=0B-9aw4q1sDcgNWt5TDhBNVZjWmc

However, when I use this model to predict on other data (X_real, y_real), which is not very different from X_test and y_test except that it was not randomly chosen by train_test_split, I get bad results:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential()
model.add(Dense(128, input_dim=X.shape[1], kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
model.compile(loss='mean_squared_error', optimizer='adam')  # same assumed compile step as above
hist = model.fit(X_train.values, y_train.values, validation_data=(X_real.values, y_real.values), epochs=200, batch_size=64, verbose=1)

https://drive.google.com/open?id=0B-9aw4q1sDcgYWFZRU9EYzVKRFk
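
To make the comparison concrete, one rough check (not something I ran above, just an illustration) is to evaluate the same trained model on both sets:

# evaluate returns the compiled loss (assumed MSE above) on each dataset
test_loss = model.evaluate(X_test.values, y_test.values, verbose=0)
real_loss = model.evaluate(X_real.values, y_real.values, verbose=0)
print(test_loss, real_loss)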

Is it an issue of overfitting? If so, why does my model work OK with the X_test and y_test generated by train_test_split?

Yahya

2 Answers


It seems that your "real data" differs from your train and test data. Why do you have separate "real" and "training" data in the first place?

My approach would be:

1: Mix up all the data you have.

2: Divide your data randomly into 3 sets (train, test and validate); see the sketch after this list.

3: Use train and test like you do now and optimize your classifier.

4: When it is good enough, validate the classifier with your validation set to make sure no overfitting occurs.
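
A minimal sketch of steps 1 and 2, assuming pandas DataFrames X and y as in your question (the 60/20/20 fractions are just an example):

from sklearn.model_selection import train_test_split

# first hold out a validation set, then split the remainder into train and test;
# shuffle=True mixes the rows before each split
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, shuffle=True, random_state=42)
# result: 60% train, 20% test, 20% validation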

Florian H
  • Sorry, I am not good with the terminology. You can treat my "real_data" as "validation data". Why do we need validation data? Isn't the test data enough? How come the model fails on the "validation data" although it is OK with the "test data"? I know that results on the test data have high variance, but in my case it is OK with the test data every time I run the code, yet it fails with the validation data. Btw, my dataset is a time series. – Yahya Oct 18 '17 at 06:15
  • Your guess was an overfitting issue. In other words, you train your classifier until it fits your test data. To be sure that your good result does not only hold for your test data, you can keep some unseen data aside to make 100% sure that your classifier is as good as the test results suggest (that's your validation data). But the obvious reason why your validation is bad and your test is good would be that those two datasets differ from each other. So the question is what kind of data you have and, even more important: what is "real data" and why do you call it real? – Florian H Oct 18 '17 at 08:01
  • You are right. It seems the validation data is different from the training and testing data. – Yahya Oct 19 '17 at 08:26

If you have little data, I would suggest trying a different algorithm; neural networks generally need a lot of data to get the weights right. Also, your real data doesn't seem to come from the same distribution as the train and test data. Don't keep anything hidden: shuffle everything and use train/validation/test splits.
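
As a rough sketch of what a simpler baseline could look like, assuming the same X_train/X_test split from your question plus the (X_real, y_real) data (RandomForestRegressor is just one possible choice):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# fit a non-neural baseline on the same training data
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# compare the error on the random test split and on the separate "real" data
print(mean_squared_error(y_test, rf.predict(X_test)))
print(mean_squared_error(y_real, rf.predict(X_real)))

If the gap between the two errors stays large even for a simple model, the problem is more likely a difference in distributions than overfitting of the network.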

pissall