How to avoid overfitting on a simple feed forward network

Question

Using the pima indians diabetes dataset I'm trying to build an accurate model using Keras. I've written the following code:

# TensorBoard callback for logging the training run
from keras import callbacks
from keras.layers import Dropout

tb = callbacks.TensorBoard(log_dir='/.logs', histogram_freq=10, batch_size=32,
                           write_graph=True, write_grads=True, write_images=False,
                           embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None)
# Visualize training history
from keras.models import Sequential
from keras.layers import Dense
import matplotlib.pyplot as plt
import numpy

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:, 0:8]
Y = dataset[:, 8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', activation='relu', name='first_input'))
model.add(Dense(500, activation='tanh', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

# Compile model
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

# Fit the model
history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb])
# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

After several tries, I added dropout layers to avoid overfitting, but with no luck. The following graph shows that the validation loss and the training loss diverge at one point.

What else could I do to optimize this network?

UPDATE: based on the comments I got I've tweaked the code like so:

from keras import regularizers

model = Sequential()
model.add(Dense(12, input_dim=8, kernel_initializer='uniform', kernel_regularizer=regularizers.l2(0.01),
                activity_regularizer=regularizers.l1(0.01), activation='relu',
                name='first_input'))  # added regularizers
model.add(Dense(8, activation='relu', name='first_hidden'))  # reduced to 8 neurons
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(5, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

Here are the graphs for 500 epochs

Did my solution work for you? Let me know if you need any more help. — CoolPenguin, Jul 04 '17 at 19:02
The problem seems to be solved: you're not really overfitting anymore. It's just that your model isn't learning as much as you'd like it to. There are a couple of things you can do to fix that: decrease the regularization and dropout a little and find the sweet spot, or try adjusting your learning rate, e.g. decaying it exponentially. — CoolPenguin, Jul 05 '17 at 10:34
And your dataset is pretty small, so I'm not even sure what accuracy it's possible to get without overfitting. — CoolPenguin, Jul 05 '17 at 10:37
The highest accuracy is 80.21 according to this paper http://www.yildiz.edu.tr/~tulay/publications/Icann-Iconip2003-2.pdf — Shlomi Schwartz, Jul 05 '17 at 19:58
Test data better than training data??? That sounds really weird. — Daniel Möller, Jul 07 '17 at 21:45
It's obvious that something is wrong... But I'm not sure what — Shlomi Schwartz, Jul 08 '17 at 06:12
Test being better than training is absolutely common when using dropout. — P-Gn, Jul 08 '17 at 08:34
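
For reference, a minimal sketch of the exponential learning-rate decay suggested in the comments above. The function name, the decay rate of 0.96 and the 10-epoch decay interval are illustrative assumptions, not values from this thread; 0.001 is assumed as the starting rate because it is the usual default for rmsprop in Keras.

```python
def exponential_decay(epoch, current_lr=None, *, initial_lr=0.001,
                      decay_rate=0.96, decay_epochs=10):
    """Return the learning rate to use at a given 0-based epoch.

    current_lr is accepted but ignored, so the function can be called
    either as f(epoch) or as f(epoch, lr).
    All default values here are illustrative assumptions.
    """
    return initial_lr * decay_rate ** (epoch / decay_epochs)


# Print the schedule at a few epochs to see how quickly it shrinks
for epoch in (0, 10, 100, 500):
    print(epoch, exponential_decay(epoch))
```

A function of this shape could be passed to the `keras.callbacks.LearningRateScheduler` callback and added to the `callbacks` list in `model.fit`. Whether decay helps on this dataset would have to be checked on the validation curves.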

Vijay Mariappan · Accepted Answer · 2017-07-08T14:40:09.057

The first example gave a validation accuracy above 75%, while the second gave an accuracy below 65%. If you compare the losses for epochs below 100, the loss is below 0.5 for the first one and above 0.6 for the second. So how is the second case better?

To me, the second one is a case of under-fitting: the model doesn't have enough capacity to learn. The first case over-fits only because its training was not stopped when over-fitting started (early stopping). If training had been stopped at, say, epoch 100, it would be by far the better of the two models.
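
One way to check where training should have stopped is to look up the epoch with the lowest validation loss in the recorded history. This is only a sketch: `best_epoch` is a hypothetical helper, and the numbers in `example` are made up for illustration. In practice you would pass `history.history` from `model.fit`.

```python
def best_epoch(history_dict, key="val_loss"):
    """Return (epoch_index, value) for the lowest value of the monitored metric."""
    values = history_dict[key]
    epoch = min(range(len(values)), key=values.__getitem__)
    return epoch, values[epoch]


# Made-up curves: training loss keeps falling while validation loss turns upward
example = {
    "loss": [0.70, 0.60, 0.50, 0.40, 0.30],
    "val_loss": [0.72, 0.61, 0.55, 0.58, 0.66],
}
epoch, value = best_epoch(example)
print(epoch, value)
```

The epoch where `val_loss` bottoms out while `loss` keeps falling is roughly where over-fitting begins.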

The goal should be a small prediction error on unseen data, and for that you increase the capacity of the network up to the point beyond which over-fitting starts to happen.

So how to avoid over-fitting in this particular case? Adopt early stopping.

CODE CHANGES: To include early stopping and input scaling.

from keras.callbacks import EarlyStopping
from sklearn.preprocessing import StandardScaler

# input scaling: standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(X)

# early stopping: halt once val_loss has not improved for 3 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=0, patience=3, verbose=1, mode='auto')

# create model - almost the same code
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu', name='first_input'))
model.add(Dense(500, activation='relu', name='first_hidden'))
model.add(Dropout(0.5, name='dropout_1'))
model.add(Dense(8, activation='relu', name='second_hidden'))
model.add(Dense(1, activation='sigmoid', name='output_layer'))

# compile as in the question (binary_crossentropy, rmsprop), then fit
history = model.fit(X, Y, validation_split=0.33, epochs=1000, batch_size=10, verbose=0, callbacks=[tb, early_stop])
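
A side note on the scaling step, as a hedged sketch rather than part of the original answer: as far as I know, `validation_split` takes the last fraction of the samples as validation data, so fitting the scaler on all of `X` lets validation statistics leak into the inputs. The pure-Python code below shows the same standardization fitted on the training portion only. The helper names and the toy rows are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Compute the per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply (value - mean) / std column by column."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def split_index(n_samples, validation_split):
    """Index of the first validation sample when the tail is held out."""
    return int(n_samples * (1.0 - validation_split))


# Toy data: the last row plays the role of the held-out validation sample
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 1000.0]]
split = split_index(len(rows), 0.25)
means, stds = fit_standardizer(rows[:split])  # statistics from training rows only
scaled = standardize(rows, means, stds)       # applied to every row
print(split, means, scaled[1])
```

With scikit-learn the equivalent would be `StandardScaler().fit(X[:split])` followed by `transform(X)`. On a dataset this small the difference is likely minor, but it keeps the validation score honest.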

The Accuracy and loss graphs:

That was my conclusion too, but I was wondering if I can do better than 75% — Shlomi Schwartz, Jul 08 '17 at 13:13
Added the code. It is similar to yours but with input scaling, and gives a score close to `85%`. Another thing: the default params in `keras` are set to the `best practices`, so unless you have a specific reason, it's better to leave them untouched. — Vijay Mariappan, Jul 08 '17 at 14:44

CoolPenguin · Answer 2 · 2017-07-04T15:24:04.283

First, try adding some regularization (https://keras.io/regularizers/) like with this code:

from keras import regularizers

# input_dim=8 matches the 8 input features of this dataset
model.add(Dense(12, input_dim=8,
            kernel_regularizer=regularizers.l2(0.01),
            activity_regularizer=regularizers.l1(0.01)))

Also, make sure to decrease your network size. You don't need a hidden layer of 500 neurons: try taking it out to reduce the representational power, and maybe remove another layer if it's still overfitting. Also, only use relu activations. Maybe also try increasing your dropout rate to something like 0.75 (although it's already high). You probably also don't need to run it for so many epochs, since it will begin to overfit after long enough.

Thanks, I'll try and pay back — Shlomi Schwartz, Jul 05 '17 at 04:25

score 2 · Answer 3 · answered Jul 04 '17 at 15:06

For a dataset like the Diabetes one you can use a much simpler network. Try reducing the number of neurons in your second layer. (Is there a specific reason why you chose tanh as the activation there?)

In addition, you can simply add an EarlyStopping callback to your training: https://keras.io/callbacks/
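
To make the `patience` idea behind such a callback concrete, here is a simplified pure-Python sketch of the stopping rule. It is not the Keras implementation: `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 0-based epoch at which training would stop, or None.

    Simplified rule: stop once the validation loss has failed to improve
    on the best value by more than min_delta for `patience` epochs in a row.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss  # new best value, so reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: best value at epoch 2, then three worse epochs
val_losses = [0.70, 0.62, 0.58, 0.59, 0.60, 0.61, 0.57]
print(early_stopping_epoch(val_losses))
```

Note that with `patience=3` the run stops before the late improvement at the last epoch is ever seen. A small patience can stop too early on a noisy validation curve, which is common with a dataset this small.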
