
I am trying to train a model on 6000 training images and 1000 validation images, but the program freezes during training without any error message:

1970/6000 [========>.....................] - ETA: 1:50:11 - loss: 1.2256 - accuracy: 0.5956
1971/6000 [========>.....................] - ETA: 1:50:08 - loss: 1.2252 - accuracy: 0.5958
1972/6000 [========>.....................] - ETA: 1:50:08 - loss: 1.2248 - accuracy: 0.5960
1973/6000 [========>.....................] - ETA: 1:50:06 - loss: 1.2245 - accuracy: 0.5962
1974/6000 [========>.....................] - ETA: 1:50:04 - loss: 1.2241 - accuracy: 0.5964
1975/6000 [========>.....................] - ETA: 1:50:02 - loss: 1.2243 - accuracy: 0.5961
1976/6000 [========>.....................] - ETA: 1:50:00 - loss: 1.2239 - accuracy: 0.5963
1977/6000 [========>.....................] - ETA: 1:49:58 - loss: 1.2236 - accuracy: 0.5965
1978/6000 [========>.....................] - ETA: 1:49:57 - loss: 1.2241 - accuracy: 0.5962
1979/6000 [========>.....................] - ETA: 1:49:56 - loss: 1.2237 - accuracy: 0.5964
1980/6000 [========>.....................] - ETA: 1:49:55 - loss: 1.2242 - accuracy: 0.5961
1981/6000 [========>.....................] - ETA: 1:49:53 - loss: 1.2252 - accuracy: 0.5958
1982/6000 [========>.....................] - ETA: 1:49:52 - loss: 1.2257 - accuracy: 0.5955

I waited 5-6 minutes, but nothing happened. I tried the following:

  1. Changing steps_per_epoch to 100 and increasing the number of epochs to 20.
  2. Suspecting ReduceLROnPlateau, adding cooldown=1.

Neither change solved the problem.

Hardware configuration:

  • Intel Core i5-8300H
  • GTX 1060 6GB

Dependencies:

  1. Keras 2.3.1
  2. TensorFlow 2.0.0 (GPU version)

The code is provided below:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras
import tensorflow as tf
from skimage import exposure, color
from keras.optimizers import Adam
from tqdm import tqdm
from keras.models import Model
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D,Convolution2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint, Callback
from keras import regularizers
from keras.applications.densenet import DenseNet121
from keras_preprocessing.image import ImageDataGenerator
from sklearn.utils import class_weight
from collections import Counter

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth=True
session = tf.compat.v1.Session(config=config)


# Histogram equalization
def HE(img):
    img_eq = exposure.equalize_hist(img)
    return img_eq



def plotImages(images_arr):
    fig, axes = plt.subplots(1, 5, figsize=(20,20))
    axes = axes.flatten()
    for img, ax in zip( images_arr, axes):
        ax.imshow(img)
        ax.axis('off')
    plt.tight_layout()
    plt.show()

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest',
    preprocessing_function=HE,
)

validation_datagen = ImageDataGenerator(
    rescale=1./255
)
test_datagen = ImageDataGenerator(
    rescale=1./255
)

#get image and label with augmentation
train = train_datagen.flow_from_directory(
    'train/train_deep/',
    target_size=(224,224),
    class_mode='categorical',
    shuffle=False,
    batch_size = 20,
)

test = test_datagen.flow_from_directory(
    'test_deep/',
    batch_size=1,
    target_size = (224,224),

)

val = validation_datagen.flow_from_directory(
    'train/validate_deep/',
    target_size=(224,224),
    batch_size = 20,
)
#Training
X_train, y_train = next(train)
class_names = ['No DR', 'Mild', 'Moderate', 'Severe', 'Proliferative DR']
counter = Counter(train.classes)
class_weights = class_weight.compute_class_weight(
               'balanced',
                np.unique(train.classes),
                train.classes)

#X_test , y_test = next(test)
#X_test=np.reshape(X_test,(X_test.shape[0],X_test.shape[1],X_test.shape[2]))
#Training parameter
batch_size =32
Epoch = 2


model = DenseNet121(include_top=True, weights=None, input_tensor=None, input_shape=(224,224,3), pooling=None, classes=5)
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=0.01),
              metrics=['accuracy'])
model.summary()
filepath="weights-improvement-{epoch:02d}-{val_loss:.2f}.hdf5"
checkpointer = ModelCheckpoint(filepath,monitor='val_loss', verbose=1, save_best_only=True,save_weights_only=True)
lr_reduction = ReduceLROnPlateau(monitor='val_loss', patience=5, verbose=2, factor=0.2,cooldown=1)
callbacks_list = [checkpointer, lr_reduction]
#Validation
X_val , y_val = next(val)

#history = model.fit(X_train,y_train,epochs=Epoch,validation_data = (X_val,y_val))

history = model.fit_generator(
    train,
    epochs=Epoch,
    steps_per_epoch=6000,
    class_weight=class_weights,
    validation_data=val,
    validation_steps=1000,
    use_multiprocessing = False,
    max_queue_size=100,
    workers = 1,
    callbacks=callbacks_list
)


# Score trained model.
scores = model.evaluate(X_val, y_val, verbose=1)
print('Test loss:', scores[0])
print('Test accuracy:', scores[1])


#predict
test.reset()
pred = model.predict_generator(test, steps=25)

print(pred)
for i in pred:
    print(np.argmax(i))
Timbus Calin
Lusus

1 Answer


This code would work if you used Keras < 2.0.0 (though I do not recommend using old versions).

Your problem comes from the fact that you are using Keras > 2.0.0, or Keras inside TensorFlow.

Specifically, the problem is in these lines:

history = model.fit_generator( # change .fit_generator() to .fit()
    train,
    epochs=Epoch,
    steps_per_epoch=6000, # change this to 6000 // (generator batch size)
    class_weight=class_weights,
    validation_data=val,
    validation_steps=1000, # change this to 1000 // (generator batch size)
    use_multiprocessing = False,
    max_queue_size=100,
    workers = 1,
    callbacks=callbacks_list
)

The parameters steps_per_epoch and validation_steps have to equal the number of samples in the dataset divided by the batch size. The relevant batch size is the one passed to flow_from_directory (20 in the code above), not the unused batch_size = 32 variable.
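The arithmetic behind steps_per_epoch and validation_steps can be sketched in plain Python. This is a minimal illustration, and the function name steps_for is hypothetical. Rounding up (ceiling) covers every sample with a smaller last batch, which matches how len() is computed for a Keras DirectoryIterator; rounding down drops the leftover samples. The difference between the two explains seeing 188 versus 187 steps for 6000 samples with a batch size of 32.

```python
import math


def steps_for(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Hypothetical helper: number of batches per epoch.

    drop_last=False rounds up, so the final batch may be partial and every
    sample is seen once per epoch (same result as
    (n + batch_size - 1) // batch_size).
    drop_last=True rounds down and skips the leftover samples.
    """
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


if __name__ == "__main__":
    print(steps_for(6000, 20))                  # batch size used by the question's generators
    print(steps_for(6000, 32))                  # rounds 187.5 up
    print(steps_for(6000, 32, drop_last=True))  # rounds 187.5 down
    print(steps_for(1000, 20))
```

With a generator object, len(train) and len(val) return the rounded-up value directly, so there is no need to hard-code the sample counts.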

Timbus Calin
  • I changed steps_per_epoch and validation_steps to equal the dataset length divided by the batch size, but it froze again – Lusus Nov 25 '19 at 09:19
  • Okay... strange. What if you reduce the max_queue_size to 20 for example? – Timbus Calin Nov 25 '19 at 09:28
  • Also, where does it freeze again? Does the progress bar show you that you are doing int(6000/32) steps in the training phase? – Timbus Calin Nov 25 '19 at 09:29
  • I changed the number of epochs to 10, and the progress bar shows step 135/187 in epoch 9/10 – Lusus Nov 25 '19 at 09:33
  • What version of Keras and TensorFlow do you use? Try updating to Keras 2.2.4 and tell me if the problem still persists. – Timbus Calin Nov 25 '19 at 09:42
  • TensorFlow 2.0.0 and Keras 2.3.1 (sorry for saying Keras 2.0) – Lusus Nov 25 '19 at 09:55
  • Please downgrade to TF 1.14 and change every Keras import to the tf.keras form, e.g. "from tensorflow.keras.layers import Conv2D". After you make this replacement everywhere, tell me whether it works. – Timbus Calin Nov 25 '19 at 09:57
  • It works, but I'm confused: `188/187 [==============================] - 387s 2s/step - loss: 1.7421 - acc: 0.2242 - val_loss: 299551.9195 - val_acc: 0.1375`. Why does it show 188/187, and why is val_loss so high and so different from loss? Also, when I predict, every output is 3. What should I do to get correct predictions? Increase the number of epochs? – Lusus Nov 25 '19 at 12:07
  • It shows 188 because you rounded up instead of down when dividing the dataset length by the batch size (try int(length / batch_size)). As for the dataset, that should be a separate question (it may be an imbalanced dataset, underfitting, needing to train for more epochs, etc.). Since downgrading to TF 1.14 solved your first issue, please mark the question as solved. – Timbus Calin Nov 25 '19 at 12:12
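On the possibly imbalanced dataset mentioned in the last comment: the question passes the NumPy array returned by scikit-learn's compute_class_weight as class_weight, while Keras documents class_weight as a dictionary mapping class indices to weights. The following is a minimal pure-Python sketch of the 'balanced' heuristic (n_samples / (n_classes * count_of_class), the formula scikit-learn documents) that returns such a dictionary. The function name balanced_class_weights and the toy labels are hypothetical. With scikit-learn's array, dict(enumerate(class_weights)) should give the same mapping when the classes are 0..n-1, as flow_from_directory assigns them.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Hypothetical helper: {class_index: weight} using the 'balanced' rule.

    weight(c) = n_samples / (n_classes * count(c)), so rarer classes
    get larger weights. labels can be any sequence of class indices,
    e.g. train.classes from a flow_from_directory generator.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


if __name__ == "__main__":
    toy_labels = [0, 0, 0, 1]  # hypothetical: class 0 is three times as common as class 1
    print(balanced_class_weights(toy_labels))
```

The resulting dictionary could then be passed as class_weight=... to fit. This does not address the other possible causes listed in the comment (underfitting, too few epochs).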