
I set

tf.config.experimental.set_device_policy('warn')

and when I call the fit() function, training can get stuck at a random step and epoch while printing this warning at every step:

W tensorflow/core/common_runtime/eager/execute.cc:169] before computing Shape input #0 was expected to be on /job:localhost/replica:0/task:0/device:GPU:0 but is actually on /job:localhost/replica:0/task:0/device:CPU:0 (operation running on /job:localhost/replica:0/task:0/device:GPU:0). This triggers a copy which can be a performance bottleneck.

tf.compat.v1.disable_eager_execution()

usually works around the problem, but it removes some functionality that I need. Is there a way to fix this without disabling eager execution?
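
One idea I have been looking at (not sure it actually addresses the warning) is prefetching the batches directly onto the GPU, so the inputs are already on the device when ops like Shape run. A rough sketch of the transformation order, using a dummy dataset instead of my real pipeline and assuming the DirectML device is exposed as '/GPU:0':

import tensorflow as tf

# Dummy stand-in for the real image pipeline, just to show where the
# transformation would go; prefetch_to_device has to be the last step,
# so it replaces the plain .prefetch(AUTOTUNE) call.
ds = tf.data.Dataset.from_tensor_slices(tf.zeros([8, 128, 128, 3])).batch(4)
ds = ds.apply(tf.data.experimental.prefetch_to_device('/GPU:0'))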

EDIT 1: It seems that disabling eager execution doesn't help either.

EDIT 2

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, callbacks

path = 'Dogs/'

train_data = keras.utils.image_dataset_from_directory(path, subset= 'training', validation_split= 0.2, seed= 478, label_mode= 'categorical', batch_size= 32)
test_data = keras.utils.image_dataset_from_directory(path, subset= 'validation', validation_split= 0.2, seed= 478, label_mode= 'categorical', batch_size= 32)

def standartize_img(image, label):
   image = image / 255.0
   return image, label

AUTOTUNE = tf.data.AUTOTUNE
train_data = train_data.map(standartize_img).cache().prefetch(AUTOTUNE)
test_data = test_data.map(standartize_img).cache().prefetch(AUTOTUNE)

model = keras.Sequential([
    layers.Resizing(128, 128, interpolation= 'nearest'),
    layers.Conv2D(filters= 32, activation= 'relu', padding= 'same', strides= 1, kernel_size= 3),
    layers.MaxPooling2D(),

    layers.Conv2D(filters= 64, activation= 'relu', padding= 'same', strides= 1, kernel_size= 3),
    layers.Conv2D(filters= 64, activation= 'relu', padding= 'same', strides= 1, kernel_size= 3),
    layers.MaxPooling2D(),

    layers.Conv2D(filters= 128, activation= 'relu', padding= 'same', strides= 1, kernel_size= 3),
    layers.Conv2D(filters= 128, activation= 'relu', padding= 'same', strides= 1, kernel_size= 3),
    layers.MaxPooling2D(),

    layers.Flatten(),
    layers.Dense(32, activation= 'relu', kernel_regularizer= keras.regularizers.L2(0.001)),
    layers.Dropout(0.4),
    layers.Dense(10, activation= 'softmax')
])

model.compile(optimizer= 'adam', loss= keras.losses.CategoricalCrossentropy(), metrics= ['categorical_crossentropy','accuracy'])

early_stop = callbacks.EarlyStopping(min_delta= 0.001, patience= 15, restore_best_weights= True, monitor= 'val_categorical_crossentropy')

history= model.fit(train_data, validation_data= test_data, epochs= 300, callbacks= [early_stop])
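
Something I have considered but not tried yet: moving the / 255.0 normalization into the model as a Rescaling layer instead of doing it in the dataset map, so that step runs on the same device as the rest of the model. A trimmed-down sketch (not my full architecture, and assuming a TF version where layers.Rescaling is available):

from tensorflow import keras
from tensorflow.keras import layers

# In this variant the standartize_img map would be dropped from the
# dataset pipeline and the scaling happens inside the model instead.
model_alt = keras.Sequential([
    layers.Resizing(128, 128, interpolation='nearest'),
    layers.Rescaling(1.0 / 255),
    layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])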

EDIT 3 - Logs: https://github.com/Delinester/logs
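
These logs were captured roughly like this, with device placement logging switched on before the model is built (as suggested in the comments below):

import tensorflow as tf

# Prints the device each op is placed on; needs to run before any ops or
# models are created, then the script continues with model.fit() as in EDIT 2.
tf.debugging.set_log_device_placement(True)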

EDIT 4 - PC Specs:

Windows 10

Intel Core i3-8100

AMD Radeon RX580 (Running with Tensorflow DirectML plugin)

16 GB of RAM
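
The warning above shows ops being placed on GPU:0, so the plugin does seem to register the card; a quick sanity-check sketch for confirming the RX 580 shows up as a GPU device:

import tensorflow as tf

# Sanity check that the DirectML GPU is registered with TensorFlow.
print(tf.config.list_physical_devices('GPU'))
print(tf.config.list_logical_devices('GPU'))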

  • How big is the data you are trying to fit on? Maybe the GPU does not have enough RAM to fit all the data? – Plagon Jan 31 '23 at 08:30
  • @Plagon There are around 2000 images. My GPU has 8 GB of RAM – Delinester Jan 31 '23 at 08:38
  • Can you include the code on how you initialize your GPU/CUDA and how you load your images to the GPU? – Jason Chia Jan 31 '23 at 09:55
  • @JasonChia I edited and added the code I use – Delinester Jan 31 '23 at 10:12
  • Have a suspicion that your images are actually not loaded to your GPU tensor whereas your model is... Can you add the debug logs from using tf.debugging.set_log_device_placement(True)? – Jason Chia Jan 31 '23 at 10:22
  • @JasonChia added output after executing model.fit() - https://github.com/Delinester/logs – Delinester Jan 31 '23 at 10:49
  • Sorry I can't contribute more to this. The problem doesn't seem to be replicable on my end. You get some interesting NUMA warnings but this points nowhere. Could be that the execution is just slow? There were no error traces right? I think the warning can be ignored but try seeing if the model is too complicated. Use a smaller dummy model and see if the problem persists. Also your system and environment specs would probably be helpful to those trying to help. – Jason Chia Jan 31 '23 at 14:47
  • @JasonChia Yes, there weren't any errors. I will try experimenting further with models, thank you for your time – Delinester Jan 31 '23 at 18:10
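
For reference, the kind of smaller dummy model mentioned in the last comment that I plan to test with (untested sketch, meant to be fit on the same train_data/test_data pipeline from EDIT 2):

from tensorflow import keras
from tensorflow.keras import layers

# Minimal model to check whether the stalls depend on model size.
dummy = keras.Sequential([
    layers.Resizing(64, 64, interpolation='nearest'),
    layers.Conv2D(8, kernel_size=3, padding='same', activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax')
])
dummy.compile(optimizer='adam',
              loss=keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])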
