
without gpus

I managed to start the training process for a simple object detection model, using:

  • tensorflow 2
  • model_main_tf2.py
  • pipeline.config: BATCH_SIZE=16 NUM_CLASSES=1 NUM_STEPS=200000
  • model_main_tf2.py parameters: NUM_WORKERS=1
  • model: ssd_mobilenet_v2_320x320_coco17_tpu-8

In the log I saw something like this:

INFO:tensorflow:Step 100 per-step time 1.609s
INFO:tensorflow:Step 200 per-step time 1.258s
INFO:tensorflow:Step 300 per-step time 1.253s
...
INFO:tensorflow:Step 800 per-step time 1.257s
INFO:tensorflow:Step 900 per-step time 1.253s
INFO:tensorflow:Step 1000 per-step time 1.252s
INFO:tensorflow:Step 1100 per-step time 1.247s
INFO:tensorflow:Step 1200 per-step time 1.252s
...
INFO:tensorflow:Step 4300 per-step time 1.248s
INFO:tensorflow:Step 4400 per-step time 1.271s
...
INFO:tensorflow:Step 4800 per-step time 1.258s
INFO:tensorflow:Step 4900 per-step time 1.259s
INFO:tensorflow:Step 5000 per-step time 1.251s
...

Also, each entry of the previous log shows several parameters:

I0119 02:50:17.511021 140379542857536 model_lib_v2.py:708] {
 'Loss/classification_loss': 0.1677758,
 'Loss/localization_loss': 0.0800046,
 'Loss/regularization_loss': 0.15408026,
 'Loss/total_loss': 0.40186065,
 'learning_rate': 0.79281646
}

After 12 hours it still had not finished. The maximum step reached was 10000.

Since I'm not using GPUs, each training step is slow and I don't know when the run will end. At least it keeps running; I tried the same code with other models and parameters and it ended abruptly.
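If per-step time means seconds per training step (my assumption, since that is exactly what I am asking), a back-of-the-envelope estimate for the whole CPU run would be:

```python
# Rough estimate of total training time, assuming "per-step time" in the log
# really is seconds per step. Numbers come from my run: ~1.25 s/step, num_steps=200000.

def estimate_hours(per_step_seconds, num_steps):
    """Total wall-clock hours if every step takes per_step_seconds."""
    return per_step_seconds * num_steps / 3600

# ~1.25 s/step * 200000 steps -> about 69 hours on CPU
print(round(estimate_hours(1.25, 200_000), 1))  # 69.4
```

That would explain why 12 hours was nowhere near enough, if the assumption is right.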

with gpus

I was able to find a cloud machine with this hardware:

Tesla V100
2V100.10V
2x NVidia Tesla V100
10 CPU
45GB RAM
32GB GPU RAM

With CUDA and cuDNN successfully installed, and these versions:

python version: 3.8.10
tensorflow version: 2.7.0
gpu_device_name: /device:GPU:0
Num GPUs Available:  2
Num GPUs Available(exp):  2

And I ran my training with these parameters:

  • tensorflow 2
  • model_main_tf2.py
  • pipeline.config: BATCH_SIZE=16 NUM_CLASSES=1 NUM_STEPS=200000
  • model_main_tf2.py parameters: NUM_WORKERS=2
  • model: ssd_mobilenet_v2_320x320_coco17_tpu-8

The start time was 10:30:02 and here are my annotations:

10:38 3800 step
10:40 4500 step
10:43 6300 step
10:46 7700 step
10:52 11000 step
11:00 15100 step
11:13 22500 step
11:28 30700 step
12:00 48000 step
12:15 54400 step

And this is the log at step 54400:

I0129 12:15:10.217693 139823678080832 model_lib_v2.py:708] {'Loss/classification_loss': 0.28390533,
 'Loss/localization_loss': 0.26865262,
 'Loss/regularization_loss': 57.277615,
 'Loss/total_loss': 57.83017,
 'learning_rate': 0.0}

As we can see, with GPUs step 48000 was reached in just 1.5 hours. But it still doesn't finish. In the cloud every second is billed, so I need to know how much longer I have to wait.
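From my annotations I can compute a rough throughput and, assuming the speed stays constant (an assumption; steps/minute clearly varied during the run), estimate how long is left:

```python
# Estimate remaining minutes from two data points of my annotations:
# start 10:30, and step 54400 reached at 12:15 (~105 minutes elapsed).
# Assumes constant throughput, which is only approximately true.

def eta_minutes(steps_done, minutes_elapsed, total_steps):
    """Minutes remaining to reach total_steps at the observed average rate."""
    steps_per_minute = steps_done / minutes_elapsed
    return (total_steps - steps_done) / steps_per_minute

# 54400 steps in ~105 min -> ~518 steps/min -> ~281 min (~4.7 h) left to 200000
print(round(eta_minutes(54_400, 105, 200_000)))  # 281
```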

Question

I don't know whether the log entry INFO:tensorflow:Step xyz per-step time 1.251s is related to the elapsed time of the training process, the num_steps parameter, or something else.

How can I know whether a TensorFlow 2 training run is going forward, is stuck, or will finish?

Research:

  • There's generally an objective function associated with a training model. For example, in a classification problem categorical loss is a possible objective function. The concepts of going forward, stuck, etc. are generally defined in terms of such a function. If the value is improving, e.g., loss is lowering with each passing epoch, we could say it is "going forward". In the end this is fairly arbitrary and can be set to anything you want. – MYousefi Jan 19 '22 at 03:50
  • Thanks @MYousefi. I added several loss parameters shown in the log. Do you know which one is related to the loss lowering that you mentioned? – JRichardsz Jan 19 '22 at 04:29
  • This should be explained/found somewhere in the model. Is there documentation on it? My first guess would be `Loss/total_loss` – MYousefi Jan 19 '22 at 04:43
  • @JRichardsz I think such a question is more fit on [tf-forum](https://discuss.tensorflow.org/). – Innat Jan 19 '22 at 04:54

1 Answer


If you are using the TFOD (TensorFlow Object Detection) API, you can limit the total number of steps from the config file. If you are using SSD MobileNet trained on the COCO dataset, the config file should have a name like ssd_mobilenet_v2_coco.config. There you will find this option:

Note: The below line limits the training process to 200K steps, which we empirically found to be sufficient enough to train the pets dataset. This effectively bypasses the learning rate schedule (the learning rate will never decay). Remove the below line to train indefinitely.

num_steps: 200000 # change it accordingly.
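If you want to change that value programmatically rather than by hand, a minimal sketch is a plain-text substitution on the pipeline file (the TFOD API also ships config utilities for this; the regex below only assumes the standard `num_steps: N` line format):

```python
import re

def set_num_steps(config_text, num_steps):
    """Replace the num_steps value in a pipeline.config's text (sketch)."""
    return re.sub(r"num_steps:\s*\d+", f"num_steps: {num_steps}", config_text)

# Hypothetical fragment of a pipeline.config
cfg = "train_config {\n  batch_size: 16\n  num_steps: 200000\n}\n"
print(set_num_steps(cfg, 50_000))  # num_steps is now 50000
```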

You may also try Top-K-Categorical-Accuracy, which computes how often the target is among the top K predictions (I am not quite sure whether there is an existing API for it in TFOD, as I didn't use it there). This will give you more insight into whether it is still worth training further. Normally, when we say accuracy we mean Top-1-Accuracy: we only consider a prediction correct if the correct class is predicted with the highest probability. But in situations like the one you are facing, it is important to judge even the slightest change in the model's performance.

Say you are working on a classification problem with 10 classes. After completing 100 epochs you get the following predictions on the validation set (I am showing the predictions sorted by their softmax probabilities):

Input Number    Predictions                        Top Prediction   Correct Label
0               (Class_0, Class_2, Class_1, ...)      Class_0         Class_0
1               (Class_1, Class_4, Class_2, ...)      Class_1         Class_2
2               (Class_1, Class_5, Class_2, ...)      Class_1         Class_1
3               (Class_7, Class_3, Class_1, ...)      Class_7         Class_3
....
....
....

Say after 150 epochs the predictions on the validation set look like this:

Input Number    Predictions                        Top Prediction   Correct Label
0               (Class_0, Class_2, Class_1, ...)      Class_0         Class_0
1               (Class_1, Class_2, Class_4, ...)      Class_1         Class_2
2               (Class_1, Class_5, Class_2, ...)      Class_1         Class_1
3               (Class_7, Class_3, Class_1, ...)      Class_7         Class_3
....
....
....

If you look carefully at example 1, where the correct label is Class_2, the correct class has moved from third to second position in the prediction column. In both cases the top-most prediction is still the same, so the ordinary accuracy calculation gives the same value and you would feel no progress is happening. But with Top-K-Categorical-Accuracy (typically k = 5) you can tell whether the time you spend on training is bringing any meaningful progress toward pulling the correct class into the first position.
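As a sketch of the idea in plain Python (in Keras the built-in `tf.keras.metrics.TopKCategoricalAccuracy` covers the same metric; the lists below are just the four validation examples from the tables above, truncated to their top 2):

```python
def top_k_accuracy(ranked_predictions, correct_labels, k=5):
    """Fraction of samples whose correct label appears in the top-k ranked predictions."""
    hits = sum(1 for preds, label in zip(ranked_predictions, correct_labels)
               if label in preds[:k])
    return hits / len(correct_labels)

# Epoch-100 predictions (first table) vs epoch-150 (second table)
preds_100 = [["Class_0", "Class_2"], ["Class_1", "Class_4"],
             ["Class_1", "Class_5"], ["Class_7", "Class_3"]]
preds_150 = [["Class_0", "Class_2"], ["Class_1", "Class_2"],
             ["Class_1", "Class_5"], ["Class_7", "Class_3"]]
labels = ["Class_0", "Class_2", "Class_1", "Class_3"]

print(top_k_accuracy(preds_100, labels, k=2))  # 0.75
print(top_k_accuracy(preds_150, labels, k=2))  # 1.0 -- progress that top-1 cannot see
```

Top-1 accuracy is 0.5 at both epochs, so only the top-k view reveals that training is still moving forward.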
