Without GPUs
I managed to start the training process for a simple object detection model, using:
- tensorflow 2
- model_main_tf2.py
- pipeline.config:
BATCH_SIZE=16 NUM_CLASSES=1 NUM_STEPS=200000
- model_main_tf2.py parameters:
NUM_WORKERS=1
- model: ssd_mobilenet_v2_320x320_coco17_tpu-8
In the log I saw something like this:
INFO:tensorflow:Step 100 per-step time 1.609s
INFO:tensorflow:Step 200 per-step time 1.258s
INFO:tensorflow:Step 300 per-step time 1.253s
...
INFO:tensorflow:Step 800 per-step time 1.257s
INFO:tensorflow:Step 900 per-step time 1.253s
INFO:tensorflow:Step 1000 per-step time 1.252s
INFO:tensorflow:Step 1100 per-step time 1.247s
INFO:tensorflow:Step 1200 per-step time 1.252s
...
INFO:tensorflow:Step 4300 per-step time 1.248s
INFO:tensorflow:Step 4400 per-step time 1.271s
...
INFO:tensorflow:Step 4800 per-step time 1.258s
INFO:tensorflow:Step 4900 per-step time 1.259s
INFO:tensorflow:Step 5000 per-step time 1.251s
...
Each entry in the previous log also shows several loss values:
I0119 02:50:17.511021 140379542857536 model_lib_v2.py:708] {
'Loss/classification_loss': 0.1677758,
'Loss/localization_loss': 0.0800046,
'Loss/regularization_loss': 0.15408026,
'Loss/total_loss': 0.40186065,
'learning_rate': 0.79281646
}
After 12 hours it still had not finished; the maximum step reached was 10000.
Since I'm not using GPUs, each training step is slow and I don't know when it will end. At least it is running. I tried the same code with other models and parameters and it ended abruptly.
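To get an idea of how long it might take, I made a rough back-of-the-envelope estimate (my own sketch, assuming the per-step time stays constant at the ~1.25 s shown in the log, which may not hold):

    # Naive extrapolation: assumes per-step time stays constant at ~1.25 s,
    # which is roughly what the log shows after the first few hundred steps.
    num_steps = 200_000      # NUM_STEPS from pipeline.config
    current_step = 10_000    # last step reached after ~12 hours
    per_step_time = 1.25     # seconds, from the "per-step time" log lines

    remaining_seconds = (num_steps - current_step) * per_step_time
    print(f"Estimated remaining time: {remaining_seconds / 3600:.1f} hours")
    # prints roughly 66 hours, i.e. almost 3 more days on CPU, if the estimate holds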
With GPUs
I was able to find a cloud machine with this hardware:
Tesla V100
2V100.10V
2x NVIDIA Tesla V100
10 CPUs
45 GB RAM
32 GB GPU RAM
With CUDA and cuDNN successfully installed, and these versions (see the snippet after the list for how I printed them):
python version: 3.8.10
tensorflow version: 2.7.0
gpu_device_name: /device:GPU:0
Num GPUs Available: 2
Num GPUs Available(exp): 2
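The values above come from a small check script I ran, roughly like this (my own snippet, not part of model_main_tf2.py):

    import sys
    import tensorflow as tf

    print("python version:", sys.version.split()[0])
    print("tensorflow version:", tf.__version__)
    print("gpu_device_name:", tf.test.gpu_device_name())
    print("Num GPUs Available:", len(tf.config.list_physical_devices('GPU')))
    print("Num GPUs Available(exp):", len(tf.config.experimental.list_physical_devices('GPU')))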
And I ran my training with these parameters:
- tensorflow 2
- model_main_tf2.py
- pipeline.config:
BATCH_SIZE=16 NUM_CLASSES=1 NUM_STEPS=200000
- model_main_tf2.py parameters:
NUM_WORKERS=2
- model: ssd_mobilenet_v2_320x320_coco17_tpu-8
The start time was 10:30:02, and here are my annotations:
10:38 3800 step
10:40 4500 step
10:43 6300 step
10:46 7700 step
10:52 11000 step
11:00 15100 step
11:13 22500 step
11:28 30700 step
12:00 48000 step
12:15 54400 step
And this is the log at step 54400:
I0129 12:15:10.217693 139823678080832 model_lib_v2.py:708] {'Loss/classification_loss': 0.28390533,
'Loss/localization_loss': 0.26865262,
'Loss/regularization_loss': 57.277615,
'Loss/total_loss': 57.83017,
'learning_rate': 0.0}
As you can see, with GPUs step 48000 was reached in just 1.5 hours, but it still never finishes. In the cloud every second is billed, so I need to know how much longer I have to wait.
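I tried the same kind of rough extrapolation from my annotations (again my own sketch; it assumes the throughput stays constant, which I am not sure is true):

    from datetime import datetime

    # Throughput estimated from the start time and my last annotation above;
    # assumes steps/second stays constant, which may not hold.
    start = datetime.strptime("10:30:02", "%H:%M:%S")
    last = datetime.strptime("12:15:00", "%H:%M:%S")
    elapsed = (last - start).total_seconds()

    steps_done = 54_400
    num_steps = 200_000

    steps_per_second = steps_done / elapsed
    remaining_seconds = (num_steps - steps_done) / steps_per_second
    print(f"~{steps_per_second:.1f} steps/s, "
          f"estimated remaining time: {remaining_seconds / 3600:.1f} hours")
    # prints roughly 8.6 steps/s and about 4.7 more hours, if the extrapolation holds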
Question
I don't know whether the log entry INFO:tensorflow:Step xyz per-step time 1.251s is related to the elapsed time of the training process, to the num_steps parameter, or to something else.
How can I tell whether TensorFlow 2 training is progressing, is stuck, or when it will finish?
Research:
- What is the difference between steps and epochs in TensorFlow?
- https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/EarlyStopping
- https://www.quora.com/How-long-does-it-take-to-train-deep-neural-networks-Would-it-be-feasible-for-an-individual-to-replicate-the-performance-of-deep-neural-networks-on-the-MNIST-dataset
- https://github.com/ibab/tensorflow-wavenet/issues/99