I want to do transfer learning with an SSD + MobileNetV2 model on my own images. I have only one class. The images were downloaded from the Open Images Dataset. I am using TensorFlow's Object Detection API, but training gets stuck at step = 0.

I verified that the TFRecord was created correctly, since I can use the same data to train faster_rcnn with the Object Detection API. I created my own config file based on one in the repo: ssd_mobilenet_v2_oid_v4.config.

I also tried starting from ssd_mobilenet_v2_coco_2018_03_29.tar.gz with the corresponding config file. The behavior is the same: training gets stuck at the same place.

####################
CONSOLE LOG:
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0416 16:30:39.198738 19792 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0416 16:30:39.632495 19792 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
I0416 16:30:48.724722 19792 basic_session_run_hooks.py:606] Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
2020-04-16 16:30:59.919297: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 16:31:00.964680: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-16 16:31:00.986098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
INFO:tensorflow:loss = 12.512502, step = 0
I0416 16:31:02.740392 19792 basic_session_run_hooks.py:262] loss = 12.512502, step = 0 [STUCK HERE]
Machavity
jackyvr

2 Answers

Are you sure it is stuck? Do you get any errors? During training, the TF OD API writes logs to an event file in the model directory, which can be opened with TensorBoard. Look in your model directory for an event file and check its timestamp to see whether it is still being updated.
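The timestamp check described above can be sketched in a few lines of Python. This is only a sketch under assumptions: the helper names `event_file_ages` and `looks_stalled` are hypothetical, the 600-second idle threshold is an arbitrary choice, and the code relies on TensorFlow naming its event files `events.out.tfevents.*`. Summaries are written only every so many steps, so a slow run can look idle for a while without actually being stuck.

```python
import glob
import os
import time


def event_file_ages(model_dir, now=None):
    """Return (path, seconds_since_last_write) pairs, most recently written first."""
    now = time.time() if now is None else now
    # glob.escape protects model_dir in case the path contains glob characters.
    pattern = os.path.join(glob.escape(model_dir), "events.out.tfevents.*")
    ages = [(path, now - os.path.getmtime(path)) for path in glob.glob(pattern)]
    return sorted(ages, key=lambda item: item[1])


def looks_stalled(model_dir, max_idle_seconds=600):
    """True if there is no event file, or the newest one is older than the threshold.

    max_idle_seconds is an assumed threshold, not a value from the OD API.
    """
    ages = event_file_ages(model_dir)
    return not ages or ages[0][1] > max_idle_seconds
```

Calling `looks_stalled` on the model directory a few minutes apart gives a rough signal; an event file whose modification time never advances is consistent with what the asker later observed in TensorBoard.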

Tamir Tapuhi
  • Thanks @Tamir Tapuhi for pointing me to TensorBoard and the event file. I verified that the event file didn't get updated. Looking at the graphs in TensorBoard, only the dot at step=0 is there. Any other suggestions? Thanks! – jackyvr Apr 20 '20 at 22:37
  • Sounds weird. 1. How much time did you give the process before you decided it was stuck? 2. What is your batch size? 3. Try running htop and check the memory and CPU consumption. – Tamir Tapuhi Apr 21 '20 at 18:45
  • I found out that the combination of the TF 1.15 GPU version and my setup causes the problem. Downgrading to TF 1.14 solves the issue. It is a common and open issue on TensorFlow: https://github.com/tensorflow/models/issues/7640 Thanks a lot for helping out! – jackyvr Apr 22 '20 at 04:43

I found out that the combination of the TF 1.15 GPU version and my setup causes the problem: "Invoking ptxas not supported on Windows". Downgrading to TF 1.14 GPU or using TF 1.15 CPU solves the issue. It is a common and open issue on TensorFlow: https://github.com/tensorflow/models/issues/7640
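To check whether another machine matches the combination reported here, a small helper like the following may be useful. It is a sketch only: `matches_reported_setup` is a hypothetical name, and the rule it encodes (Windows, GPU build, TF 1.15) is taken solely from this thread, not from a confirmed list of affected configurations.

```python
def matches_reported_setup(tf_version, system, built_with_cuda):
    """True if the environment matches the combination reported in this thread.

    Assumed inputs (gather them on the machine in question):
      tf_version      -> tf.__version__, e.g. "1.15.0"
      system          -> platform.system(), e.g. "Windows"
      built_with_cuda -> tf.test.is_built_with_cuda()
    """
    # Compare only major.minor so that patch releases such as 1.15.2 also match.
    major_minor = tuple(int(part) for part in tf_version.split(".")[:2])
    return bool(system == "Windows" and built_with_cuda and major_minor == (1, 15))
```

If this returns True and training hangs at step 0, trying the downgrade to TF 1.14 GPU described above is a reasonable next step.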

jackyvr