
I am using Ubuntu 16.04 with a GeForce 1080 GPU (8 GB of GPU memory).

I have properly created the TF-Record files and trained the model successfully. However, I still have two problems.

These are the steps I followed; please tell me what I am missing:

I used VOCdevkit and properly created two files: pascal_train.record and pascal_val.record

Then,

1- From this link, I took the raccoon images and placed them into the directory models/object_detection/VOCdevkit/VOC2012/JPEGImages (after deleting the previous images).

Then I took the raccoon annotations and placed them into the directory models/object_detection/VOCdevkit/VOC2012/Annotations (after deleting the previous ones).

2- I modified models/object_detection/data/pascal_label_map.pbtxt so that it contains a single class named 'raccoon'.
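For a single-class setup, the label map is usually just one item entry. A minimal sketch of what pascal_label_map.pbtxt would contain for this case (assuming 'raccoon' gets id 1, since id 0 is reserved for background):

```
item {
  id: 1
  name: 'raccoon'
}
```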

3- I used ssd_mobilenet_v1_pets.config and modified it: the number of classes is one, and rather than training from scratch I fine-tuned from ssd_mobilenet_v1_coco_11_06_2017/model.ckpt:

   fine_tune_checkpoint: "/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"

  from_detection_checkpoint: true
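For reference, the relevant parts of the modified pipeline config should look roughly like this. This is a sketch based on the usual layout of ssd_mobilenet_v1_pets.config; the exact placement of the fields may differ slightly in your copy:

```
model {
  ssd {
    num_classes: 1
    # ... remaining ssd settings unchanged ...
  }
}

train_config: {
  # ... batch size, optimizer, etc. ...
  fine_tune_checkpoint: "/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
}
```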

4- Following this link, I arranged my data structure like this:

  1. models
     1.1 model
         1.1.1 ssd_mobilenet_v1_pets.config
         1.1.2 train
         1.1.3 evaluation
         1.1.4 ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
     1.2 object_detection
         1.2.1 data (contains pascal_train.record, pascal_val.record, and pascal_label_map.pbtxt)
         1.2.2 VOCdevkit
               1.2.2.1 VOC2012
                       1.2.2.1.1 JPEGImages (my own images)
                       1.2.2.1.2 Annotations (raccoon annotations)
                       1.2.2.1.3 ImageSets
                                 1.2.2.1.3.1 Main (raccoon_train.txt, raccoon_val.txt, raccoon_train_val.txt)
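The raccoon_train.txt / raccoon_val.txt / raccoon_train_val.txt files in ImageSets/Main are plain lists of image base names (one per line, no extension). A hypothetical helper for generating them from the annotation file names is sketched below; this is not part of the TF Object Detection API, and the function name, split ratio, and file names are assumptions based on the layout above:

```python
# Hypothetical helper: build the ImageSets/Main text files
# (raccoon_train.txt, raccoon_val.txt, raccoon_train_val.txt) from the
# annotation XML file names, using a simple 80/20 train/val split.
import os
import random

def write_image_sets(annotations_dir, main_dir, prefix="raccoon",
                     val_fraction=0.2, seed=42):
    # One entry per annotation XML, without the .xml extension.
    names = sorted(os.path.splitext(f)[0]
                   for f in os.listdir(annotations_dir)
                   if f.endswith(".xml"))
    random.Random(seed).shuffle(names)
    n_val = int(len(names) * val_fraction)
    val, train = names[:n_val], names[n_val:]
    os.makedirs(main_dir, exist_ok=True)
    for subset_name, subset in [("train", train), ("val", val),
                                ("train_val", names)]:
        path = os.path.join(main_dir, "%s_%s.txt" % (prefix, subset_name))
        with open(path, "w") as f:
            f.write("\n".join(sorted(subset)) + "\n")
    return len(train), n_val
```

Fixing the shuffle seed keeps the train/val split reproducible across runs.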

5- Now I train the model:

(abdu-py2) jesse@jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/train.py --logtostderr --pipeline_config_path=/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config --train_dir=/home/jesse/abdu-py2/models/model/train

Everything looks fine; after many thousands of training steps, it created files such as checkpoint and events.out.tfevents.1503337171 (among others).

However, my two problems are:

1- Based on this link, I cannot run the evaluation eval.py at the same time as train.py (for memory reasons).

2- I tried to use the events.out.tfevents.1503337171 file created during the training steps, but it seems it was not created correctly.

So I don't know where I went wrong. I think my data structure is not correct; I arranged it based on my own understanding.

Thanks in advance

Edit:-

Regarding Q2:

I figured out how to convert the events files and model.ckpt files (created during the training process) to inference_graph_.pb. The inference_graph_.pb can then be tested with object_detection_tutorial.ipynb. In my case I tried it, but I could not detect anything, since I made a mistake somewhere during the train.py process.

The following command converts the trained files to a .pb file:

(abdu-py2) jesse@jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/export_inference_graph.py \
    --input_type image_tensor \
    --pipeline_config_path /home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config \
    --trained_checkpoint_prefix /home/jesse/abdu-py2/models/model/train/model.ckpt-27688 \
    --output_directory /home/jesse/abdu-py2/models/model
— Abduoit

2 Answers


Question 1 - this is just a problem that you'll encounter because of your hardware. Once you get to a point where you'd like to evaluate the model, just stop your training and run your eval command (it seems as though you've successfully evaluated your model, so you know the command). It will provide you some metrics for the most recent model checkpoint. You can iterate through this process until you're comfortable with the performance of your model.

Question 2 - These event files are used as input into Tensorboard. The events files are in binary format, thus are not human readable. Start a Tensorboard application while your model is training and/or evaluating. To do so, run something like this:

tensorboard --logdir=train:/home/grasp001/abdu-py2/models/object_detection/train1/train,eval:/home/grasp001/abdu-py2/models/object_detection/train1/eval

Once you have Tensorboard running, use your web browser to navigate to localhost:6006 to check out your metrics. You can use this during training as well to monitor loss and other metrics for each step of training.

— ncaadam
  • Thx @ ncaadam Q1/ I have 8 GB GPU memory, should I use another GPU, or there is another way to avoid this issue. I trained the model until 28,000 steps. When I started eval.py I got this error WARNING:root:The following classes have no ground truth examples: 0 – Abduoit Aug 22 '17 at 22:42
  • I edited the question, Please have a look, my be I am mistaken somewhere. – Abduoit Aug 22 '17 at 23:30
  • If you get this warning this means your model is not well trained yet. Just continue training as long as you can see some results on Tensorboard which actually starts to emerge pretty early so if this doesn't happen then your training is still not correct. – Dat Tran Aug 23 '17 at 12:45
  • thx @DatTran, I trained until 27,688 steps, the total-loss is not good, it has decreased until half only, and I can't see mAP since there's no evaluation. No enough memory to run train.py and eval.py simultaneously. I need to know how to run both on single GPU (my gpu memory is 8 GB). So, I think the model is not trained properly because of certain mistakes in the preparation steps before start training, not because not enough training steps, since I already trained it until 27,688 steps which I think should be enough to get good model. – Abduoit Aug 23 '17 at 13:15
  • I've received this warning before as well. I always thought this was just a warning for class **0**. Class **0** is "unidentified" as class **1** should be the class you're trying to train (if you have 1 class). Ultimately, this warning can be ignored. Is my understanding incorrect @DatTran? This is brought up here: https://github.com/tensorflow/models/issues/1936 – ncaadam Aug 23 '17 at 14:10
  • It is also reference here: https://github.com/tensorflow/models/issues/1856 – ncaadam Aug 23 '17 at 14:32
  • @Abduoit - did you see your mAP on tensorboard when you ran the eval step for model-ckpt-27688? If so, what was it? Stopping training then running an eval step should be sufficient to check your model's progress. While it isn't ideal, it still works. I'm currently using the same methodology. – ncaadam Aug 23 '17 at 14:40
  • @ncaadam I have only one class, I used 200 images, and 200 annotation for training. I have stopped the train.py after 27,688 steps. I ran eval.py I got this error WARNING:root:The following classes have no ground truth examples: 0. I checked TensorBoard, the total-loss is bad, it has decreased until half, and mAP is very bad isn't started yet. – Abduoit Aug 23 '17 at 14:50
  • @ncaadam Would you tell me please, what do u hv in the /models/object_detection/VOCdevkit/VOC2012/ImageSets/Main. How did u create the files inside the /Main directory. ??? – Abduoit Aug 23 '17 at 14:55
  • @Abduoit That is not an error. Only a warning. You can disregard it. You should still get an actual mAP value in your tensorboard. Even if it is low, as long as your loss is dropping, it means training is occurring. If the lost number stays in the same range, something isn't right. If you're using the `object_detection_tutorial.ipynb` to evaluate a frozen model, you can lower the prediction threshold of what to show to something like `0.10`. If you do that, does it successfully identify anything? All-in-all i feel as though I've answered your question. Would you mind accepting my answer? – ncaadam Aug 23 '17 at 15:34
  • @ncaadam thx a ton for your help, The matched and unmatched threshold in file .config are 0.5. Now I am running train and eval together, I notice that total-loss and mAP are both getting better. I will wait until many thousands of training steps, then I will evaluate my file with **object_detection_tutorial.ipynb** . I just want to know what do u hv in **/models/object_detection/VOCdevkit/VOC2012/ImageSets/Main**. How did u create the files inside the **/Main directory**. ??? – Abduoit Aug 23 '17 at 16:29
  • I do not have that folder in my cloned repository. That is likely because I've never downloaded the PASCAL VOC dataset. I followed @DatTran's raccoon tutorial to create my TFRecord files, if that is what you're asking. – ncaadam Aug 23 '17 at 17:59
  • thx @ncaadam I would say your answers really solve the questions – Abduoit Aug 24 '17 at 22:09
  • @ncaadam I did make the mark green is that all, I even wanted to add my edited answer, but my answer is no longer accepted – Abduoit Sep 02 '17 at 16:36

In trainer.py, at line 370, after session_config is created, limit the GPU memory the process is allowed to use:

session_config.gpu_options.per_process_gpu_memory_fraction = 0.5

Then you can run eval.py at the same time. By default, TensorFlow grabs all the free GPU memory, whether or not it actually needs it.
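As a sketch, the change described above would look like this inside trainer.py (TF 1.x, where session_config is the tf.ConfigProto already created at that point in the file); allow_growth is a known alternative that allocates memory on demand instead of using a fixed cap:

```python
# Inside trainer.py, after session_config (a tf.ConfigProto) is created:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap this process at 50% of GPU memory

# Alternative: grow the GPU allocation on demand instead of a fixed cap.
# session_config.gpu_options.allow_growth = True
```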

— Eric Aya