
After spending a couple of days trying to achieve this task, I would like to share my experience of how I went about answering the question:

How do I use TS Object Detection to train using my own dataset?


2 Answers


This assumes the module is already installed. Please refer to their documentation if not.

Disclaimer

This answer is not meant to be the right or only way of training the object detection module. This is simply me sharing my experience and what has worked for me. I'm open to suggestions and to learning more about this, as I am still new to ML in general.

TL;DR

  1. Create your own PASCAL VOC format dataset
  2. Generate TFRecords from it
  3. Configure a pipeline
  4. Visualize

Each section of this answer has a corresponding Edit (see below). After reading each section, please read its Edit as well; clarifications, corrections and tips were added for each section.

Tools used

LabelImg: A tool for creating PASCAL VOC format annotations.

1. Create your own PASCAL VOC dataset

PS: For simplicity, the folder naming convention of my answer follows that of Pascal VOC 2012

A peek into the May 2012 dataset shows that its folder has the following structure:

+VOCdevkit
    +VOC2012
        +Annotations
        +ImageSets
            +Action
            +Layout
            +Main
            +Segmentation
        +JPEGImages
        +SegmentationClass
        +SegmentationObject

For the time being, amendments were made to the following folders:

Annotations: This is where all the images' corresponding XML files will be placed. Use the suggested tool above to create the annotations. Do not worry about the <truncated> and <difficult> tags, as they will be ignored by the training and eval binaries.
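
For reference, here is a trimmed-down sketch of what one of these annotation XML files looks like; the filename and box coordinates below are made-up values:

<annotation>
    <folder>VOC2012</folder>
    <filename>2008_000033.jpg</filename>
    <size>
        <width>500</width>
        <height>366</height>
        <depth>3</depth>
    </size>
    <object>
        <name>aeroplane</name>
        <truncated>0</truncated>
        <difficult>0</difficult>
        <bndbox>
            <xmin>9</xmin>
            <ymin>107</ymin>
            <xmax>499</xmax>
            <ymax>263</ymax>
        </bndbox>
    </object>
</annotation>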

JPEGImages: Location of your actual images. Make sure they are of type JPEG, because that's what their provided script currently supports for creating TFRecords.

ImageSets->Main: This simply consists of text files. For each class, there exists a corresponding train.txt, trainval.txt and val.txt. Below is a sample of the contents of the aeroplane_train.txt in the VOC 2012 folder

2008_000008 -1
2008_000015 -1
2008_000019 -1
2008_000023 -1
2008_000028 -1
2008_000033  1

The structure is basically the image name followed by a flag saying whether the corresponding object exists in that image or not. For example, image 2008_000008 does not contain an aeroplane, hence the -1, but image 2008_000033 does, hence the 1.

I wrote a small Python script to generate these text files. Simply iterate through the image names and assign a 1 or -1 next to them for object existence. I added some randomness among my text files by shuffling the image names. A sketch of such a script follows the next paragraph.

The {classname}_val.txt files contain the validation datasets. Think of these as the test data during training. You want to divide your dataset into training and validation sets. More info can be found here. The format of these files is similar to that of the training files.
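
As a rough illustration, below is a minimal sketch of such a generator script. It is not the exact script I used: the directory paths, the 80/20 train/val split and the contains_object() stub are assumptions you would replace with a real lookup against your annotations.

import os
import random

IMAGE_DIR = 'VOCdevkit/VOC2012/JPEGImages'  # assumed layout from above
OUT_DIR = 'VOCdevkit/VOC2012/ImageSets/Main'
CLASS_NAME = 'aeroplane'  # example class name
TRAIN_RATIO = 0.8  # assumed 80/20 train/val split

def contains_object(image_id):
    # Stub: replace with a real check against your annotations,
    # e.g. parse the image's XML file and look for CLASS_NAME.
    return random.random() > 0.5

# Collect image IDs (file names without extension) and shuffle for randomness.
image_ids = [os.path.splitext(f)[0] for f in os.listdir(IMAGE_DIR)
             if f.endswith('.jpg')]
random.shuffle(image_ids)

split = int(len(image_ids) * TRAIN_RATIO)
subsets = {'train': image_ids[:split], 'val': image_ids[split:]}

for subset, ids in subsets.items():
    out_path = os.path.join(OUT_DIR, '%s_%s.txt' % (CLASS_NAME, subset))
    with open(out_path, 'w') as f:
        for image_id in ids:
            flag = 1 if contains_object(image_id) else -1
            f.write('%s %2d\n' % (image_id, flag))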

At this point, your folder structure should be

+VOCdevkit
    +VOC2012
        +Annotations
            --(for each image, a generated annotation)
        +ImageSets
            +Main
                --(for each class, generated *classname*_train.txt and *classname*_val.txt)
        +JPEGImages
            --(a bunch of JPEG images)


1.1 Generating label map

With the dataset prepared, we need to create the corresponding label maps. Navigate to models/object_detection/data and open pascal_label_map.pbtxt.

This file is written in protobuf text format (it looks JSON-like) and assigns an ID and name to each item. Amend this file to reflect your desired objects.
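
For illustration, a label map for two hypothetical classes, dog and cat, would look like this (as the Edit below notes, IDs should start from 1 since 0 is reserved):

item {
  id: 1
  name: 'dog'
}
item {
  id: 2
  name: 'cat'
}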


2. Generate TFRecords

If you look into their code, especially this line, they explicitly grab aeroplane_train.txt only. For curious minds, here's why. Change this file name to any one of your class train text files.

Make sure VOCdevkit is inside models/object_detection then you can go ahead and generate the TFRecords.
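
For reference, a minimal invocation of their script looks something like the lines below. Treat this as a sketch: verify the flag names against the version of create_pascal_tf_record.py you actually have.

python object_detection/create_pascal_tf_record.py \
    --data_dir=VOCdevkit \
    --year=VOC2012 \
    --set=train \
    --output_path=pascal_train.record \
    --label_map_path=object_detection/data/pascal_label_map.pbtxt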

Please go through their code first should you run into any problems. It is self explanatory and well documented.


3. Pipeline Configuration

The instructions should be self-explanatory enough to cover this segment. Sample configs can be found in object_detection/samples/configs.

For those looking to train from scratch as I did, just make sure to remove the fine_tune_checkpoint and from_detection_checkpoint nodes. Here's what my config file looked like for reference.
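
As an illustration, the relevant portion of a sample config looks roughly like this (PATH_TO_BE_CONFIGURED is the placeholder used in their sample configs; the other train_config fields are omitted here):

train_config: {
  batch_size: 1
  # Remove these two lines when training from scratch:
  fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  from_detection_checkpoint: true
}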

From here on you can continue with the tutorial and run the training process.
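
Concretely, the training run from the tutorial of that time looked along these lines; double-check the flags against your version of train.py:

python object_detection/train.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --train_dir=${PATH_TO_TRAIN_DIR}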


4. Visualize

Be sure to run the eval in parallel to the training in order to be able to visualize the learning process. To quote Jonathan Huang

the best way is to just run the eval.py binary. We typically run this binary in parallel to training, pointing it at the directory holding the checkpoint that is being trained. The eval.py binary will write logs to an eval_dir that you specify which you can then point to with Tensorboard.

You want to see that the mAP has "lifted off" in the first few hours, and then you want to see when it converges. It's hard to tell without looking at these plots how many steps you need.
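
In practice, that means starting something like the following in a second terminal while train.py runs, then pointing Tensorboard at the eval directory. Treat the flags as a sketch and check them against your version of eval.py:

python object_detection/eval.py \
    --logtostderr \
    --pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
    --checkpoint_dir=${PATH_TO_TRAIN_DIR} \
    --eval_dir=${PATH_TO_EVAL_DIR}

tensorboard --logdir=${PATH_TO_EVAL_DIR}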


EDIT I (28 July '17):

I never expected my response to get this much attention so I decided to come back and review it.

Tools

For my fellow Apple users, you could actually use RectLabel for annotations.

Pascal VOC

After digging around, I finally realized that trainval.txt is actually the union of training and validation datasets.

Please look at their official development kit to understand the format even better.

Label Map Generation

At the time of my writing, ID 0 represents none_of_the_above. It is recommended that your IDs start from 1.

Visualize

After running your evaluation and pointing TensorBoard at your eval directory, it'll show you the mAP of each category along with each category's performance. This is good, but I like seeing my training data as well, in parallel with the eval.

To do this, run tensorboard on a different port and point it to your train directory

tensorboard --logdir=${PATH_TO_TRAIN} --port=${DESIRED_NUMBER}
eshirima
  • it's great to hear that someone got the TF "training locally" example up and running and also modified it for their own dataset! could you let me know what version of Python you used? I am constantly running into Python2.7/Python3 code issues when just trying to run their example for training on PASCAL. thanks! – AruniRC Jul 09 '17 at 16:52
  • I used Python 2.7 and I believe their codebase was written with that in mind. – eshirima Jul 09 '17 at 16:56
  • thanks. I ended up shifting to 2.7 as well and things were better. – AruniRC Jul 10 '17 at 17:40
  • How do you run eval.py simultaneously with train.py ? I have to do it on another device, for memory reasons, but I don't know how to specify the device... – gdelab Jul 12 '17 at 08:55
  • And do you know how to get the same summaries in tensorboard for the evaluation as for the training ? (TotalLoss, global_step/sec, etc., to compare eval and train more precisely, and get the inference time) – gdelab Jul 12 '17 at 09:00
  • @gdelab I have no idea on how to run _eval.py_ from another device. I ran mine locally by simply opening up a new terminal window. One way I can think of is by you uploading your results into a server, then pull them down on another device and run them there.. To answer your second question, even I've been trying to get that working but no success. I'll update [your Github issue](https://github.com/tensorflow/models/issues/1927) should I find a way – eshirima Jul 12 '17 at 11:27
  • @eshirima , so we need to first create an image classification model? perhaps using inception? – Michael Ramos Jul 12 '17 at 15:23
  • @rambossa. No no no no.. Classification is different from object localization. Classification simply tells you if a certain object exists in your image without actually telling you where in the image. Object localization extends this to include the location of the detected object. If what you want is classification, [this answer](https://stackoverflow.com/questions/44685875/tensorflow-and-opencv-real-time-classification) will help you. But if what you want is object detection, just follow the instructions of this answer. – eshirima Jul 12 '17 at 15:35
  • @eshirima perfect thanks, Detection is definitely what I want, just wasn't sure if classification was a requirement. Thanks for the good and helpful work – Michael Ramos Jul 12 '17 at 15:43
  • @rambossa Anytime.. Don't forget to upvote the answer if it helps resolve your issue.. Happy coding. Cheers mate!! – eshirima Jul 12 '17 at 15:46
  • @eshirima , how did you go about deciding on the size of the images in the training set? I currently have a set of large images, 2880X1800, but am worried these are too big. I would want the images to be as close to real-world resolutions as possible and am afraid that reducing the image size might hurt accuracy when detecting the actual objects. Thoughts? – Michael Ramos Jul 17 '17 at 16:41
  • 2880X1800 is too big for sure. If you look at the config file under `image_resizer`, the object detector ends up resizing every image to 300X300. I feed it images of 618X816 though and it still does a good job of detecting my desired classes. I'd recommend resizing the images first before running the detection to see what scales still maintain a good visual of your objects (this is what I did as well). You could also tweak around the parameters for `image_resizer`, run your detector and compare results. – eshirima Jul 17 '17 at 17:08
  • @eshirima thanks, so the resizer is also smart enough to adjust the annotations and bounding boxes drawn for the original images? – Michael Ramos Jul 17 '17 at 18:09
  • I can't give u a concrete answer to that but in the core, the bounding boxes are *estimates* of the location of the pixels that consist of a majority of your objects' attributes/features. The final box that you see is actually a result of multiple closely-packed boxes grouped together. The issue with feeding the entire 2880X1800 is that you'd end up having too many features that'd be impossible to hold in memory and computationally penalizing, resulting in a single layer computation taking a long time. – eshirima Jul 17 '17 at 18:21
  • The idea behind resizing is to find enough features such that they can be held in memory but are not as computationally penalizing. Theoretically, once it has learnt all of these features, it should be able to find them as well in larger images. But processing large frames is still an ongoing problem in computer vision. – eshirima Jul 17 '17 at 18:24
  • @eshirima I tried to train the model with just 5 example images. I converted my dataset to TFRecords in the Oxford-IIIT format (using the pet script), but my model gives errors when executing. It gives a warning about sparse matrices, so I thought of changing the batch size in the config file, but by default it is 1. So what should I change? – Shamane Siriwardhana Jul 27 '17 at 14:21
  • @ShamaneSiriwardhana Was it giving u errors or warnings? If errors, what were they? The batch size is just the number of images you want to feed in at a time. I'm not sure if that'll help resolve the warnings/errors. This [answer](https://stackoverflow.com/a/41176694/4962554) will help u understand batch sizes – eshirima Jul 27 '17 at 14:33
  • @eshirima I know what batch size is. In the default config it says one. Normally in SGD we take a mini batch, so why have they put a batch size of 1 in the config file? Yes, they are warnings (!!). This is the error: "The replica master 0 ran out-of-memory and exited with a non-zero status of 247" – Shamane Siriwardhana Jul 27 '17 at 14:38
  • @ShamaneSiriwardhana [Which](https://github.com/tensorflow/models/tree/master/object_detection/samples/configs) config file are you using? All of them, with the exception of the ssd extractors, have `batch_size: 1`. It does make sense why they'd do so: they don't know the users' available memory, so they set it to the lowest batch size. Also, how big are your images? How much memory do u have? I did my training on a 16GB machine. Try tweaking the image size parameters in the config file as well. Or it may be that the model is just too big to keep in memory. Train on _mobilenet_ instead then. – eshirima Jul 27 '17 at 14:59
  • I am using the cloud. The config file was the one in the pet detection tutorial; I actually trained the pet detector as given in the tutorial. Then I wanted to train it to detect two objects, so I took only 5 examples, converted them into tfrecords and uploaded them to the cloud as given in the tutorial. But I got errors saying "sparsity matrices can take lot of memory etc". I think it's lack of data to train on the cloud. What do you think? – Shamane Siriwardhana Jul 27 '17 at 15:05
  • @ShamaneSiriwardhana I trained mine locally. It's funny you should say the memory thing, because I received a warning as well: "UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory." But it never actually stopped or crashed. – eshirima Jul 27 '17 at 15:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/150331/discussion-between-eshirima-and-shamane-siriwardhana). – eshirima Jul 27 '17 at 15:12
  • @eshirima I think I found the error. It's my image size (nearly 3Mb) and the height and width. At what scale should I keep these things? – Shamane Siriwardhana Jul 27 '17 at 19:21
  • @ShamaneSiriwardhana You mentioned that your training images are 2880 by 1800. How big is the object you are trying to detect in each image? Did your training succeed eventually? I am building an object detector on my own data; all images are 1920 by 1080, and each object to be detected is only about 50 by 15. It is difficult for me to obtain decent results. – Jundong Sep 10 '17 at 16:51
  • Can the TS Object Detection be combined with transfer learning? – YuFeng Shen Oct 18 '17 at 08:15
  • How do I apply model to test images? How can I get rectangles? This is whats missing. – Stepan Yakovenko Jul 12 '19 at 19:55

I wrote a blog post on Medium about my experience as well, on how I trained an object detector (in particular, a Raccoon detector) with Tensorflow on my own dataset. This might also be useful for others and is complementary to eshirima's answer.

Dat Tran
  • I actually looked @ your real-time post as well and learnt a lot from it. A couple questions/suggestions. 1: In the config file, do you have an idea of what `num_hard_examples` and `num_examples` represents? 2: For image annotations on a Mac, you could've used [RectLabel](https://itunes.apple.com/us/app/rectlabel-labeling-images-for-object-detection/id1210181730?mt=12). 3: I was actually about to explore training on own dataset which isn't of Pascal Voc format. You beat me to the punch :) – eshirima Jul 28 '17 at 13:12
  • Hey thanks for the suggestions:) I had a look at RectLabel. Looks pretty good. I will give it a try. Concerning your first question, `num_hard_examples` has something to do with the hard example miner. Have a look at this [paper](https://arxiv.org/abs/1604.03540) to understand this. The `num_examples` has something to do with the evaluation. During the evaluation it fetches images and you need to specify how many you have. They also used `max_eval` to limit the evaluation process. For number 3:) Yeh doesn't matter haha it's not who comes first but learning from each other. – Dat Tran Jul 29 '17 at 19:52
  • Hi, did you add hard negative samples when training the SSD? The paper says it will work better if we add hard negative samples. – Shamane Siriwardhana Aug 07 '17 at 08:15
  • I read your blog @DatTran and I have one question! Can we train on our dataset using a CPU? – Yirga Aug 09 '17 at 08:03
  • @Yirga sure but that can take a while. – Dat Tran Aug 09 '17 at 08:11
  • @DatTran if we wanted to train for rectangular detection, and trained images on such rectangles, do you think your method would be better at creating a model that recognizes those specific/actual rectangular boundings (vs jumpy bounds like the racoon)? – Michael Ramos Aug 09 '17 at 16:47
  • @rambossa if you care about the stability of those rectangles, you should have a look at [ROLO](https://github.com/Guanghan/ROLO). – Dat Tran Aug 09 '17 at 20:58
  • @DatTran Hi, I am trying to train SSD-mobilenet to detect 13 classes. I also trained a faster rcnn-resnet101. My training images have a resolution of 265 * 450 (most of them) and each class had 400 images. Then this weird thing happened: faster rcnn converged faster with a batch size of 1, but my SSD didn't; it's not converging at all. Here are the loss graphs: https://stackoverflow.com/questions/45633957/ssd-mobilenet-object-detection-algorithm-not-converging – Shamane Siriwardhana Aug 11 '17 at 12:14
  • @eshirima I found a nice explanation of hard negative samples in the SSD research paper. Normally, when computing the loss, most of the default boxes are negative, so there is an imbalance. So we select the negative boxes with the highest confidence loss (i.e. the ones the algorithm most confidently failed to identify as background): "we sort them using the highest confidence loss for each default box and pick the top ones so that the ratio between the negatives and positives is at most 3:1. We found that this leads to faster optimization and a more stable training" – Shamane Siriwardhana Aug 14 '17 at 07:42
  • @ShamaneSiriwardhana Which paper? Is it titled _SSD: Single Shot MultiBox Detector_ – eshirima Aug 14 '17 at 12:25
  • @eshirima Yeah, that's the paper. Check with those quotes. And did you get any false positive examples? – Shamane Siriwardhana Aug 14 '17 at 13:27
  • @eshirima I have this question: which dataset format is good, Oxford-IIIT or PASCAL? I think with Oxford-IIIT, when converting data into TFRecords you can only have one class in a training image, no multiple types. – Shamane Siriwardhana Aug 14 '17 at 15:27
  • @ShamaneSiriwardhana I encountered some false positive detections after training. This is prone to happen because the model isn't guaranteed to always be 100% correct since the _mAP_ never fully converges to 0. Regarding data sets, I used PASCAL because it was the industry standard before [ImageNet](http://image-net.org/index) hence a larger community. – eshirima Aug 14 '17 at 17:25
  • @eshirima There is not much difference between PASCAL and Oxford-IIIT. The given converting script for the Oxford dataset uses the image name as the label of the data, which means you can only draw one bounding box per image, and that box must belong to the class indicated by the image name. But in the given script for converting the PASCAL dataset, it clearly obtains the class from the object name in the XML file, so you can draw multiple objects in one image. Anyway, my image set had only one bounding box per image. What about you? Did you have many ground truth boxes? – Shamane Siriwardhana Aug 14 '17 at 18:25
  • @eshirima Refer to this code from the TF-OD API: https://github.com/tensorflow/models/blob/master/object_detection/core/losses.py#L339 – Shamane Siriwardhana Aug 15 '17 at 07:55
  • Do you think that can affect the accuracy? – Shamane Siriwardhana Aug 15 '17 at 12:34
  • @DatTran do you have any insight on more accurate result bounding boxes? (ones that can skew and rotate with the object detected) vs the generic square around the result. I've seen some opencv examples that seem to do this... – Michael Ramos Aug 21 '17 at 18:33
  • May I know if the TS Object Detection can be combined with transfer learning? – YuFeng Shen Oct 18 '17 at 08:15
  • @Jason yes. you can load already trained models and refine them on your dataset. Consider asking another question with more details – Ciprian Tomoiagă Oct 26 '17 at 06:40