
I am trying to reproduce the training process of dlib's frontal_face_detector(). I am using the very dataset dlib says was used (from http://dlib.net/files/data/dlib_face_detector_training_data.tar.gz), taking the union of the frontal and profile faces plus their mirror reflections.

My problems are:

1. Very high memory usage for the whole dataset (30+ GB).
2. Training on a partial dataset does not yield a very high recall rate: 50-60 percent, versus frontal_face_detector()'s 80-90 (testing on a subset of images not used for training; see the evaluation sketch below).
3. The detectors work badly on low-resolution images and thus fail to detect faces more than 1-1.5 meters from the camera.
4. Training run time increases significantly with the SVM's C parameter, which I have to raise to achieve a better recall rate (I suspect this is just an overfitting artifact).
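
For reference, such precision/recall numbers can be measured with dlib's `test_object_detection_function`; a minimal sketch (file names are placeholders):

    #include <dlib/svm_threaded.h>
    #include <dlib/image_processing.h>
    #include <dlib/data_io.h>
    #include <iostream>

    using namespace dlib;

    int main()
    {
        typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

        // Held-out images and their ground-truth face boxes (file name is a placeholder).
        std::vector<array2d<unsigned char> > images_test;
        std::vector<std::vector<rectangle> > boxes_test;
        load_image_dataset(images_test, boxes_test, "testing.xml");

        // A previously trained detector (file name is a placeholder).
        object_detector<image_scanner_type> detector;
        deserialize("detector.svm") >> detector;

        // Prints a 1x3 matrix: precision, recall, average precision.
        std::cout << test_object_detection_function(detector, images_test, boxes_test)
                  << std::endl;
        return 0;
    }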

My original motivation for training was:

a. gaining the ability to adapt to the specific environment where the camera is installed, e.g. by hard negative mining;
b. improving both detection in depth and run time by reducing the 80x80 window to 64x64 or even 48x48 (see the sketch below).
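
The window size in point b. is just a scanner setting in dlib; a minimal sketch, assuming the standard `scan_fhog_pyramid` setup:

    // A 64x64 detection window lets the scanner match smaller (more distant)
    // faces and shrinks the feature vector, at some cost in accuracy.
    dlib::scan_fhog_pyramid<dlib::pyramid_down<6> > scanner;
    scanner.set_detection_window_size(64, 64);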

Am I on the right path? Am I missing anything? Please help...

Eldar Ron

1 Answer


The training parameters were recorded in a comment in dlib's code, here: http://dlib.net/dlib/image_processing/frontal_face_detector.h.html. For reference:

    It is built out of 5 HOG filters. A front looking, left looking, right looking, 
    front looking but rotated left, and finally a front looking but rotated right one.

    Moreover, here is the training log and parameters used to generate the filters:
    The front detector:
        trained on mirrored set of labeled_faces_in_the_wild/frontal_faces.xml
        upsampled each image by 2:1
        used pyramid_down<6> 
        loss per missed target: 1
        epsilon: 0.05
        padding: 0
        detection window size: 80 80
        C: 700
        nuclear norm regularizer: 9
        cell_size: 8
        num filters: 78
        num images: 4748
        Train detector (precision,recall,AP): 0.999793 0.895517 0.895368 
        singular value threshold: 0.15

    The left detector:
        trained on labeled_faces_in_the_wild/left_faces.xml
        upsampled each image by 2:1
        used pyramid_down<6> 
        loss per missed target: 2
        epsilon: 0.05
        padding: 0
        detection window size: 80 80
        C: 250
        nuclear norm regularizer: 8
        cell_size: 8
        num filters: 63
        num images: 493
        Train detector (precision,recall,AP): 0.991803  0.86019 0.859486 
        singular value threshold: 0.15

    The right detector:
        trained left-right flip of labeled_faces_in_the_wild/left_faces.xml
        upsampled each image by 2:1
        used pyramid_down<6> 
        loss per missed target: 2
        epsilon: 0.05
        padding: 0
        detection window size: 80 80
        C: 250
        nuclear norm regularizer: 8
        cell_size: 8
        num filters: 66
        num images: 493
        Train detector (precision,recall,AP): 0.991781  0.85782 0.857341 
        singular value threshold: 0.19

    The front-rotate-left detector:
        trained on mirrored set of labeled_faces_in_the_wild/frontal_faces.xml
        upsampled each image by 2:1
        used pyramid_down<6> 
        rotated left 27 degrees
        loss per missed target: 1
        epsilon: 0.05
        padding: 0
        detection window size: 80 80
        C: 700
        nuclear norm regularizer: 9
        cell_size: 8
        num images: 4748
        singular value threshold: 0.12

    The front-rotate-right detector:
        trained on mirrored set of labeled_faces_in_the_wild/frontal_faces.xml
        upsampled each image by 2:1
        used pyramid_down<6> 
        rotated right 27 degrees
        loss per missed target: 1
        epsilon: 0.05
        padding: 0
        detection window size: 80 80
        C: 700
        nuclear norm regularizer: 9
        cell_size: 8
        num filters: 89
        num images: 4748
        Train detector (precision,recall,AP):        1 0.897369 0.897369 
        singular value threshold: 0.15
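
For concreteness, here is a minimal sketch of how the front detector's logged parameters map onto dlib's C++ training API (this is not the original training script; the dataset path, thread count, and output file name are placeholders):

    #include <dlib/svm_threaded.h>
    #include <dlib/image_processing.h>
    #include <dlib/data_io.h>
    #include <iostream>

    using namespace dlib;

    int main()
    {
        typedef scan_fhog_pyramid<pyramid_down<6> > image_scanner_type;

        std::vector<array2d<unsigned char> > images;
        std::vector<std::vector<rectangle> > boxes;
        load_image_dataset(images, boxes, "frontal_faces.xml"); // placeholder path

        // "upsampled each image by 2:1"
        upsample_image_dataset<pyramid_down<2> >(images, boxes);
        // "trained on mirrored set of ... frontal_faces.xml"
        add_image_left_right_flips(images, boxes);

        image_scanner_type scanner;
        scanner.set_detection_window_size(80, 80);            // detection window size: 80 80
        scanner.set_cell_size(8);                             // cell_size: 8
        scanner.set_padding(0);                               // padding: 0
        scanner.set_nuclear_norm_regularization_strength(9);  // nuclear norm regularizer: 9

        structural_object_detection_trainer<image_scanner_type> trainer(scanner);
        trainer.set_num_threads(4);             // adjust to your machine
        trainer.set_c(700);                     // C: 700
        trainer.set_epsilon(0.05);              // epsilon: 0.05
        trainer.set_loss_per_missed_target(1);  // loss per missed target: 1
        trainer.be_verbose();

        object_detector<image_scanner_type> detector = trainer.train(images, boxes);

        // "singular value threshold: 0.15" -- compress the learned filter bank.
        detector = threshold_filter_singular_values(detector, 0.15);
        std::cout << "num filters: " << num_separable_filters(detector) << std::endl;

        serialize("front_detector.svm") << detector; // placeholder output name
        return 0;
    }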

What the parameters are and how to set them is all explained in the dlib documentation. There is also a paper that describes the training algorithm: Max-Margin Object Detection (https://arxiv.org/abs/1502.00046).

Yes, it can take a lot of RAM to run the trainer.

Davis King
  • Where can I find the dataset and the XMLs for download? – Eldar Ron Jul 03 '17 at 20:13
  • It's the URL you posted. – Davis King Jul 03 '17 at 20:26
  • Found it, thanks. Regarding the rotate-left, rotate-right versions: are they augmented, i.e., computed from frontal faces artificially, and how? – Eldar Ron Jul 05 '17 at 05:06
  • I was wondering if it's a simple in-plane rotation, or a projective transformation with a change of perspective (around the image's y axis). I assumed it is in-plane but wasn't sure. – Eldar Ron Jul 06 '17 at 12:55
  • Just in-plane rotation. – Davis King Jul 06 '17 at 12:56
  • What does `upsampled each image by 2:1` mean? Using `time ./examples/build/train_object_detector -tv ./examples/dlib_face_detector_training_data/frontal_faces.xml -u1 --flip --threads 12 --target-size 6400`; as I understand it, `-u1` upsamples images 2x, but some bboxes in the dataset are still too small: `Error! An impossible set of object boxes was given for training. All the boxes need to have a similar aspect ratio and also not be smaller than about 1600 pixels in area` – mrgloom Mar 04 '19 at 12:11
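
On the last comment: `upsampled each image by 2:1` means each training image was upsampled once so that both its dimensions doubled, which dlib exposes as `upsample_image_dataset`. Boxes that remain much smaller than the detection window after upsampling trigger the "impossible set of object boxes" error, so the offending annotations have to be fixed, marked ignore, or the dataset upsampled again. A minimal sketch (dataset path is a placeholder):

    #include <dlib/image_processing.h>
    #include <dlib/data_io.h>
    #include <algorithm>
    #include <iostream>
    #include <limits>

    using namespace dlib;

    int main()
    {
        std::vector<array2d<unsigned char> > images;
        std::vector<std::vector<rectangle> > boxes;
        load_image_dataset(images, boxes, "frontal_faces.xml"); // placeholder path

        // "upsampled each image by 2:1": one pyramid_up step that doubles both
        // image dimensions and scales the ground-truth boxes with the pixels,
        // growing every box area by 4x.
        upsample_image_dataset<pyramid_down<2> >(images, boxes);

        // Report the smallest remaining box so annotations that are still too
        // small for the detection window can be found and dealt with.
        unsigned long min_area = std::numeric_limits<unsigned long>::max();
        for (const auto& rects : boxes)
            for (const auto& r : rects)
                min_area = std::min(min_area, r.area());
        std::cout << "smallest box area: " << min_area << std::endl;
        return 0;
    }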