
I'm trying to train a HAAR classifier with OpenCV 2.4 to detect the head of squash rackets. Unfortunately the results in terms of accuracy are fairly bad and I'd like to understand what part of my process is flawed. At this point I'm not worried about performance as I won't be using it as a real time detector.

Negative samples

  • I used an online image database to obtain random pictures of varying widths and heights.
  • I also added a handful of squash-related negative images, such as empty courts or pictures of players on court where no racket head is visible (fewer than 20 in total). A minimal sketch of building the neg.txt background file from such images follows after this list.
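For reference, here is a minimal sketch of how a background description file like img/neg.txt (one negative image path per line, as expected by the -bg option) could be generated; the directory name is an assumption, not taken from the setup above.

```python
# Minimal sketch: write one negative image path per line into img/neg.txt,
# the background description file passed via -bg below.
# "img/negatives" is an assumed directory name, not part of the original setup.
import os

neg_dir = "img/negatives"
with open("img/neg.txt", "w") as f:
    for name in sorted(os.listdir(neg_dir)):
        if name.lower().endswith((".jpg", ".jpeg", ".png", ".bmp")):
            f.write(os.path.join(neg_dir, name) + "\n")
```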

Positive samples

I created a total of 4168 positive samples, of which

  • 168 are manually annotated shots of game recordings (the description-file format used for these annotations is sketched after this list)
  • 4000 are samples created using opencv_createsamples
    opencv_createsamples -img img/sample/r2_white.png -bg img/neg.txt -info img/generated/info.txt -pngoutput img/generated -maxxangle 0.85 -maxyangle -0.85 -maxzangle 0.85 -num 4000
    I used relatively high max angles as I felt this would be more representative of how Squash rackets occur on match recordings.
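For clarity, the description file that opencv_createsamples and opencv_traincascade expect for annotated positives contains one line per image: the image path, the number of objects, and an x y w h box per object. The file names and coordinates below are made up purely for illustration:

```
img/annotated/frame_0001.png 1 140 82 46 44
img/annotated/frame_0002.png 2 63 120 40 41 298 95 38 39
```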

Vector

After consolidating the annotations of the manually annotated and the generated samples, I created the vector with the following parameters:
opencv_createsamples -info img/pos_all.txt -num 4168 -w 25 -h 25 -vec model/vector/positives_all.vec -maxxangle 0.85 -maxyangle -0.85 -maxzangle 0.85
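A minimal sketch of that consolidation step, assuming the manual annotations live in their own description file (img/annotated/info.txt is an assumed name; img/generated/info.txt and img/pos_all.txt are the paths used above):

```python
# Merge the manual annotation file and the generated info.txt into pos_all.txt.
# "img/annotated/info.txt" is an assumed file name; the other two paths come
# from the commands above.
sources = ["img/annotated/info.txt", "img/generated/info.txt"]

with open("img/pos_all.txt", "w") as out:
    for src in sources:
        with open(src) as f:
            for line in f:
                if line.strip():
                    out.write(line.rstrip("\n") + "\n")
```

Depending on where the tools are invoked from, the image paths inside the merged file may need adjusting so they stay valid relative to the working directory.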

Training

I trained the model with the following parameters. Again, I added -mode ALL as I felt that rotated features would be more representative of real-world squash games.
opencv_traincascade -data ../model -vec ../model/vector/positives_all.vec -bg neg.txt -numPos 3900 -numNeg 7000 -numStages 10 -w 25 -h 25 -numThreads 12 -maxFalseAlarmRate 0.3 -mode ALL -precalcValBufSize 3072 -precalcIdxBufSize 3072
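As a sanity check on numPos, a commonly quoted rule of thumb (not an exact guarantee) is that the .vec file must hold at least numPos plus the extra positives that later stages consume to replace samples rejected at the minHitRate threshold:

```python
# Rough rule-of-thumb check (approximate, not an exact OpenCV guarantee).
numPos = 3900
numStages = 10
minHitRate = 0.995   # opencv_traincascade default; not overridden in the command above

needed = numPos + (numStages - 1) * (1 - minHitRate) * numPos
print(needed)        # ~4075.5, which fits within the 4168 samples packed into the vector
```

This is also roughly in line with the POS count : consumed 3900 : 4095 line in the stage-9 output below.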

The training took about 10 hours in total, but even at the 100th weak classifier of the last stage the false alarm rate was still 0.84 (assuming I interpret the training output correctly). The lowest value was 0.74, at the end of stage 5.

===== TRAINING 9-stage =====
<BEGIN
POS count : consumed 3900 : 4095
NEG count : acceptanceRatio 7000 : 0.0304295
Precalculation time: 16

N HR FA
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 0.998857
... ... ...
98 0.995128 0.840857
99 0.995128 0.850571
100 0.995128 0.842714

END>

Outcome

The classifier doesn't seem to do a great job, producing plenty of false positives as well as false negatives. I played around with the minNeighbors and scaleFactor parameters, to no avail. In the case below I'm using detectMultiScale(gray, 2, 75).
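For reference, a minimal detection sketch (Python API) showing which parameters the call above maps to; the cascade path and test image are placeholders rather than files from the setup above.

```python
import cv2

# Placeholders: adjust the cascade path (output of opencv_traincascade) and the test image.
cascade = cv2.CascadeClassifier("model/cascade.xml")
img = cv2.imread("img/test/frame.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Equivalent to detectMultiScale(gray, 2, 75):
# scaleFactor=2 halves the search resolution between pyramid levels (quite coarse),
# minNeighbors=75 requires many overlapping raw detections before a hit is reported.
boxes = cascade.detectMultiScale(gray, scaleFactor=2, minNeighbors=75)

for (x, y, w, h) in boxes:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
```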

Questions

  1. Is my use case realistic? Could there be any reason that makes rackets particularly hard to detect?
  2. Are my positive samples sufficient?
    • Could the angles or the lack of a transparent background in the generated samples be a problem?
    • Or is the ratio of manually annotated to generated samples (168:4000) too low?
  3. Are the number and ratio of positive to negative samples used for training (3900:7000) appropriate?
  4. Is my approach to training appropriate?
    • Is there anything wrong with my training parameters (e.g. feature height/width in the context of racket shape)?
    • What could be the reason for my false alarm rate to stagnate during training?
  • From my understanding, haar classifiers have problems with bigger rotations (x, y, z). The successful haar classifiers for human faces are trained on a single orientation, which is why there is one cascade for frontal faces and another for profile faces. In addition, you will need more negative samples (ideally empty courts and persons). When creating samples you should make sure a transparent background is applied. Better to increase the number of stages and raise the max false alarm rate to 0.5. – Micka May 31 '21 at 04:59
  • And think about using deep learning detectors; they are way better! – Micka May 31 '21 at 05:00
  • Thanks @Micka. Especially the rotation bit is a good clue. I thought that I'd probably have to look at deep learning detectors to make this example work, but at the same time I'd just like to figure out what the major issues are with my approach here. – tgikf May 31 '21 at 05:10
  • As an example, for negative samples I've used 150k small negative images and about 50k full-size images, with 15k positive object samples and these settings: `-numPos 12500 -numNeg 25000 -minHitRate 0.999 -maxFalseAlarmRate 0.5 -w 24 -h 24`. The most important information during training is the NEG count acceptance ratio: it should roughly agree with your max false alarm rate per stage, so with a maxFAR of 0.5 you should see about 0.5 after stage 0, 0.5^2 after stage 1, etc. If this value is much smaller, your negative samples are not distinct enough; if it is too high, the classifier is too weak. – Micka May 31 '21 at 07:25
  • All samples were directly from an industrial use case, so training and target domain were known quite well. – Micka May 31 '21 at 07:33
  • Thanks for the extra context. In my case, how is `acceptanceRatio 7000 : 0.0304295` to be read? Is the ratio 0.03 and hence way too low or is it (7000/0.03) and way too high? – tgikf May 31 '21 at 09:39
  • 0.03 in stage 9 vs. a theoretical 0.3^9 = 0.000019683 is, imho, an indicator that the object is too complicated to train (e.g. the different rotations don't share enough features). – Micka May 31 '21 at 09:59
  • However, there can be jumps. For example, with 0.5 max FAR I observed these acceptance ratios in my case: stage 6: 0.00194; stage 7: 0.000518; stage 8: 0.000087; stage 9: 0.00011. – Micka May 31 '21 at 10:03

0 Answers