
I am training various CNNs (AlexNet, InceptionV3 and ResNet). The dataset consists of screen captures of a game and, for each capture, a 4-element label array representing the keys pressed for that capture as [w,a,s,d].

To reduce the data I need to gather, I've looked into mirroring captures whose classes appear less frequently. If I were mirroring a left-turning capture, for example, I would also change the labels so [0,1,0,0] would become [0,0,0,1]. I'm unsure if mirroring will work, as the minimap in the bottom-left corner of the original images contains a GPS route.

I haven't trained any models yet.

I am mirroring the images and adjusting the labels via OpenCV:

# choice is the original [w, a, s, d] label list for img
new_choice = list(choice)  # copy so the original label stays untouched
# swap 'a' and 'd'; a plain swap also handles the case where both are pressed
new_choice[1], new_choice[3] = choice[3], choice[1]

if new_choice != choice:
    cv2.imshow('capture', img)
    print("capture:", choice)

    flip = cv2.flip(img, 1)  # flipCode=1 flips around the vertical axis
    cv2.imshow('flipped', flip)
    print("flipped:", new_choice)
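The same transform can be sketched without OpenCV for testing the label logic in isolation. This is a minimal sketch that treats an image as a nested row-major list; `mirror_example` is a hypothetical helper name, not part of the training code:

```python
def mirror_example(img, choice):
    """Horizontally flip an image (nested lists, row-major) and swap the a/d labels."""
    flipped = [row[::-1] for row in img]  # reversing each row mirrors left/right
    w, a, s, d = choice
    return flipped, [w, d, s, a]          # 'a' and 'd' trade places; 'w'/'s' unchanged

# a tiny 1x3 "image" and a left-turn label
img = [[1, 2, 3]]
flipped, new_choice = mirror_example(img, [0, 1, 0, 0])
```

Note that expressing the label change as a swap keeps a both-keys-held label such as [0,1,0,1] unchanged, which is the correct mirror of itself.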

What impact will a mirrored training dataset have on the CNN?
I.e., will it fail to learn from the minimap in the bottom-left corner, since it only appears there in half of the training examples?

Example capture and its mirrored counterpart

Roqux
  • There can be a lot of problems, e.g. if you classify traffic signs, some of them can be mirrored and others will have a whole different meaning after mirroring (e.g. turn-left/right signs). You already mentioned the minimap in your domain. If the minimap (or any other statically placed graphic) is important for your classification, you should not mirror. If left/right direction is important in your domain, don't mirror. If targets can only occur in one "appearance orientation" in your domain, don't mirror. In the end: just try/compare it and post your results ;) – Micka Apr 25 '19 at 06:31
  • I have had an initial play around. 10,000 examples over 30 epochs reached ~70% accuracy. 10,000 examples plus 10,000 mirrored examples over 60 epochs reached ~55% accuracy. Both had a batch size of 16 (hardware limits...). I will do a more in-depth experiment, keeping certain variables constant, at a later date and post the results :) – Roqux Apr 30 '19 at 20:09
  • @Micka FYI, I have done some experimenting and added an answer with the results. The TLDR is basically the mirrored dataset only seems to work when using HSV or YCrCb as the image channels – Roqux May 04 '19 at 23:09

1 Answer


Experimental results

Constants

  • Library: TFLearn
  • Base model: AlexNet
  • Input size: 256 by 192 px
  • Output size: 4 (multi-label)
  • Output activation: Sigmoid
  • Loss function: Binary crossentropy
  • Optimizer: Momentum
  • Epochs: 30
  • Learning rate: 1e-3
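The sigmoid output with binary cross-entropy treats each of the four keys as an independent yes/no decision, which fits the [w,a,s,d] labels since several keys can be held at once (softmax would force the four outputs to compete). A minimal pure-Python sketch of that loss, not the TFLearn implementation:

```python
import math

def sigmoid(x):
    """Squash a logit into (0, 1) independently for each key."""
    return 1.0 / (1.0 + math.exp(-x))

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the 4 independent key labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

logits = [2.0, -1.0, -3.0, 0.5]
probs = [sigmoid(z) for z in logits]            # each key scored independently
loss = binary_crossentropy([1, 0, 0, 1], probs)
```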

Independent variables

  • Original dataset: 10981 input/output pairs
  • Mirrored dataset: 20997 input/output pairs
  • Channels: RGB, Greyscale, HSV, YCrCb
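The four channel options are plain colour-space conversions of the same captures; with OpenCV this would be `cv2.cvtColor` with `COLOR_BGR2GRAY`, `COLOR_BGR2HSV`, or `COLOR_BGR2YCrCb`. As a sketch, the grey value and the Y channel of YCrCb both come from the same BT.601 luma weighting of RGB:

```python
def luma(r, g, b):
    """BT.601 luma: the Y used by both greyscale and YCrCb conversions.
    Weights reflect the eye's sensitivity: green most, blue least."""
    return 0.299 * r + 0.587 * g + 0.114 * b

y_white = luma(255, 255, 255)  # pure white maps to full luma
y_green = luma(0, 255, 0)      # green alone carries most of the luma
y_blue = luma(0, 0, 255)       # blue alone carries the least
```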

Results

Training accuracy and loss of various models once trained

╔════════════════════╤══════════════════╤══════════════════╗
║ Dataset x Channels │ Original         │ Mirrored         ║
║                    │ (Accuracy, Loss) │ (Accuracy, Loss) ║
╠════════════════════╪══════════════════╪══════════════════╣
║ RGB                │ 0.7843, 0.5767   │ 0.6966, 0.579    ║
╟────────────────────┼──────────────────┼──────────────────╢
║ Grey               │ 0.8464, 0.576    │ 0.7206, 0.6204   ║
╟────────────────────┼──────────────────┼──────────────────╢
║ HSV                │ 0.7515, 0.563    │ 0.8301, 0.562    ║
╟────────────────────┼──────────────────┼──────────────────╢
║ YCrCb              │ 0.794,  0.6313   │ 0.8536, 0.612    ║
╚════════════════════╧══════════════════╧══════════════════╝

The TensorBoard result of training

These results are for the training dataset, as I was having issues getting validation to work (validation worked fine with categorical_crossentropy but stopped working when I switched to binary_crossentropy).

Summary

  • The RGB and Greyscale models performed better on the original dataset than on the mirrored dataset.
  • The HSV and YCrCb models performed better on the mirrored dataset than on the original dataset.
  • All models eventually started losing accuracy except for YCrCb.
    • YCrCb on the original dataset held constant.
    • YCrCb on the mirrored dataset began to trend upward.

EDIT

I had been investigating why the accuracy was ~80% at the start of training.

If the truth label was [1,0,0,0] and the prediction was [0,0,0,0], the accuracy is 75%, as three of the four labels were correctly guessed...
I am currently looking into a better way to calculate accuracy (Hamming score, confusion matrix, etc.).
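The inflated baseline is easy to reproduce. A sketch comparing naive per-label accuracy (which behaves like the reported metric) with exact-match accuracy, one of the stricter multi-label alternatives under consideration:

```python
def per_label_accuracy(y_true, y_pred):
    """Fraction of individual labels guessed correctly; rewards all-zero guesses."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)

def exact_match(y_true, y_pred):
    """Subset accuracy: 1.0 only if every label in the example is correct."""
    return 1.0 if y_true == y_pred else 0.0

truth, pred = [1, 0, 0, 0], [0, 0, 0, 0]
per_label_accuracy(truth, pred)  # 0.75 - three of four labels match
exact_match(truth, pred)         # 0.0  - the example as a whole is wrong
```

With mostly-sparse [w,a,s,d] labels, a model that predicts nothing at all already scores high on the per-label metric, which explains the ~80% starting accuracy.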

Roqux