
I am working on a project for a university computer vision exam whose objective is to analyse different road scenarios from videos, for example performing instance segmentation on street images to recognise and catalogue various objects.

For such a project it is easy to acquire new data to enrich the dataset. Can problems arise when training the neural network (or even doing inference) on frames taken from a video encoded as mp4? Is it always better to use frames taken from a video saved in a raw format?

This question arose because mp4 (obviously) compresses the frames by performing predictions on the inter frames, so we end up with pixel values that are different from the original ones.
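To make the concern concrete, here is a minimal sketch (assuming a hypothetical folder `raw_frames/` of losslessly stored PNG frames) that round-trips the frames through an mp4 with OpenCV and measures how much the pixel values change:

```python
import glob
import cv2
import numpy as np

# Hypothetical folder of losslessly stored frames (e.g. PNGs grabbed from the sensor).
frames = [cv2.imread(p) for p in sorted(glob.glob("raw_frames/*.png"))]

h, w = frames[0].shape[:2]
writer = cv2.VideoWriter("roundtrip.mp4",
                         cv2.VideoWriter_fourcc(*"mp4v"), 30, (w, h))
for f in frames:
    writer.write(f)
writer.release()

cap = cv2.VideoCapture("roundtrip.mp4")
diffs = []
for f in frames:
    ok, decoded = cap.read()
    if not ok:
        break
    # Mean absolute per-pixel difference between the original and the decoded frame.
    diffs.append(np.abs(f.astype(np.int16) - decoded.astype(np.int16)).mean())
cap.release()

print("mean abs. difference per frame:", np.round(diffs, 2))
```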

  • I'd say **no, it doesn't matter**, because sensors are already noisy. Compression, if it's not too severe, won't make the situation worse; the network will learn to deal with noise. Do you even have the space for uncompressed video? That's an insane amount. – Christoph Rackwitz Jun 04 '22 at 11:22
  • Compression artifacts are different from sensor noise. Best is to train a model with data as diverse as possible, so it generalizes best (and learns the right visual features instead of over-specialized ones). If you want to use as little data as possible, use exactly the same data as during inference (same sensor, same lens, same noise). You could even encode and decode during inference, but that's unnecessarily expensive. In practice, jpg (or mp4) compression during training introduces only a small difference to the DNN. – Micka Jun 04 '22 at 13:42
  • One thought: if you already know the input size of the DNN (e.g. 608x608 for a YOLO v3), you can compare full-size image encoding (e.g. 1920x1280) + decoding + resizing/interpolation artifacts vs. uncompressed DNN-input-size images (resized/interpolated and then saved uncompressed). Probably very little difference in the DNN input either way. – Micka Jun 05 '22 at 11:04

1 Answer


> Is it always better to use frames taken from a video saved in raw?

No, especially if you will do inference on video that comes from a similar or the same video stream (and is compressed similarly). Unless the compression quality is very bad and/or the objects to be recognized are very small (a few pixels) and "damaged" by pixel-precise discontinuities of the compression (block boundaries etc.), which may cause some confusion, the results with mp4 frames would probably be practically "the same".

If you are going to do inference on mp4 etc. from the Internet, it is better to train at that quality grade (or to reduce the quality of your raw/png input, e.g. by saving to jpg), rather than to train on high quality and then infer on low quality.

You may also mix inputs of higher and lower quality than the base level and watch how it goes during training; it may help the model generalize better.
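As a minimal sketch of that idea (the function name, quality range and image path are my own assumptions, not from the answer), you could re-encode each training image to JPEG at a random quality before feeding it to the network:

```python
import random
import cv2
import numpy as np

def random_jpeg_quality(image, q_min=40, q_max=95):
    """Re-encode an image to JPEG at a random quality and decode it again.

    This simulates the kind of lossy compression the network will see at
    inference time (mp4/jpg sources) and can be used as a data augmentation.
    """
    quality = random.randint(q_min, q_max)
    ok, buffer = cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, quality])
    if not ok:
        return image  # fall back to the original frame if encoding fails
    return cv2.imdecode(buffer, cv2.IMREAD_COLOR)

# Example usage on a single frame (path is hypothetical):
frame = cv2.imread("street_frame.png")
augmented = random_jpeg_quality(frame)
print("mean abs. difference:",
      np.abs(frame.astype(np.int16) - augmented.astype(np.int16)).mean())
```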

One example from another use case: deep fakes (DeepFaceLab). You can extract faces for training as PNG or JPG. The default is JPG at quality 90, and it is usually considered "enough". The convolutional NN smooths the input and some of the artifacts anyway.

Also, even if you use very high quality photos, if the encoder-decoder does not have enough dimensions to encode the full detail, the results will be similar to those obtained with lower quality input. Depending on the dimensions chosen, the models usually cannot deal properly with beards and moustaches, or with fine skin texture etc. (i.e. high-frequency detail), even if the input image is very sharp.

> predictions on the inter frames, we end up with pixel values that are different from the original ones.

The exact pixel values would be different in intra frames too, unless lossless compression is used, but that is not necessary, because the NN is supposed to discover features more general than an exact match of the pixels: gradients, co-occurring gradients, some kind of contrast, shapes etc.
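As a rough illustration of that point (my own sketch with a hypothetical image path, not part of the answer), you can check that the gradient structure survives a lossy roundtrip largely intact by comparing Sobel gradient magnitudes before and after re-encoding:

```python
import cv2
import numpy as np

def gradient_magnitude(gray):
    # Sobel derivatives in x and y, combined into a gradient magnitude map.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    return cv2.magnitude(gx, gy)

# Hypothetical losslessly stored frame.
original = cv2.imread("street_frame.png", cv2.IMREAD_GRAYSCALE)

# Lossy roundtrip (JPEG here as a stand-in for the video codec's lossy step).
_, buf = cv2.imencode(".jpg", original, [cv2.IMWRITE_JPEG_QUALITY, 80])
compressed = cv2.imdecode(buf, cv2.IMREAD_GRAYSCALE)

g_orig = gradient_magnitude(original)
g_comp = gradient_magnitude(compressed)

# A correlation close to 1 means the gradient structure is largely preserved.
corr = np.corrcoef(g_orig.ravel(), g_comp.ravel())[0, 1]
print("gradient map correlation:", round(float(corr), 4))
```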

If you want to do exact pixel matching, you could use template matching techniques, but even there a pixel-perfect match is not necessary: https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html
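For reference, a minimal template matching sketch along the lines of that OpenCV tutorial (the image and template paths are hypothetical):

```python
import cv2

# Hypothetical inputs: a scene image and a smaller template cropped from it.
scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)
th, tw = template.shape[:2]

# Normalized cross-correlation tolerates small pixel-value differences
# such as those introduced by lossy compression.
result = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

top_left = max_loc
bottom_right = (top_left[0] + tw, top_left[1] + th)
print(f"best match at {top_left}, score {max_val:.3f}")
```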
