I was wondering whether, in machine learning, it is acceptable to have a dataset that contains the same input multiple times, but each time with a different (valid!) output. For instance, in machine translation the same input sentence might appear several times, each time paired with a different translation.
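To make the setup concrete, here is a toy sketch of the kind of dataset I mean (the sentence pairs are made up for illustration):

```python
from collections import Counter

# Toy parallel corpus: the same source sentence appears more than once,
# each time paired with a different, equally valid reference translation.
parallel_corpus = [
    ("Ik ben moe.", "I am tired."),
    ("Ik ben moe.", "I'm tired."),
    ("Ik ben moe.", "I am weary."),
    ("Het regent.", "It is raining."),
    ("Het regent.", "It's raining."),
]

# Several training pairs share an identical input but differ in output.
source_counts = Counter(src for src, _ in parallel_corpus)
print(source_counts)  # Counter({'Ik ben moe.': 3, 'Het regent.': 2})
```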
On the one hand, I would say that this is definitely acceptable, because the differences in output might help the model capture small latent features, leading to better generalisation. On the other hand, I fear that having the same input multiple times would bias the model towards that particular input, meaning that the first layers (in a deep neural network) might be "overfitted" on it. Specifically, this can become tricky when the same input is seen multiple times in the test set but never in the training set, or vice versa.
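To illustrate that last concern: with a naive random split, copies of the same input can end up partly in training and partly in test. A group-aware split, sketched here with scikit-learn's `GroupShuffleSplit` keyed on the source sentence, would keep all copies of an input on one side of the split, but I am not sure whether that is actually the right thing to do either:

```python
from sklearn.model_selection import GroupShuffleSplit

pairs = [
    ("Ik ben moe.", "I am tired."),
    ("Ik ben moe.", "I'm tired."),
    ("Het regent.", "It is raining."),
    ("Het regent.", "It's raining."),
    ("Hij leest een boek.", "He is reading a book."),
]

# Group by source sentence so that all duplicates of a given input
# land entirely in either the training set or the test set.
sources = [src for src, _ in pairs]
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(pairs, groups=sources))

train = [pairs[i] for i in train_idx]
test = [pairs[i] for i in test_idx]
print("train:", train)
print("test:", test)
```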