
I was wondering whether in machine learning it is acceptable to have a dataset that may contain the same input multiple times, but each time with a different (valid!) output. For instance, in machine translation, the same input sentence could appear several times, each time paired with a different translation.

On the one hand I would say that this is definitely acceptable, because the differences in output might better model small latent features, leading to better generalisation capabilities of the model. On the other hand, I fear that having the same input multiple times would bias the model for that given input - meaning that the first layers (in a deep neural network) might be "overfitted" on this input. Specifically, this can be tricky when the same input is seen multiple times in the test set but never in the training set, or vice versa.

Bram Vanroy
  • I'm not an expert in this domain (although I have experimented with it a bit), but logic suggests it may be possible. In your translation example, the result would be that the system may learn that outputs "a" and "b" are synonyms, and even when each of them should be used (for example in text analysis). The only problem is that you will need to provide a big enough training set so that your system does not get biased or confused (I would be more afraid of confused than biased). – zozo Nov 16 '19 at 09:04
  • Intuitively, exactly the same inputs should have the same outputs unless there is something that you are not aware of or don't have access to at the moment. For example, in the translation example that you mentioned, if the same words (or sentences) have different meanings, it is usually because of the context and the hidden dependencies in the paragraph or the article. Sometimes you can create new features to capture those dependencies easily, and sometimes it takes a bit of time. – M. Esmalifalak PhD Nov 16 '19 at 11:06
  • @M.EsmalifalakPhD I wholeheartedly disagree. In translation studies it is a well established fact that one sentence can have multiple equally correct translations. – Bram Vanroy Nov 16 '19 at 12:37
  • I didn’t say incorrect output! – M. Esmalifalak PhD Nov 16 '19 at 18:30

3 Answers


In general you can do whatever works, and this "whatever works" is also the key to answering your question. The first thing you need to do is define a performance metric. If the function to be learned is defined as $X \mapsto Y$, where $X$ is the source sentence and $Y$ is the target sentence, the performance measure is a function $f\colon X \times Y \to \mathbb{R}$, which in turn can be used to define the loss function that the neural network has to optimise.

Let's assume for simplicity that you use accuracy, i.e. the fraction of perfectly matched sentences. If you have conflicting examples like $(x, y_1)$ and $(x, y_2)$, then you can no longer reach 100% accuracy, which feels weird but doesn't do any harm. The other important fact is that each sentence can, by definition, be matched correctly only once -- assuming no random component in the predictions of your NN. This means that sentences with more alternative translations are not weighted higher when building the model. The advantage is that this approach might yield slightly better generalisation. On the downside, it might cause a plateau in the loss during optimisation, which might leave the model stuck between the equally optimal choices.
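
As a toy sketch of that accuracy ceiling (the data and model below are made up for illustration): with conflicting examples $(x, y_1)$ and $(x, y_2)$, any deterministic model can match at most one of them.

# Exact-match accuracy over (source, reference) pairs; toy data for illustration.
data = [("x", "y1"), ("x", "y2"), ("z", "w")]

def accuracy(predict, pairs):
    # Fraction of pairs whose reference is reproduced exactly.
    return sum(predict(src) == ref for src, ref in pairs) / len(pairs)

# A deterministic model maps "x" to a single output, so at most one of
# ("x", "y1") and ("x", "y2") is ever matched: accuracy is capped at 2/3.
print(accuracy(lambda src: {"x": "y1", "z": "w"}[src], data))  # 0.666...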

A much cleaner approach would be to take the fact that there are alternative translations into account in the definition of your performance measure/loss. You can define the performance metric as

$$\frac{1}{|D|}\sum_{(x,\,[y_1,\dots,y_n])\in D} \mathbb{1}_{\,f(x)\in[y_1,\dots,y_n]}$$

where $\mathbb{1}$ is the indicator function and $f(x)$ is the model's prediction for $x$.

This would give a cleaner metric. Obviously, you need to adapt the above derivation to your target metric.
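
As a rough sketch of how that metric could be computed (the data layout and the `model` callable are illustrative assumptions): a prediction counts as correct if it matches any of the listed reference translations.

def multi_reference_accuracy(model, dataset):
    # dataset: list of (source, [reference_1, ..., reference_n]) pairs
    hits = sum(model(src) in references for src, references in dataset)
    return hits / len(dataset)

dataset = [("x", ["y1", "y2"]), ("z", ["w"])]
print(multi_reference_accuracy(lambda src: "y2" if src == "x" else "w", dataset))  # 1.0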

CAFEBABE

Isn't it a multi-label classification problem?

Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to

In Python this looks like the following (using scikit-learn's MultiLabelBinarizer):

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> y = [[2, 3, 4], [2], [0, 1, 3], [0, 1, 2, 3, 4], [0, 1, 2]]
>>> MultiLabelBinarizer().fit_transform(y)
array([[0, 0, 1, 1, 1],
       [0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0],
       [1, 1, 1, 1, 1],
       [1, 1, 1, 0, 0]])

At least you could start there. You could map your case of "same input multiple times, but each time with another (valid!) output" onto a multi-label approach, and then use techniques to fit your algorithm to it (a small sketch follows the list below). Some of them:

  • Transformation into binary classification problems
  • Transformation into multi-class classification problem
  • Ensemble methods
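
As a rough illustration of the first option (binary relevance, i.e. one independent binary classifier per label), here is a minimal scikit-learn sketch; the toy feature vectors, labels, and choice of base classifier are just assumptions for illustration:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

X = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]       # toy feature vectors
y = [[0], [1], [0, 1], [0, 1]]                             # each input may carry several valid labels

Y = MultiLabelBinarizer().fit_transform(y)                 # binary indicator matrix
clf = OneVsRestClassifier(LogisticRegression()).fit(X, Y)  # one binary problem per label
print(clf.predict([[0.9, 0.9]]))                           # e.g. [[1 1]]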

Also, a nice place to look this up is this site.

PV8
  • When you work with discrete values, this is indeed a multi-label classification problem. However, in regression or more advanced problems like sequence evaluation, I imagine that things are more complex. – Bram Vanroy Apr 07 '21 at 17:31

Yes, it is acceptable to have the same input with different but equally valid outputs. In fact, a neural network will fit very well in this case, and there is no reason for it to fail on confusing (ambiguous) data. Neural nets work by producing a non-linear function by composing linear functions with a non-linear activation function. Each new variable is created by composing the non-linear activation function with an arbitrary linear combination of the previous variables.

The loss function to be minimised is $L_\Theta = \sum_i (y_i - F_\Theta(x_i))^2$. One easy way to minimise it is to find a local minimum of $L_\Theta$ as a function of $\Theta = (\theta_1, \dots, \theta_M)$. When this loss function is subjected to ambiguous data such as $(x_1, x_2, \dots, x_N, y_1)$ and $(x_1, x_2, \dots, x_N, y_2)$ with $y_1 \neq y_2$, the neural network will predict the average of $y_1$ and $y_2$.

Imagine a model trained on millions of data points, including, say, $(0, 0)$, and suppose the model has intercept zero. Now add the point $(0, 100)$ to the training set. The mean target for input $0$ in the new data set will still be approximately zero, so the model will still approximately predict $0 \to 0$.
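
A small numerical illustration of this averaging effect (under squared-error loss, with made-up numbers): the best single prediction for an input that appears with two different targets is their mean.

import numpy as np

# For one input seen with targets y1 and y2, the squared-error loss
# (y1 - c)^2 + (y2 - c)^2 is minimised by the constant prediction c = (y1 + y2) / 2.
y1, y2 = 0.0, 100.0
cs = np.linspace(-50.0, 150.0, 2001)          # candidate constant predictions
loss = (y1 - cs) ** 2 + (y2 - cs) ** 2
print(cs[np.argmin(loss)])                    # 50.0, i.e. the mean of the two targets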