
The following code

from sklearn import metrics
import numpy as np
y_true = np.array([[0.2,0.8,0],[0.9,0.05,0.05]])
y_predict = np.array([[0.5,0.5,0.0],[0.5,0.4,0.1]])
metrics.log_loss(y_true, y_predict)

produces the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-32-24beeb19448b> in <module>()
----> 1 metrics.log_loss(y_true, y_predict)

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\sklearn\metrics\classification.py in log_loss(y_true, y_pred, eps, normalize, sample_weight, labels)
   1646         lb.fit(labels)
   1647     else:
-> 1648         lb.fit(y_true)
   1649 
   1650     if len(lb.classes_) == 1:

~\AppData\Local\conda\conda\envs\tensorflow\lib\site-packages\sklearn\preprocessing\label.py in fit(self, y)
    276         self.y_type_ = type_of_target(y)
    277         if 'multioutput' in self.y_type_:
--> 278             raise ValueError("Multioutput target data is not supported with "
    279                              "label binarization")
    280         if _num_samples(y) == 0:

ValueError: Multioutput target data is not supported with label binarization

I am curious why. I have re-read the definition of log loss and cannot find anything that would make this computation invalid.

user1700890
  • In scikit, log_loss is defined only for classification tasks, as documented here: http://scikit-learn.org/stable/modules/classes.html#classification-metrics – Vivek Kumar Jan 30 '18 at 02:19
  • @VivekKumar, thank you Vivek. Did you mean a binary classification task? The problem I stated is still classification, just not binary. – user1700890 Jan 30 '18 at 15:47
  • I have added my interpretation of your question as an answer. Please go through it and tell me whether that is what you needed. – Vivek Kumar Jan 30 '18 at 17:13

2 Answers


The source code indicates that metrics.log_loss does not support probabilities in y_true. It only supports binary indicator matrices of shape (n_samples, n_classes), for example [[0,0,1],[1,0,0]], or class labels of shape (n_samples,), for example [2, 0]. In the latter case the class labels are one-hot encoded to look like an indicator matrix before the log loss is calculated.
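For instance, both supported forms yield the same loss on the predictions from the question (a quick illustration; the printed values are approximate):

from sklearn import metrics

y_predict = [[0.5, 0.5, 0.0], [0.5, 0.4, 0.1]]

# class labels; log_loss one-hot encodes them against labels=[0, 1, 2]
print(metrics.log_loss([1, 0], y_predict, labels=[0, 1, 2]))    # ~0.6931

# the same targets passed directly as a binary indicator matrix
print(metrics.log_loss([[0, 1, 0], [1, 0, 0]], y_predict))      # ~0.6931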

In this block:

lb = LabelBinarizer()

if labels is not None:
    lb.fit(labels)
else:
    lb.fit(y_true)

You are reaching lb.fit(y_true), which fails if y_true contains anything other than 0s and 1s. For example:

>>> import numpy as np
>>> from sklearn import preprocessing

>>> lb = preprocessing.LabelBinarizer()

>>> lb.fit(np.array([[0,1,0],[1,0,0]]))

LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)

>>> lb.fit(np.array([[0.2,0.8,0],[0.9,0.05,0.05]]))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/imran/.pyenv/versions/anaconda3-4.4.0/lib/python3.6/site-packages/sklearn/preprocessing/label.py", line 278, in fit
    raise ValueError("Multioutput target data is not supported with "
ValueError: Multioutput target data is not supported with label binarization

I would suggest defining your own log loss function:

import numpy as np

def logloss(y_true, y_pred, eps=1e-15):
    # clip predictions away from 0 and 1 so np.log never sees an exact 0
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # cross-entropy per sample, then average over samples
    return -(y_true * np.log(y_pred)).sum(axis=1).mean()

Here is the output on your data:

>>> logloss(y_true, y_predict)
0.738961717153653
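
As a sanity check, it agrees with metrics.log_loss whenever y_true is a valid indicator matrix (a quick test; values approximate):

from sklearn import metrics

y_onehot = np.array([[0, 1, 0], [1, 0, 0]])   # the same targets as an indicator matrix
print(logloss(y_onehot, y_predict))           # ~0.6931
print(metrics.log_loss(y_onehot, y_predict))  # ~0.6931, identical up to eps clipping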
Imran
  • No, you are wrong. It doesn't require `y_true` to be binary. `y_true` can be of multi-class type. The source code indicates that it will convert the multi-class `y` to a binary label indicator. – Vivek Kumar Jan 30 '18 at 17:17
  • `y_true` can have multiple `1`'s, but it cannot have values that are not `0` or `1`. – Imran Jan 30 '18 at 17:34
  • There are many times you might want to compute log loss against true labels that are not all `0` or `1`, for example when predicting the underlying probabilities of an inherently stochastic process, or for partial class membership, e.g. this dog is 75% retriever and 25% husky. – Imran Jan 30 '18 at 17:47
  • I think there's still a little confusion between labels and class probabilities in the example you provided. The labels can be 0, 1, 2, but these will be one-hot encoded, so this isn't relevant to log loss supporting non-binary values. – Imran Jan 30 '18 at 17:52
  • Again, in my example they are not one-hot encoded. I am not arguing that my answer is more correct than yours; I am just pointing out that what you said in your answer was untrue. That's all. – Vivek Kumar Jan 30 '18 at 17:58
  • Your labels will absolutely be one-hot encoded. Look at the source code for `metrics.log_loss` specifically where it calls `lb.fit` and `lb.transform` and look at the docs for [LabelBinarizer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) – Imran Jan 31 '18 at 00:17
  • OK, I rephrased my first paragraph to hopefully be more clear. – Imran Jan 31 '18 at 01:10

No, I am not talking about binary classification.

The y_true and y_predict you showed above will not be treated as classification targets unless you explicitly specify them as such.

First, since they are probabilities, they can take any continuous value, so the target is detected as a regression target in scikit-learn.

Second, each element inside y_true or y_pred is a list of probabilities, which is detected as multi-output. Hence the "Multioutput target" error.
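
You can see this detection yourself with type_of_target, the same check that LabelBinarizer.fit runs in the traceback above:

import numpy as np
from sklearn.utils.multiclass import type_of_target

y_true = np.array([[0.2, 0.8, 0], [0.9, 0.05, 0.05]])
print(type_of_target(y_true))   # continuous-multioutput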

You need to supply actual labels, not probabilities, as y_true (the ground truth) for log_loss. Why do you have probabilities there, by the way? Probabilities make sense for predicted data, but why for the actual data?

To fix this, first convert the probabilities in y_true into labels by taking the class with the highest probability as the winner.

This can be done with numpy.argmax:

import numpy as np
y_true = np.argmax(y_true, axis=1)

print(y_true)
Output:-  [1 0]
# We will not do the above for y_predict, because probabilities are allowed in it.

# We will use the labels param to declare that we actually have 3 classes,
# as evident from your probabilities.
metrics.log_loss(y_true, y_predict, labels=[0,1,2])

Output:-  0.6931471805599458
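
As a quick check, with y_true = [1, 0] the winning-class probabilities are y_predict[0][1] = 0.5 and y_predict[1][0] = 0.5, so the loss is -(log(0.5) + log(0.5))/2 = log(2) ≈ 0.6931, matching the output above.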

As discussed with @Imran, here's an example where y_true has values other than 0 or 1.

The example below simply checks whether other label values are allowed:

y_true = np.array([0, 1, 2])
y_pred = np.array([[0.5,0.5,0.0],[0.5,0.4,0.1], [0.4,0.1,0.5]])
metrics.log_loss(y_true, y_pred)

Output:-  ~0.7675   (No error)
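
As a hand check, the true classes 0, 1 and 2 pick probabilities 0.5, 0.4 and 0.5 from the rows of y_pred, so the loss is -(log(0.5) + log(0.4) + log(0.5))/3 ≈ 0.7675, in line with the output above.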
Vivek Kumar
  • Nice use of `argmax`, but one certainly might want fractional values in `y_true` when computing log loss, for example predicting the underlying probabilities of an inherently stochastic process or for partial class membership. It's not clear what @user1700890 wants for his specific case, but my answer should address the general problem for anyone else who finds it. – Imran Jan 30 '18 at 17:46
  • @Imran One may want "fractional values in `y_true`", but that's not supported in log_loss, and in your answer you have shown a good alternative. I am just describing the proper use of log_loss as it's implemented in scikit-learn. – Vivek Kumar Jan 30 '18 at 17:59
  • @VivekKumar, you are correct. It strictly targets 0 and 1 outcomes. This is a somewhat narrow implementation of cross-entropy. – user1700890 Jan 30 '18 at 18:36