
I want to compute the log-likelihood of a logistic regression model.

import numpy as np

def sigma(x):
    return 1 / (1 + np.exp(-x))

def logll(y, X, w):
    """
    Parameters
    y : ndarray of shape (N,)
        Binary labels (either 0 or 1).
    X : ndarray of shape (N,D)
        Design matrix.
    w : ndarray of shape (D,)
        Weight vector.
    """
    p = sigma(X @ w)
    y_1 = y @ np.log(p)
    y_0 = (1 - y) @ (1 - np.log(1 - p))
    return y_1 + y_0

logll(y, Xz, np.linspace(-5, 5, D))

Applying this function results in

/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:16: RuntimeWarning: divide by zero encountered in log
  app.launch_new_instance()

I would expect y_0 to be a negative float. How can I avoid this warning, and is there a bug somewhere in the code?

Edit 1

X @ w statistics:

    Max: 550.775133944
    Min: -141.972597608
    sigma(Max): 1.0                  => leads to np.log(1 - 1.0) = np.log(0) in y_0
    sigma(Min): 2.19828642169e-62
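
These numbers come from a quick check roughly like the following sketch (same Xz, D and sigma as above; z is just a local shorthand for X @ w):

z = Xz @ np.linspace(-5, 5, D)
print(z.max(), z.min())                # 550.775133944  -141.972597608
print(sigma(z.max()), sigma(z.min()))  # 1.0  2.19828642169e-62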

Edit 2

I also have access to this logsigma function that computes sigma in log space:

def logsigma(x):
    return np.vectorize(np.log)(sigma(x))

Unfortunately, I can't find a way to rewrite y_0 with it. The following is my attempt, but it is obviously not correct.

def l(y, X, w):
    y_1 = np.dot(y, logsigma(X @ w))
    y_0 = (1 - y) @ (1 - np.log(1 - logsigma(X @ w)))
    return y_1 + y_0
Steffen Schmitz

1 Answer


First of all, I think you've made a mistake in your log-likelihood formula: it should be a plain sum of y_0 and y_1, not a sum of exponentials:

    log L(w) = sum_i [ y_i * log(p_i) + (1 - y_i) * log(1 - p_i) ],   where p_i = sigma(x_i @ w)

Division by zero can be caused by large negative values (large in absolute value) in X @ w: e.g. sigma(-800) is exactly 0.0 on my machine, so the log of it results in "RuntimeWarning: divide by zero encountered in log".
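
For illustration, a minimal sketch using the sigma defined in the question (p is just a local name):

p = sigma(np.array([-800.0, 800.0]))   # np.exp may additionally warn about overflow here
print(p)                               # [0. 1.] -- sigma saturates in float64
print(np.log(p[0]))                    # -inf, "divide by zero encountered in log"
print(np.log(1 - p[1]))                # -inf, same warning from the positive extreme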

Make sure you initialize your network with small values near zero and you don't have exploding gradients after several iterations of backprop.

By the way, here's the code I use for cross-entropy loss, which also works for multi-class problems:

def softmax_loss(x, y):
  """
  - x: Input data, of shape (N, C) where x[i, j] is the score for the jth class
    for the ith input.
  - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
    0 <= y[i] < C

  Returns the mean cross-entropy loss over the N inputs.
  """
  # Subtract the per-row max before exponentiating for numerical stability.
  probs = np.exp(x - np.max(x, axis=1, keepdims=True))
  probs /= np.sum(probs, axis=1, keepdims=True)
  N = x.shape[0]
  return -np.sum(np.log(probs[np.arange(N), y])) / N
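
A minimal usage sketch (the scores and labels below are made-up values, just to show the expected shapes):

scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])   # N = 2 inputs, C = 3 classes
labels = np.array([0, 1])              # correct class index for each input
print(softmax_loss(scores, labels))    # mean cross-entropy over the batch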

UPD: When nothing else helps, there is one more numerical trick (discussed in the comments): compute log(p+epsilon) and log(1-p+epsilon) with a small positive epsilon value. This ensures that log(0.0) never happens.
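
Applied to the binary case from the question, that trick could look like the following sketch (logll_eps is just an illustrative name; np.clip keeps p away from both 0 and 1, and the two terms are summed directly as in the formula above):

def logll_eps(y, X, w, eps=1e-9):
    # Clip probabilities away from exactly 0 and 1 so both logs stay finite.
    p = np.clip(sigma(X @ w), eps, 1 - eps)
    return y @ np.log(p) + (1 - y) @ np.log(1 - p)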

Maxim
  • I think the problem lies in the other direction. The value of sigma(max(X @ w)) is one and leads to np.log(0) (See edit). Is there any way I can catch this in my sigma function? – Steffen Schmitz Oct 02 '17 at 17:59
  • Right, it can be a large positive value as well. Of course, you can apply numerical tricks, e.g. add a small `epsilon` like this `log(p+epsilon)`, to make sure it's greater than zero. But in almost all cases the neural net should learn small numbers and no tricks are necessary. Do you use a regularizer? – Maxim Oct 02 '17 at 18:06
  • No, I only use the sklearn.preprocessing.StandardScaler to prepare X. It is an academic setting so using a regularizer is out of scope. Is it possible to do the computation in log space to avoid this? – Steffen Schmitz Oct 02 '17 at 18:14
  • Actually, a modern way to add regularization is through dropout or batchnorm, if it's acceptable in your setting. If not, how would you fight overfitting? You have input features that are much bigger than other features; log-scaling them will of course help. – Maxim Oct 02 '17 at 18:46
  • I edited my question with an approach on how to do this in log-scale. Unfortunately, I'm not able to transform y_0 correctly. Could you have a look where my mistake is? We have to avoid regularization in this exercise. – Steffen Schmitz Oct 02 '17 at 19:19
  • That's not what I meant when I said do log-scaling. Your method was correct (with the additions I mentioned), and turning sigmoid into log-sigmoid just makes it compute something different, other than cross-entropy. The problem is with `X` or `w` or both. Can you do some stat check on `X`? Can you normalize certain features in it? – Maxim Oct 02 '17 at 19:27
  • Normalization is not an option. I will check with my colleagues how they approached it and come back to you. Thank you for your help so far! – Steffen Schmitz Oct 02 '17 at 21:38
  • I chose to replace values that are equal to 1.0 with 0.99999 before I subtract them from one to avoid calculating log(0). This comes close to your recommendation of calculating log(p + epsilon). Could you include this in your answer? I will accept afterwards. Thank you for your help. – Steffen Schmitz Oct 04 '17 at 20:37
  • Updated my answer. Glad it helped – Maxim Oct 04 '17 at 20:45