
Suppose I have a training set of (x, y) pairs, where x is the input example and y is the corresponding target, a value in {1, ..., k} (where k is the number of classes).

When calculating the likelihood of the training set, should it be calculated for the whole training set (all of the examples), that is:

L = P(y | x) = p(y1 | x1) * p(y2 | x2) * ...

Or is the likelihood computed for a specific training example (x, y)?

I'm asking because I saw these lecture notes (page 2), where the author seems to calculate L_i, i.e. the likelihood for each training example separately.
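
To make the two options concrete, here is a minimal sketch of the difference (the data and the stand-in probability function `p` are made up purely for illustration):

```python
import numpy as np

# Stand-in for the model's probability p(y | x); purely illustrative.
def p(y, x):
    return 0.9 if y == int(x > 0) else 0.1

xs = np.array([-1.2, 0.5, 2.0])   # made-up inputs
ys = np.array([0, 1, 1])          # made-up class labels

# Option 1: likelihood of the whole training set (product over all examples)
L_full = np.prod([p(y, x) for x, y in zip(xs, ys)])

# Option 2: likelihood of a single training example i
i = 1
L_i = p(ys[i], xs[i])

print(L_full, L_i)
```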


1 Answer


The likelihood function describes the probability of generating a set of training data given some parameters, and it can be used to find the parameters which generate the training data with maximum probability. You can create the likelihood function for a subset of the training data, but that wouldn't represent the likelihood of the whole data set.

What you can do, however (and what is apparently silently done in the lecture notes), is to assume that your data are independent and identically distributed (iid). Based on the independence assumption, you can split the joint probability into smaller pieces, i.e. p(x|theta) = p(x1|theta) * p(x2|theta) * ..., and based on the identical-distribution assumption, you can use the same function with the same parameters theta for each of these pieces, e.g. a normal distribution. You can then take the logarithm to turn the product into a sum, i.e. log p(x|theta) = log p(x1|theta) + log p(x2|theta) + .... That function can be maximized by setting its derivative to zero. The resulting maximum is the theta which generates your x with maximum probability, i.e. your maximum likelihood estimator.
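
To illustrate this, here is a minimal sketch under the iid assumption, using a univariate normal with known variance as the example distribution (the data and all names are made up; for the normal mean, setting the derivative to zero has the closed-form solution mu_hat = mean(x)):

```python
import numpy as np

# Made-up data, assumed to be iid draws from N(mu, sigma^2) with known sigma.
x = np.array([1.9, 2.4, 2.1, 1.7, 2.3])
sigma = 0.5

def log_likelihood(mu):
    # Independence turns the joint density into a product; the log turns
    # that product into a sum: log p(x|mu) = sum_i log p(x_i|mu).
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Setting the derivative with respect to mu to zero gives mu_hat = mean(x),
# the maximum likelihood estimator for this model.
mu_hat = x.mean()

# Sanity check: the log-likelihood at mu_hat is not lower than at nearby values.
print(log_likelihood(mu_hat), log_likelihood(mu_hat - 0.1), log_likelihood(mu_hat + 0.1))
```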

  • Thanks @user3760780. In the lecture notes, the `pi` (the multiplication) runs from 1 to _k_ (the number of classes), therefore it does _not_ run on the whole data set; the likelihood seems to be calculated separately for every training example (denoted by L_i). My question is why he did it that way. – Cheshie Jun 04 '15 at 16:25
  • I think you are talking about `Li(w1, ..., wk) = log prod[k=1 to K](p(k|xi)^yik)`. In that, `p(k|xi)` should compute the probability of generating label `k` for example `xi`. As `yik` is `0` for every wrong label and `1` for the correct one, you get `p(k|xi)^1` for the correct label and just `1` (no change) for all others. So `Li` contains the probability of the correct label for example `i`, given `xi` and the weights `w`. Afterwards, the weights are updated based on that single example, which is standard [stochastic gradient descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent). – aleju Jun 04 '15 at 17:38
  • Thanks @user3760780, that's exactly the part I'm asking about. If I understand you correctly, you say that when applying stochastic gradient descent, you use the likelihood for only _a single_ training example, and not the whole training set? – Cheshie Jun 04 '15 at 21:25
  • Yes, you would make one step of gradient descent based on the gradient of the likelihood of one example or a small batch of examples (e.g. 128 examples). – aleju Jun 04 '15 at 22:18
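
To make the per-example update described in these comments concrete, here is a rough sketch of one stochastic gradient step on the log-likelihood `L_i` of a single example for a softmax classifier (the function names, shapes, and learning rate are made up for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sgd_step(W, x_i, y_i, lr=0.1):
    """One gradient ascent step on L_i = log p(y_i | x_i) for a single example.

    W   : (k, d) weight matrix, one row per class
    x_i : (d,)   feature vector of example i
    y_i : int    index of the correct class
    """
    probs = softmax(W @ x_i)                 # p(k | x_i) for every class k
    one_hot = np.zeros(len(probs))
    one_hot[y_i] = 1.0
    grad = np.outer(one_hot - probs, x_i)    # d L_i / d W for softmax regression
    return W + lr * grad                     # ascend to increase the likelihood

# Toy usage with made-up sizes: 3 classes, 4 features.
rng = np.random.default_rng(0)
W = np.zeros((3, 4))
x_i, y_i = rng.normal(size=4), 2
W = sgd_step(W, x_i, y_i)
```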