I've watched the Andrew Ng videos over and over, and I still don't understand how to apply gradient descent to my problem.
He deals almost exclusively in high-level conceptual explanations, but what I need are ground-level, tactical insights.
My inputs are feature vectors of the form:
Example:
Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]
Dictionary is:
["I", "am", "awesome", "great"]
So the documents as a vector would look like:
Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
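For reference, this is how I'm building those count vectors (a minimal sketch; the function names are just mine):

```python
# Build a dictionary from the documents, then turn each
# document into a vector of word counts (bag of words).
def build_dictionary(docs):
    dictionary = []
    for doc in docs:
        for word in doc:
            if word not in dictionary:
                dictionary.append(word)
    return dictionary

def vectorize(doc, dictionary):
    # Count how many times each dictionary word occurs in the document.
    return [doc.count(word) for word in dictionary]

docs = [["I", "am", "awesome"], ["I", "am", "great", "great"]]
dictionary = build_dictionary(docs)
print(dictionary)                      # ['I', 'am', 'awesome', 'great']
print(vectorize(docs[0], dictionary))  # [1, 1, 1, 0]
print(vectorize(docs[1], dictionary))  # [1, 1, 0, 2]
```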
According to what I've seen, the algorithm for gradient descent looks like this (repeat until convergence, updating j = 0 and j = 1 simultaneously):

    Θj := Θj - α * (1/m) * Σ_{i=1..m} (hΘ(x(i)) - y(i)) * xj(i)
It is my current understanding that α is the learning rate and x(i) is a feature; in the example above, for Document 2, x(3) = 2. y(i) is the label; in my case I'm trying to predict the document associated with a particular feature vector, so for instance y(0) would be associated with Document 1 and y(1) would represent Document 2.
There will potentially be many documents, let's say 10, so I could have 5 documents associated with y(0) and 5 documents associated with y(1); in that case m = 10.
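To show where I'm at, here's my attempt at translating that update rule into plain Python, assuming the single-feature hypothesis hΘ(x) = Θ0 + Θ1·x from the lectures (the toy data below is made up by me, it's not from my document problem):

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x,
# the single-feature case from the lectures.
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for each training example i.
        errors = [theta0 + theta1 * xs[i] - ys[i] for i in range(m)]
        # Gradient of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(errors[i] * xs[i] for i in range(m)) / m
        # Simultaneous update of both parameters.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data where y = 2x exactly, so theta1 should
# approach 2.0 and theta0 should approach 0.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
theta0, theta1 = gradient_descent(xs, ys, alpha=0.05, iterations=5000)
```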
The first thing I don't really understand is: what is the role of Θ0 and Θ1?
I suppose they are weight values, as in the perceptron algorithm: I apply them to the value of a feature in an effort to coax that feature, regardless of its inherent value, to output the value of the label it is associated with. So I've been equating the Θ values with perceptron weights; is that accurate?
Moreover, I don't understand what we're taking the gradient of. I really don't want another high-level explanation about walking on hills and the like; practically speaking, for the situation I've detailed above, what are we taking the gradient of? The weights in two successive iterations? The value of a feature and its true label?
Thank you for your consideration, any insight would be greatly appreciated.