I've watched the Andrew Ng videos over and over, and I still don't understand how to apply gradient descent to my problem.
He deals almost exclusively in high-level conceptual explanations, but what I need are ground-level, tactical insights.
My inputs are feature vectors of the form:
Example:
Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]
Dictionary is:
["I", "am", "awesome", "great"]
So the documents as a vector would look like:
Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
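For reference, this is how I'm building those count vectors (a minimal sketch; the function names are just mine):

```python
# Build a dictionary from the documents, then turn each
# document into a vector of word counts (bag of words).
def build_dictionary(docs):
    dictionary = []
    for doc in docs:
        for word in doc:
            if word not in dictionary:
                dictionary.append(word)
    return dictionary

def vectorize(doc, dictionary):
    # Count how many times each dictionary word occurs in the document.
    return [doc.count(word) for word in dictionary]

docs = [["I", "am", "awesome"], ["I", "am", "great", "great"]]
dictionary = build_dictionary(docs)
print(dictionary)                      # ['I', 'am', 'awesome', 'great']
print(vectorize(docs[0], dictionary))  # [1, 1, 1, 0]
print(vectorize(docs[1], dictionary))  # [1, 1, 0, 2]
```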
According to what I've seen, the algorithm for gradient descent looks like this (repeat until convergence, updating j = 0 and j = 1 simultaneously):

    Θj := Θj - α * (1/m) * Σ_{i=1..m} (hΘ(x(i)) - y(i)) * xj(i)
It is my current understanding that α is the learning rate and x(i) is a feature; in the example above, for Document 2, x(3) = 2. y(i) is the label; in my case I'm trying to predict the document associated with a particular feature vector, so for instance y(0) would be associated with Document 1 and y(1) would represent Document 2.
There will potentially be many documents, let's say 10, so I could have 5 documents associated with y(0) and 5 documents associated with y(1); in that case m = 10.
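To show where I'm at, here's my attempt at translating that update rule into plain Python, assuming the single-feature hypothesis hΘ(x) = Θ0 + Θ1·x from the lectures (the toy data below is made up by me, it's not from my document problem):

```python
# Batch gradient descent for h(x) = theta0 + theta1 * x,
# the single-feature case from the lectures.
def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(iterations):
        # Prediction error for each training example i.
        errors = [theta0 + theta1 * xs[i] - ys[i] for i in range(m)]
        # Gradient of the cost with respect to each parameter.
        grad0 = sum(errors) / m
        grad1 = sum(errors[i] * xs[i] for i in range(m)) / m
        # Simultaneous update of both parameters.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

# Toy data where y = 2x exactly, so theta1 should
# approach 2.0 and theta0 should approach 0.0.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
theta0, theta1 = gradient_descent(xs, ys, alpha=0.05, iterations=5000)
```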
The first thing I don't really understand is: what is the role of Θ0 and Θ1?
I suppose they are weight values, as in the perceptron algorithm: I apply them to the value of a feature in an effort to coax that feature, regardless of its inherent value, to output the value of the label it is associated with. So I've been equating the Θ values with perceptron weights; is that accurate?
Moreover, I don't understand what we're taking the gradient of. I really don't want another high-level explanation about walking on hills and the like; practically speaking, for the situation I've detailed above, what are we taking the gradient of? The weights in two successive iterations? The value of a feature and its true label?
Thank you for your consideration, any insight would be greatly appreciated.