The first important step in writing a gradient descent is identifying the features and developing the formula (hypothesis) that defines the relationship between your input, which is a set of features, and the parameters (theta). This is easier to understand with a few examples.
Let's say I am a Netflix user who likes action movies. In mathematical terms, I will allocate a number to the user; that value can be anything between 0 and 1. The theta values usually remain unknown and have to be derived using a method such as alternating least squares (ALS). Action can be a feature, and it should also be given a value: I will give more weight to a movie heavy on action and less weight where the action scenes are sparse. The output is how much you like the movie: if you have a strong preference for action movies you will rate it 5, and with a weak preference the rating will be 1.
Once the features and ratings are in hand, the next step is to identify the hypothesis. The hypothesis can be a linear function or a polynomial function of the feature list. Since we have considered only one feature, a simple linear function will do.
User's likability for the movie = user's baseline parameter + user's action-movie parameter * amount of action scenes in the movie
More precisely, in mathematical terms it can be written as
Y = theta0 + theta1*x
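In code, the hypothesis is a one-liner. The theta values below are made-up numbers purely to show the formula in action; in practice they come out of training:

```java
public class HypothesisExample {
    // h(x) = theta0 + theta1 * x
    static double predictRating(double theta0, double theta1, double actionScenes) {
        return theta0 + theta1 * actionScenes;
    }

    public static void main(String[] args) {
        // Hypothetical, untrained parameter values chosen only for illustration.
        double theta0 = 0.5, theta1 = 0.75;
        System.out.println(predictRating(theta0, theta1, 4.0)); // prints 3.5
    }
}
```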
Now we know the values for Y and x; the theta0 and theta1 values are unknown. They can be derived using one of the various gradient descent methods. I will not go into the detail of how the gradient descent formula is derived from the above hypothesis; we can use the update rules below.
theta0 := theta0 - learning rate * (sum over the training set of (predicted rating - actual rating)) / total number of training examples
theta1 := theta1 - learning rate * (sum over the training set of (predicted rating - actual rating) * feature value) / total number of training examples
Note that the error term is the predicted rating minus the actual rating, and that for theta1 each error term is multiplied by its feature value before summing.
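To make the update rules concrete, here is a single update worked out on a tiny made-up dataset of three ratings (the x and y values are illustrative only):

```java
public class SingleUpdateExample {
    // One batch-gradient-descent update for the hypothesis theta0 + theta1 * x;
    // returns { newTheta0, newTheta1 }.
    static double[] update(double[] x, double[] y, double theta0, double theta1, double alpha) {
        double sum0 = 0, sum1 = 0;
        int m = x.length;
        for (int i = 0; i < m; i++) {
            double error = (theta0 + theta1 * x[i]) - y[i]; // predicted - actual
            sum0 += error;        // gradient term for theta0
            sum1 += error * x[i]; // gradient term for theta1 (scaled by the feature)
        }
        return new double[] { theta0 - alpha * sum0 / m, theta1 - alpha * sum1 / m };
    }

    public static void main(String[] args) {
        // Three made-up examples: action score x and actual rating y (here y = 1 + x).
        double[] x = { 1, 2, 3 };
        double[] y = { 2, 3, 4 };
        double[] theta = update(x, y, 0.0, 0.0, 0.1);
        // Errors are -2, -3, -4, so the first step moves theta0 to roughly 0.3
        // and theta1 to roughly 0.667.
        System.out.println(theta[0] + " " + theta[1]);
    }
}
```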
In your train method, the first step is to provide starting values for theta0 and theta1; a common convention is to start from a small value such as 0.1. The learning rate controls how fast convergence happens, i.e. how quickly you approach the final theta values.
In the second step of your training method you loop through the training set. For stochastic (mini-batch) gradient descent, you split your training dataset into multiple batches: the theta values are calculated on one batch and then passed as the initial theta values for the next batch. This method is useful when the training set is very large, say millions of records.
public Parameter train(List<UserSkuMatrix> ratings, User user) {
    double theta0 = 0.1, theta1 = 0.1;
    double tempTheta0 = 0, tempTheta1 = 0;
    for (int i = 0; i < iteration; i++) {
        if (verifyConvergence(theta0, tempTheta0)
                && verifyConvergence(theta1, tempTheta1)) {
            break;
        }
        tempTheta0 = theta0;
        tempTheta1 = theta1;
        // Compute both gradients from the same theta values, then update
        // simultaneously; updating theta0 first would skew theta1's gradient.
        double gradient0 = gradientDescent(ratings, theta0, theta1, 1);
        double gradient1 = gradientDescent(ratings, theta0, theta1, 2);
        theta0 = theta0 - learningRate * gradient0;
        theta1 = theta1 - learningRate * gradient1;
    }
    // Parameter is assumed to be a simple holder for the trained theta values.
    return new Parameter(theta0, theta1);
}

protected boolean verifyConvergence(double theta, double tempTheta) {
    // GLOBAL_MINIMUM is a small convergence tolerance, e.g. 1e-6.
    return Math.abs(theta - tempTheta) < GLOBAL_MINIMUM;
}

protected double partialDerivative(List<UserSkuMatrix> ratings, double theta0, double theta1, int factor) {
    double sum = 0.0;
    for (UserSkuMatrix d : ratings) {
        double x = d.getSku().getFeature1Value(),
               y = d.getRank(),
               x1 = d.getSku().getFeature2Value();
        Hypothesis h = new Hypothesis(new Parameter(theta0, theta1), x, x1);
        // Error term: predicted rating minus actual rating.
        double s = h.hypothesis2() - y;
        if (factor == 2) {
            s = s * x;   // partial derivative with respect to theta1
        } else if (factor == 3) {
            s = s * x1;  // would apply to a second feature's parameter
        }
        sum = sum + s;
    }
    return sum;
}

public double gradientDescent(List<UserSkuMatrix> ratings, double theta0, double theta1, int factor) {
    double m = ratings.size();
    double total = partialDerivative(ratings, theta0, theta1, factor);
    return total / m;
}
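The mini-batch variant described earlier can be sketched as below. This is a self-contained illustration, not the code above: the parallel x/y data layout, the batch size, and the helper names are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of mini-batch gradient descent for y = theta0 + theta1 * x.
// Each element of data is a { x, y } pair; batch size and epochs are illustrative.
public class MiniBatchSgd {

    // One gradient step over a single batch; returns updated { theta0, theta1 }.
    static double[] step(List<double[]> batch, double theta0, double theta1, double alpha) {
        double sum0 = 0, sum1 = 0;
        for (double[] point : batch) {
            double x = point[0], y = point[1];
            double error = (theta0 + theta1 * x) - y; // predicted - actual
            sum0 += error;
            sum1 += error * x;
        }
        double m = batch.size();
        return new double[] { theta0 - alpha * sum0 / m, theta1 - alpha * sum1 / m };
    }

    static double[] train(List<double[]> data, int batchSize, double alpha, int epochs) {
        double theta0 = 0.1, theta1 = 0.1;
        for (int e = 0; e < epochs; e++) {
            for (int start = 0; start < data.size(); start += batchSize) {
                int end = Math.min(start + batchSize, data.size());
                double[] t = step(data.subList(start, end), theta0, theta1, alpha);
                theta0 = t[0]; // thetas computed on one batch seed the next batch
                theta1 = t[1];
            }
        }
        return new double[] { theta0, theta1 };
    }

    public static void main(String[] args) {
        // Toy data generated from y = 1 + 0.5 * x, so training should
        // recover theta0 close to 1.0 and theta1 close to 0.5.
        List<double[]> data = new ArrayList<>();
        for (int x = 0; x <= 9; x++) {
            data.add(new double[] { x, 1 + 0.5 * x });
        }
        double[] theta = train(data, 3, 0.01, 5000);
        System.out.printf("theta0=%.3f theta1=%.3f%n", theta[0], theta[1]);
    }
}
```

In a real system the batches would typically be shuffled each epoch; the fixed order here keeps the sketch short.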
Once you derive theta0 and theta1, your model is ready. These values can be persisted in a file or database, and the model can then be used to predict the user's preference for a new action movie released in the future.
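One simple way to persist and reuse the trained values is a properties file. This is a sketch using java.util.Properties; the file name, key names, and the hard-coded theta values are made up for illustration:

```java
import java.io.*;
import java.util.Properties;

// Sketch: save trained parameters to a properties file and load them back
// to score an unseen movie.
public class ModelStore {

    static void save(File file, double theta0, double theta1) throws IOException {
        Properties props = new Properties();
        props.setProperty("theta0", Double.toString(theta0));
        props.setProperty("theta1", Double.toString(theta1));
        try (OutputStream out = new FileOutputStream(file)) {
            props.store(out, "trained movie-preference model");
        }
    }

    static double predict(File file, double actionScenes) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(file)) {
            props.load(in);
        }
        double theta0 = Double.parseDouble(props.getProperty("theta0"));
        double theta1 = Double.parseDouble(props.getProperty("theta1"));
        return theta0 + theta1 * actionScenes; // same hypothesis as at training time
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("movie-model", ".properties");
        save(file, 0.5, 0.75);                  // pretend these came from train()
        System.out.println(predict(file, 4.0)); // prints 3.5
    }
}
```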
Apache Flink also provides a nice implementation of stochastic gradient descent: https://ci.apache.org/projects/flink/flink-docs-release-1.2/