The first important step in writing a gradient descent is identifying the features and developing the formula (hypothesis) that defines the relationship between your input, which is a set of features, and the parameters (theta). This is easier to understand with a few examples.
Let's say I am a Netflix user who likes action movies. In mathematical terms, I will allocate a number to the user; that value can be anything between 0 and 1. The theta values usually remain unknown and have to be derived using a method such as alternating least squares (ALS). Action can be a feature, and it should also be given a value: I will give more weight to a movie heavy on action and less weight where the action scenes are sparse. The output is how much you like the movie: if you have a strong preference for action movies you will rate it 5, and with a weak preference the rating will be 1.
Once the features and ratings are in hand, the next step is to identify the hypothesis. The hypothesis can be a linear function or a polynomial function of the feature list. Since we have considered only one feature, a simple linear function will do.
User's likability for the movie = user's baseline parameter + user's action-movie parameter * amount of action scenes in the movie
More precisely, in mathematical terms it can be written as
Y = theta0 + theta1*x
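In code, the hypothesis is a one-liner. The theta values below are made-up numbers purely to show the formula in action; in practice they come out of training:

```java
public class HypothesisExample {
    // h(x) = theta0 + theta1 * x
    static double predictRating(double theta0, double theta1, double actionScenes) {
        return theta0 + theta1 * actionScenes;
    }

    public static void main(String[] args) {
        // Hypothetical, untrained parameter values chosen only for illustration.
        double theta0 = 0.5, theta1 = 0.75;
        System.out.println(predictRating(theta0, theta1, 4.0)); // prints 3.5
    }
}
```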
Now we know the values for Y and x; the theta0 and theta1 values are unknown. They can be derived using one of the various gradient descent methods. I will not go into the detail of how the gradient descent formula is derived from the above hypothesis; we can use the update rules below.
theta0 := theta0 - learning rate * (sum over the training set of (predicted rating - actual rating)) / total number of training examples
theta1 := theta1 - learning rate * (sum over the training set of (predicted rating - actual rating) * feature value) / total number of training examples
Note that the error term is the predicted rating minus the actual rating, and that for theta1 each error term is multiplied by its feature value before summing.
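To make the update rules concrete, here is a single update worked out on a tiny made-up dataset of three ratings (the x and y values are illustrative only):

```java
public class SingleUpdateExample {
    // One batch-gradient-descent update for the hypothesis theta0 + theta1 * x;
    // returns { newTheta0, newTheta1 }.
    static double[] update(double[] x, double[] y, double theta0, double theta1, double alpha) {
        double sum0 = 0, sum1 = 0;
        int m = x.length;
        for (int i = 0; i < m; i++) {
            double error = (theta0 + theta1 * x[i]) - y[i]; // predicted - actual
            sum0 += error;        // gradient term for theta0
            sum1 += error * x[i]; // gradient term for theta1 (scaled by the feature)
        }
        return new double[] { theta0 - alpha * sum0 / m, theta1 - alpha * sum1 / m };
    }

    public static void main(String[] args) {
        // Three made-up examples: action score x and actual rating y (here y = 1 + x).
        double[] x = { 1, 2, 3 };
        double[] y = { 2, 3, 4 };
        double[] theta = update(x, y, 0.0, 0.0, 0.1);
        // Errors are -2, -3, -4, so the first step moves theta0 to roughly 0.3
        // and theta1 to roughly 0.667.
        System.out.println(theta[0] + " " + theta[1]);
    }
}
```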
In your train method, the first step is to provide starting values for theta0 and theta1; a common convention is to start from a small value such as 0.1. The learning rate controls how fast convergence happens, i.e. how quickly you approach the final theta values.
In the second step of your training method you loop through the training set. For stochastic (mini-batch) gradient descent, you split your training dataset into multiple batches: the theta values are calculated on one batch and then passed as the initial theta values for the next batch. This method is useful when the training set is very large, say millions of records.
public Parameter train(List<UserSkuMatrix> ratings, User user) {
    double theta0 = 0.1, theta1 = 0.1;
    double tempTheta0 = 0, tempTheta1 = 0;
    for (int i = 0; i < iteration; i++) {
        if (verifyConvergence(theta0, tempTheta0)
                && verifyConvergence(theta1, tempTheta1)) {
            break;
        }
        tempTheta0 = theta0;
        tempTheta1 = theta1;
        // Compute both gradients from the same theta values, then update
        // simultaneously; updating theta0 first would skew theta1's gradient.
        double gradient0 = gradientDescent(ratings, theta0, theta1, 1);
        double gradient1 = gradientDescent(ratings, theta0, theta1, 2);
        theta0 = theta0 - learningRate * gradient0;
        theta1 = theta1 - learningRate * gradient1;
    }
    // Parameter is assumed to be a simple holder for the trained theta values.
    return new Parameter(theta0, theta1);
}

protected boolean verifyConvergence(double theta, double tempTheta) {
    // GLOBAL_MINIMUM is a small convergence tolerance, e.g. 1e-6.
    return Math.abs(theta - tempTheta) < GLOBAL_MINIMUM;
}

protected double partialDerivative(List<UserSkuMatrix> ratings, double theta0, double theta1, int factor) {
    double sum = 0.0;
    for (UserSkuMatrix d : ratings) {
        double x = d.getSku().getFeature1Value(),
               y = d.getRank(),
               x1 = d.getSku().getFeature2Value();
        Hypothesis h = new Hypothesis(new Parameter(theta0, theta1), x, x1);
        // Error term: predicted rating minus actual rating.
        double s = h.hypothesis2() - y;
        if (factor == 2) {
            s = s * x;   // partial derivative with respect to theta1
        } else if (factor == 3) {
            s = s * x1;  // would apply to a second feature's parameter
        }
        sum = sum + s;
    }
    return sum;
}

public double gradientDescent(List<UserSkuMatrix> ratings, double theta0, double theta1, int factor) {
    double m = ratings.size();
    double total = partialDerivative(ratings, theta0, theta1, factor);
    return total / m;
}
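The mini-batch variant described earlier can be sketched as below. This is a self-contained illustration, not the code above: the parallel x/y data layout, the batch size, and the helper names are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of mini-batch gradient descent for y = theta0 + theta1 * x.
// Each element of data is a { x, y } pair; batch size and epochs are illustrative.
public class MiniBatchSgd {

    // One gradient step over a single batch; returns updated { theta0, theta1 }.
    static double[] step(List<double[]> batch, double theta0, double theta1, double alpha) {
        double sum0 = 0, sum1 = 0;
        for (double[] point : batch) {
            double x = point[0], y = point[1];
            double error = (theta0 + theta1 * x) - y; // predicted - actual
            sum0 += error;
            sum1 += error * x;
        }
        double m = batch.size();
        return new double[] { theta0 - alpha * sum0 / m, theta1 - alpha * sum1 / m };
    }

    static double[] train(List<double[]> data, int batchSize, double alpha, int epochs) {
        double theta0 = 0.1, theta1 = 0.1;
        for (int e = 0; e < epochs; e++) {
            for (int start = 0; start < data.size(); start += batchSize) {
                int end = Math.min(start + batchSize, data.size());
                double[] t = step(data.subList(start, end), theta0, theta1, alpha);
                theta0 = t[0]; // thetas computed on one batch seed the next batch
                theta1 = t[1];
            }
        }
        return new double[] { theta0, theta1 };
    }

    public static void main(String[] args) {
        // Toy data generated from y = 1 + 0.5 * x, so training should
        // recover theta0 close to 1.0 and theta1 close to 0.5.
        List<double[]> data = new ArrayList<>();
        for (int x = 0; x <= 9; x++) {
            data.add(new double[] { x, 1 + 0.5 * x });
        }
        double[] theta = train(data, 3, 0.01, 5000);
        System.out.printf("theta0=%.3f theta1=%.3f%n", theta[0], theta[1]);
    }
}
```

In a real system the batches would typically be shuffled each epoch; the fixed order here keeps the sketch short.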
Once you derive theta0 and theta1, your model is ready. These values can be persisted in a file or database, and the model can then be used to predict the user's preference for a new action movie released in the future.
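One simple way to persist and reuse the trained values is a properties file. This is a sketch using java.util.Properties; the file name, key names, and the hard-coded theta values are made up for illustration:

```java
import java.io.*;
import java.util.Properties;

// Sketch: save trained parameters to a properties file and load them back
// to score an unseen movie.
public class ModelStore {

    static void save(File file, double theta0, double theta1) throws IOException {
        Properties props = new Properties();
        props.setProperty("theta0", Double.toString(theta0));
        props.setProperty("theta1", Double.toString(theta1));
        try (OutputStream out = new FileOutputStream(file)) {
            props.store(out, "trained movie-preference model");
        }
    }

    static double predict(File file, double actionScenes) throws IOException {
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(file)) {
            props.load(in);
        }
        double theta0 = Double.parseDouble(props.getProperty("theta0"));
        double theta1 = Double.parseDouble(props.getProperty("theta1"));
        return theta0 + theta1 * actionScenes; // same hypothesis as at training time
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("movie-model", ".properties");
        save(file, 0.5, 0.75);                  // pretend these came from train()
        System.out.println(predict(file, 4.0)); // prints 3.5
    }
}
```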
Apache Flink also provides a nice implementation of stochastic gradient descent: https://ci.apache.org/projects/flink/flink-docs-release-1.2/