
According to this video, the substantive difference between the perceptron and gradient descent algorithms is quite minor. It was specified as essentially:

Perceptron: Δw_i = η(y − ŷ)x_i

Gradient Descent: Δw_i = η(y − α)x_i
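
If I'm reading the rules right, the only difference is which output feeds the error term: the thresholded prediction ŷ for the perceptron, versus the raw, unthresholded activation α = w·x for gradient descent. Here is a minimal sketch of how I understand the two updates (the helper `dot` and the names `eta`, `theta` are mine, not from the video), though I'm not sure how this maps onto my actual code below:

    // Shared helper: w · x
    static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int i = 0; i < x.length; i++) s += w[i] * x[i];
        return s;
    }

    // Perceptron: the error uses the *thresholded* prediction ŷ = step(w · x)
    static void perceptronStep(double[] w, double[] x, double y, double eta, double theta) {
        double yHat = (dot(w, x) >= theta) ? 1.0 : 0.0;
        for (int i = 0; i < x.length; i++)
            w[i] += eta * (y - yHat) * x[i];   // Δw_i = η(y − ŷ)x_i
    }

    // Gradient descent (delta rule): the error uses the *raw* activation α = w · x
    static void gradientDescentStep(double[] w, double[] x, double y, double eta) {
        double a = dot(w, x);                  // no thresholding here
        for (int i = 0; i < x.length; i++)
            w[i] += eta * (y - a) * x[i];      // Δw_i = η(y − α)x_i
    }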

I've implemented a working version of the perceptron algorithm, but I don't understand what sections I need to change to turn it into gradient descent.

Below are the load-bearing portions of my perceptron code; I suppose these are the components I need to modify. But where? What do I need to change? I don't understand.

This is left in for pedagogical reasons; I've sort of figured this out, but I am still confused about the gradient. Please see the UPDATE below.

      iteration = 0;
      do 
      {
          iteration++;
          globalError = 0;
          //loop through all instances (complete one epoch)
          for (p = 0; p < number_of_files__train; p++) 
          {
              // calculate predicted class
              output = calculateOutput( theta, weights, feature_matrix__train, p, globo_dict_size );
              // difference between predicted and actual class values
              localError = outputs__train[p] - output;
              //update weights and bias
              for (int i = 0; i < globo_dict_size; i++) 
              {
                  weights[i] += ( LEARNING_RATE * localError * feature_matrix__train[p][i] );
              }
              weights[ globo_dict_size ] += ( LEARNING_RATE * localError );

              //summation of squared error (error value for all instances)
              globalError += (localError*localError);
          }

          /* Root Mean Squared Error */
          if (iteration < 10) 
              System.out.println("Iteration 0" + iteration + " : RMSE = " + Math.sqrt( globalError/number_of_files__train ) );
          else
              System.out.println("Iteration " + iteration + " : RMSE = " + Math.sqrt( globalError/number_of_files__train ) );
      } 
      while(globalError != 0 && iteration<=MAX_ITER);

This is the crux of my perceptron:

  static int calculateOutput( int theta, double weights[], double[][] feature_matrix, int file_index, int globo_dict_size )
  {
     //double sum = x * weights[0] + y * weights[1] + z * weights[2] + weights[3];
     double sum = 0;

     for (int i = 0; i < globo_dict_size; i++) 
     {
         sum += ( weights[i] * feature_matrix[file_index][i] );
     }
     //bias
     sum += weights[ globo_dict_size ];

     return (sum >= theta) ? 1 : 0;
  }

Is it just that I replace that calculateOutput method with something like this:

public static double [] gradientDescent(final double [] theta_in, final double alpha, final int num_iters, double[][] data ) 
{
    final double m = data.length;   
    double [] theta = theta_in;
    double theta0 = 0;
    double theta1 = 0;
    for (int i = 0; i < num_iters; i++) 
    {                        
        final double sum0 = gradientDescentSumScalar0(theta, alpha, data );
        final double sum1 = gradientDescentSumScalar1(theta, alpha, data);                                   
        theta0 = theta[0] - ( (alpha / m) * sum0 ); 
        theta1 = theta[1] - ( (alpha / m) * sum1 );                        
        theta = new double [] { theta0, theta1 };
    }
    return theta;
}

UPDATE

At this point I think I'm very close.

I understand how to calculate the hypothesis, and I think I've done that correctly, but something remains terribly wrong with this code. I'm pretty sure it has to do with my calculation of the gradient. When I run it, the error fluctuates wildly, then goes to infinity, then to NaN.

  double cost, error, hypothesis;
  double[] gradient;
  int p, iteration;

  iteration = 0;
  do 
  {
    iteration++;
    error = 0.0;
    cost = 0.0;

    //loop through all instances (complete one epoch)
    for (p = 0; p < number_of_files__train; p++) 
    {

      // 1. Calculate the hypothesis h = X * theta
      hypothesis = calculateHypothesis( theta, feature_matrix__train, p, globo_dict_size );

      // 2. Calculate the loss = h - y and maybe the squared cost (loss^2)/2m
      cost = hypothesis - outputs__train[p];

      // 3. Calculate the gradient = X' * loss / m
      gradient = calculateGradient( theta, feature_matrix__train, p, globo_dict_size, cost, number_of_files__train);

      // 4. Update the parameters theta = theta - alpha * gradient
      for (int i = 0; i < globo_dict_size; i++) 
      {
          theta[i] = theta[i] - LEARNING_RATE * gradient[i];
      }

    }

    //summation of squared error (error value for all instances)
    error += (cost*cost);       

  /* Root Mean Squared Error */
  if (iteration < 10) 
      System.out.println("Iteration 0" + iteration + " : RMSE = " + Math.sqrt(  error/number_of_files__train  ) );
  else
      System.out.println("Iteration " + iteration + " : RMSE = " + Math.sqrt( error/number_of_files__train ) );
  //System.out.println( Arrays.toString( weights ) );

  } 
  while(cost != 0 && iteration<=MAX_ITER);


}

static double calculateHypothesis( double[] theta, double[][] feature_matrix, int file_index, int globo_dict_size )
{
    double hypothesis = 0.0;

     for (int i = 0; i < globo_dict_size; i++) 
     {
         hypothesis += ( theta[i] * feature_matrix[file_index][i] );
     }
     //bias
     hypothesis += theta[ globo_dict_size ];

     return hypothesis;
}

static double[] calculateGradient( double[] theta, double[][] feature_matrix, int file_index, int globo_dict_size, double cost, int number_of_files__train)
{
    double m = number_of_files__train;

    double[] gradient = new double[ globo_dict_size];//one for bias?

    for (int i = 0; i < gradient.length; i++) 
    {
        gradient[i] = (1.0/m) * cost * feature_matrix[ file_index ][ i ] ;
    }

    return gradient;
}
smatthewenglish
  • In your updated version, where you say it fluctuates a lot, have you tried decreasing the learning rate? Gradient descent can be very unstable with too high a learning rate. – Acrofales Mar 09 '15 at 19:51
  • @Acrofales I guess that's part of it but not all, what do you think about [this](http://stackoverflow.com/questions/28988732/correct-implementation-of-hinge-loss-minimization-for-gradient-descent) – smatthewenglish Mar 11 '15 at 14:05

1 Answer


The perceptron rule is just an approximation to gradient descent for non-differentiable activation functions like `(sum >= theta) ? 1 : 0`. As they ask at the end of the video, you cannot use gradients there, because this threshold function isn't differentiable (well, its gradient is undefined at x = 0 and zero everywhere else). If, instead of this thresholding, you had a smooth function like the sigmoid, you could calculate actual gradients.

In that case your weight update would be `LEARNING_RATE * localError * feature_matrix__train[p][i] * output_gradient[i]`. For the case of the sigmoid, the link I sent you also shows how to calculate the `output_gradient`.
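
For reference, a standard logistic sigmoid and its derivative in Java might look like this (a sketch; the function names are mine):

    // Smooth, differentiable replacement for the step function
    static double sigmoid(double sum) {
        return 1.0 / (1.0 + Math.exp(-sum));
    }

    // d(sigmoid)/d(sum) = sigmoid(sum) * (1 - sigmoid(sum))
    static double sigmoidGradient(double sum) {
        double s = sigmoid(sum);
        return s * (1.0 - s);
    }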

In summary, to change from the perceptron to gradient descent you have to:

  1. Use an activation function whose derivative (gradient) is not zero everywhere.
  2. Apply the chain rule to define the new update rule (see the sketch below).
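
Applied to your training loop, the change might look roughly like this (a sketch, assuming a hypothetical helper `calculateSum` that returns the raw `w · x + bias` your `calculateOutput` computes before thresholding):

    // inside the epoch loop, for instance p
    double sum = calculateSum(weights, feature_matrix__train, p, globo_dict_size); // raw activation
    double output = sigmoid(sum);                       // instead of (sum >= theta) ? 1 : 0
    double localError = outputs__train[p] - output;
    double outputGradient = sigmoidGradient(sum);       // chain-rule factor dσ/dsum

    for (int i = 0; i < globo_dict_size; i++)
        weights[i] += LEARNING_RATE * localError * outputGradient * feature_matrix__train[p][i];
    weights[globo_dict_size] += LEARNING_RATE * localError * outputGradient;  // bias input is 1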
ejr
  • so replace `(sum >= theta) ? 1 : 0` with a sigmoid function [**like this one**](http://stackoverflow.com/questions/2887815/speeding-up-math-calculations-in-java) and replace `weights[i] += ( LEARNING_RATE * localError * feature_matrix__train[p][i] );` with that update rule you specified, i.e. `LEARNING_RATE * localError * feature_matrix__train[p][i] * output_gradient[i]`, and that's it? I'm pretty daft, could you please be more explicit about the calculation of the output gradient? – smatthewenglish Mar 08 '15 at 04:24
  • I just posted an update toward the end of the question, is that on the right track? – smatthewenglish Mar 08 '15 at 04:31
  • You can get `output_gradient` with a function similar to `calculateOutput`, but this time you have to `return sigmoid(sum) * (1-sigmoid(sum))`, which is the gradient of `sigmoid` with respect to `sum`. So I think you should just enhance your `calculateOutput` to return both. Also, make sure you get the idea behind [gradient descent](https://theclevermachine.wordpress.com/2014/09/06/derivation-error-backpropagation-gradient-descent-for-neural-networks/). It only gets more fun from that point on. I hope that works. – ejr Mar 08 '15 at 19:29
  • how about [this](http://stackoverflow.com/questions/28988732/correct-implementation-of-hinge-loss-minimization-for-gradient-descent)? – smatthewenglish Mar 11 '15 at 14:06