Questions tagged [gradient-descent]

Gradient descent is an algorithm for finding a minimum of a function. It iteratively computes the gradient (the vector of partial derivatives) of the function and takes steps proportional to the negative of that gradient. One major application of gradient descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.

Wiki:

Gradient descent is a first-order iterative optimization algorithm. It is used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
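The update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the fixed `learning_rate`, the fixed number of steps, and the example function f(x) = (x - 3)^2 are all assumptions chosen for the demonstration.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Minimize a function given its gradient `grad`, starting from `x0`.

    Hypothetical helper for illustration: fixed step size, fixed step count,
    no convergence check.
    """
    x = x0
    for _ in range(n_steps):
        # Step in the direction of the negative gradient,
        # scaled by the learning rate.
        x = x - learning_rate * grad(x)
    return x


# Assumed example: f(x) = (x - 3)^2 has gradient 2 * (x - 3)
# and its minimum at x = 3.
def grad_f(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # a value very close to 3
```

In practice the learning rate usually has to be tuned: too large a value can make the iterates diverge, and too small a value makes convergence slow.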

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
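To illustrate the contrast, here is a small sketch for a case where an analytical solution does exist: fitting y = a*x + b by least squares. The data, the function names `fit_closed_form` and `fit_gradient_descent`, and the hyperparameters are made up for the example. With a suitable learning rate and enough steps, both approaches should agree closely, which is why the closed form is usually preferred when it is available.

```python
def fit_closed_form(xs, ys):
    """Least-squares fit of y = a*x + b using the analytical formulas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def fit_gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Same fit, found by batch gradient descent on the mean squared error."""
    n = len(xs)
    a, b = 0.0, 0.0
    for _ in range(n_steps):
        # Gradients of mean((a*x + b - y)^2) with respect to a and b.
        grad_a = (2.0 / n) * sum((a * x + b - y) * x for x, y in zip(xs, ys))
        grad_b = (2.0 / n) * sum((a * x + b - y) for x, y in zip(xs, ys))
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Made-up data, roughly following y = 2*x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

a_cf, b_cf = fit_closed_form(xs, ys)
a_gd, b_gd = fit_gradient_descent(xs, ys)
print(a_cf, b_cf)
print(a_gd, b_gd)
```

If the two results differ noticeably in your own code, common causes are a learning rate that is too large, too few iterations, or unscaled features.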

Gradient descent is also known as steepest descent, or the method of steepest descent.


Tag usage:

Questions on [gradient-descent] should be about implementation and programming problems, not about the theoretical properties of the optimization algorithm. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.


1428 questions
6 votes, 1 answer

How to obtain the convex curve for weights vs loss in a neural network

In most of the neural network literature, the 3D plot of weights, bias and the loss function is shown as below. When I tried, I obtained a plot like this one. Here are more details. Here is a glimpse of the dataset: there are 15,000 training…
6 votes, 1 answer

Logistic Regression Gradient Descent

I have to do Logistic regression using batch gradient descent. import numpy as np X = np.asarray([ [0.50],[0.75],[1.00],[1.25],[1.50],[1.75],[1.75], [2.00],[2.25],[2.50],[2.75],[3.00],[3.25],[3.50], [4.00],[4.25],[4.50],[4.75],[5.00],[5.50]]) y =…
6 votes, 1 answer

The gradient of an output w.r.t network weights that holds another output constant

Let's assume I have a simple MLP, and I take the gradient of some loss function with respect to the output layer to get G = [0, -1] (that is, increasing the second output variable decreases the loss function). If I take the gradient of G with respect…
Robert • 1,132 • 2 • 11 • 26
6 votes, 1 answer

Steepest descent spitting out unreasonably large values

My implementation of steepest descent for solving Ax = b is showing some weird behavior: for any matrix large enough (~10 x 10, have only tested square matrices so far), the returned x contains all huge values (on the order of 1x10^10). def…
6 votes, 1 answer

The grad function in both the {pracma} and the {numDeriv} libraries of R gives erroneous results

I am interested in the 1st order numerical derivative of a self-defined function pTgh_y(q,g,h) with respect to q. For a special case, pTgh_y(q,0,0) = pnorm(q). In other words pTgh_y(q,g,h) is reduced to the CDF of the standard normal when g=h=0 (see…
Ye Tian • 353 • 1 • 2 • 17
6 votes, 1 answer

What's different about momentum gradient update in Tensorflow and Theano like this?

I'm trying to use TensorFlow with my deep learning project. Here I need to implement my gradient update using this formula: I have also implemented this part in Theano, and it produced the expected answer. But when I try to use TensorFlow's…
Peter Yang • 211 • 2 • 8
6 votes, 3 answers

Will larger batch size make computation time less in machine learning?

I am trying to tune a hyperparameter, namely the batch size, in a CNN. I have a computer with a Core i7 and 12 GB of RAM, and I am training a CNN on the CIFAR-10 dataset, which can be found in this blog. Now, at first, what I have read and learnt about batch size in…
6 votes, 4 answers

TensorFlow's ReluGrad claims input is not finite

I'm trying out TensorFlow and I'm running into a strange error. I edited the deep MNIST example to use another set of images, and the algorithm converges nicely again, until around iteration 8000 (accuracy 91% at that point) when it crashes with the…
user1111929 • 6,050 • 9 • 43 • 73
6 votes, 4 answers

Gradient descent and normal equation method for solving linear regression gives different solutions

I'm working on a machine learning problem and want to use linear regression as the learning algorithm. I have implemented 2 different methods to find the parameters theta of the linear regression model: gradient (steepest) descent and the normal equation. On the same…
Rasto • 17,204 • 47 • 154 • 245
5 votes, 1 answer

Why does Pytorch autograd need a scalar?

I am working through "Deep Learning for Coders with fastai & Pytorch". Chapter 4 introduces the autograd function from the PyTorch library on a trivial example. x = tensor([3.,4.,10.]).requires_grad_() def f(q): return sum(q**2) y =…
Mack • 53 • 3
5 votes, 1 answer

Gradient descent for ridge regression

I'm trying to write code that returns the parameters for ridge regression using gradient descent. Ridge regression is defined as Where, L is the loss (or cost) function. w are the parameters of the loss function (which assimilates b). x are the…
immb31 • 75 • 1 • 1 • 6
5 votes, 2 answers

Gradient descent using TensorFlow is much slower than a basic Python implementation, why?

I'm following a machine learning course. I have a simple linear regression (LR) problem to help me get used to TensorFlow. The LR problem is to find parameters a and b such that Y = a*X + b approximates an (x, y) point cloud (which I generated…
Stefan • 919 • 2 • 13 • 24
5 votes, 2 answers

Why is softmax classifier gradient divided by batch size (CS231n)?

Question In CS231 Computing the Analytic Gradient with Backpropagation which is first implementing a Softmax Classifier, the gradient from (softmax + log loss) is divided by the batch size (number of data being used in a cycle of forward cost…
mon • 18,789 • 22 • 112 • 205
5 votes, 1 answer

Why does the SGD loss for a dataset not match between the PyTorch code and the from-scratch Python code for linear regression?

I'm trying to implement multiple linear regression on the wine dataset. But when I compare the results of PyTorch with the from-scratch Python code, the losses do not match. My Scratch Code: Functions: def yinfer(X, beta): return beta[0] +…
5 votes, 2 answers

If we can clip gradient in WGAN, why bother with WGAN-GP?

I am working on WGAN and would like to implement WGAN-GP. In its original paper, WGAN-GP is implemented with a gradient penalty because of the 1-Lipschitz constraint. But packages out there like Keras can clip the gradient norm at 1 (which by…