Questions tagged [gradient-descent]

Gradient Descent is an algorithm for finding the minimum of a function. It iteratively calculates the partial derivatives (gradient) of the function and descends in steps proportional to the negative of that gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.

Wiki:

Gradient descent is a first-order iterative optimization algorithm. It is used to find the values of the parameters (coefficients) of a function f that minimize a cost function.

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

Gradient descent is also known as steepest descent, or the method of steepest descent.
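The update rule described above (steps proportional to the negative of the gradient) can be sketched in a few lines of Python. This is a minimal illustration, not code from any question below; the quadratic objective, starting point, learning rate, and step count are all assumed for the example:

```python
# Minimal gradient descent sketch on f(x) = (x - 3)^2,
# whose gradient is f'(x) = 2 * (x - 3).

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step proportional to the negative of the gradient."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)
    return x

grad_f = lambda x: 2 * (x - 3)            # gradient of (x - 3)^2
x_min = gradient_descent(grad_f, x0=0.0)
print(round(x_min, 4))                     # converges toward the minimum at x = 3
```

Choosing the learning rate matters: too large and the iterates diverge, too small and convergence is slow — which is exactly the subject of several questions under this tag.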


Tag usage:

Questions should be about implementation and programming problems, not about the theoretical properties of the optimization algorithm. Consider whether your question might be better suited to Cross Validated, the Stack Exchange site for statistics, machine learning and data analysis.


1428 questions
15
votes
2 answers

Vectorization of a gradient descent code

I am implementing batch gradient descent in MATLAB. I have a problem with the update step of theta. theta is a vector of two components (two rows). X is a matrix containing m rows (number of training samples) and n=2 columns (number of…
bigTree
  • 2,103
  • 6
  • 29
  • 45
15
votes
2 answers

Gradient descent convergence: How to decide convergence?

I learnt gradient descent through online resources (namely Machine Learning on Coursera). However, the information provided only said to repeat gradient descent until it converges. Their definition of convergence was to use a graph of the cost…
Terence Chow
  • 10,755
  • 24
  • 78
  • 141
13
votes
1 answer

Suboptimal convergence in PyTorch compared to TensorFlow when using Adam optimizer

My program for training a model in PyTorch converges worse than the TensorFlow implementation. When I switch to SGD instead of Adam, the losses are identical. With Adam, the losses are different starting at the very first epoch. I believe I'm using…
FFT
  • 929
  • 8
  • 17
13
votes
2 answers

Why do too many epochs cause overfitting?

I am reading the Deep Learning with Python book. After reading chapter 4, Fighting Overfitting, I have two questions. Why might increasing the number of epochs cause overfitting? I know increasing the number of epochs will involve…
NingLee
  • 1,477
  • 2
  • 17
  • 26
13
votes
1 answer

Using R for multi-class logistic regression

Short format: How to implement multi-class logistic regression classification algorithms via gradient descent in R? Can optim() be used when there are more than two labels? The MATLAB code is: function [J, grad] = cost(theta, X, y, lambda) m =…
Antoni Parellada
  • 4,253
  • 6
  • 49
  • 114
13
votes
5 answers

What are alternatives of Gradient Descent?

Gradient descent has a problem of local minima. We would need to run gradient descent exponentially many times to find the global minimum. Can anybody tell me about alternatives to gradient descent, with their pros and cons? Thanks.
12
votes
2 answers

Difference between autograd.grad and autograd.backward?

Suppose I have my custom loss function and I want to fit the solution of some differential equation with help of my neural network. So in each forward pass, I am calculating the output of my neural net and then calculating the loss by taking the MSE…
12
votes
1 answer

Full gradient descent in Keras

I am trying to implement full gradient descent in Keras. This means that for each epoch I am training on the entire dataset. This is why the batch size is defined to be the size of the training set. from keras.models import Sequential from…
user552231
  • 1,095
  • 3
  • 21
  • 40
12
votes
1 answer

What's the triplet loss back propagation gradient formula?

I am trying to use Caffe to implement the triplet loss described in Schroff, Kalenichenko and Philbin, "FaceNet: A Unified Embedding for Face Recognition and Clustering", 2015. I am new to this, so how do I calculate the gradient in back propagation?
12
votes
3 answers

Gradient descent in Java

I've recently started the AI class on Coursera and I have a question related to my implementation of the gradient descent algorithm. Here's my current implementation (I actually just "translated" the mathematical expressions into Java code): …
Bastian
  • 1,553
  • 13
  • 33
11
votes
4 answers

Are there alternatives to backpropagation?

I know a neural network can be trained using gradient descent and I understand how it works. Recently, I stumbled upon other training algorithms: conjugate gradient and quasi-Newton algorithms. I tried to understand how they work but the only good…
Nope
  • 153
  • 1
  • 6
11
votes
3 answers

How to get around an in-place operation error when indexing a leaf variable for a gradient update?

I am encountering an in-place operation error when trying to index a leaf variable to update gradients with a customized Shrink function. I cannot work around it. Any help is highly appreciated! import torch.nn as nn import torch import numpy as…
W.S.
  • 647
  • 1
  • 6
  • 19
11
votes
2 answers

How to determine the learning rate and the variance in a gradient descent algorithm?

I started to learn machine learning last week. When I wanted to write a gradient descent script to estimate the model parameters, I came across a problem: how to choose an appropriate learning rate and variance. I found that different (learning…
zhoufanking
  • 133
  • 1
  • 1
  • 7
10
votes
1 answer

R: implementing my own gradient boosting algorithm

I am trying to write my own gradient boosting algorithm. I understand there are existing packages like gbm and xgboost, but I wanted to understand how the algorithm works by writing my own. I am using the iris data set, and my outcome is…
Adrian
  • 9,229
  • 24
  • 74
  • 132
10
votes
3 answers

Tensorflow 2.0 doesn't compute the gradient

I want to visualize the patterns that a given feature map in a CNN has learned (in this example I'm using VGG16). To do so I create a random image, feed it through the network up to the desired convolutional layer, choose the feature map and find the…
Will
  • 165
  • 1
  • 1
  • 8