Questions tagged [gradient-descent]

Gradient Descent is an algorithm for finding the minimum of a function. It iteratively computes the gradient (the vector of partial derivatives) of the function and steps in the opposite direction, with a step size proportional to the gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.

Wiki:

Gradient descent is a first-order iterative optimization algorithm. It is used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost).

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
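The update rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `gradient_descent`, its parameters `learning_rate` and `steps`, and the example function f(x) = (x - 3)^2 are all assumptions chosen for the sketch.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0.

    grad is a callable returning the gradient at a point; the fixed
    learning rate and step count are illustrative assumptions.
    """
    x = x0
    for _ in range(steps):
        # Step proportional to the negative of the gradient at x.
        x = x - learning_rate * grad(x)
    return x


# Hypothetical example: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # close to 3, the minimizer of f
```

A fixed learning rate keeps the sketch short; in practice the step size often needs tuning, since a value that is too large can make the iterates diverge.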

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.
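As a rough illustration of that distinction, the sketch below fits a straight line to toy data in two ways: once with the closed-form least-squares solution and once by gradient descent on the mean squared error. For a problem this small the analytic route is available, so gradient descent is shown only for comparison. The data, function names, learning rate, and step count are all hypothetical choices made for this example.

```python
# Hypothetical toy data generated from y = 2x + 1 (not from the original text).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def fit_closed_form(xs, ys):
    """Least-squares slope and intercept computed analytically."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Same fit, found by iterating on the mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


exact = fit_closed_form(xs, ys)
approx = fit_gradient_descent(xs, ys)
print(exact, approx)  # both should be close to slope 2, intercept 1
```

Gradient descent becomes the practical choice when no such closed form exists, for example with non-linear models, or when the analytic solution is too expensive to compute.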

Gradient descent is also known as steepest descent, or the method of steepest descent.


Tag usage:

Questions with this tag should be about implementation and programming problems, not about the theoretical properties of the optimization algorithm. Consider whether your question might be better suited to Cross Validated, the StackExchange site for statistics, machine learning and data analysis.


1428 questions
5
votes
2 answers

Write Custom Python-Based Gradient Function for an Operation? (without C++ Implementation)

I'm trying to write a custom gradient function for 'my_op' which for the sake of the example contains just a call to tf.identity() (ideally, it could be any graph). import tensorflow as tf from tensorflow.python.framework import function def…
njk
  • 645
  • 2
  • 10
  • 19
5
votes
2 answers

How to implement mini-batch gradient descent in python?

I have just started to learn deep learning. I found myself stuck when it came to gradient descent. I know how to implement batch gradient descent. I also know how it works, as well as how mini-batch and stochastic gradient descent work in theory. But really…
5
votes
1 answer

Tensorflow understanding tf.train.shuffle_batch

I have a single file of training data, about 100K rows, and I'm running a straightforward tf.train.GradientDescentOptimizer on each training step. The setup is essentially taken directly from Tensorflow's MNIST example. Code reproduced below: x =…
5
votes
2 answers

Why my Gradient is wrong (Coursera, Logistic Regression, Julia)?

I'm trying to do Logistic Regression from Coursera in Julia, but it doesn't work. The Julia code to calculate the Gradient: sigmoid(z) = 1 / (1 + e ^ -z) hypotesis(theta, x) = sigmoid(scalar(theta' * x)) function gradient(theta, x, y) (m, n)…
Alexey Petrushin
  • 1,311
  • 3
  • 10
  • 24
5
votes
1 answer

How to get a gradient node with mxnet.jl and Julia?

I'm trying to replicate the following example from the mxnet main docs with mxnet.jl in Julia: A = Variable('A') B = Variable('B') C = B * A D = C + Constant(1) # get gradient node. gA, gB = D.grad(wrt=[A, B]) # compiles the gradient function. f =…
Bernhard Kausler
  • 5,119
  • 3
  • 32
  • 36
5
votes
2 answers

Gradient Descent vs Stochastic Gradient Descent algorithms

I tried to train a feedforward neural network on the MNIST Handwritten Digits dataset (which includes 60K training samples). On every epoch I iterated over all the training samples, performing backpropagation for each sample. The runtime is…
5
votes
1 answer

theano hard_sigmoid() breaks gradient descent

To highlight the issue, let's follow this tutorial. Theano has 3 ways to compute the sigmoid of a tensor, namely sigmoid, ultra_fast_sigmoid and hard_sigmoid. It seems that using the latter two breaks the gradient descent algorithm. The…
user2255757
  • 756
  • 1
  • 6
  • 24
5
votes
2 answers

Understanding softmax classifier

I am trying to understand a simple implementation of Softmax classifier from this link - CS231n - Convolutional Neural Networks for Visual Recognition. Here they implemented a simple softmax classifier. In the example of Softmax Classifier on the…
Shubhashis
  • 10,411
  • 11
  • 33
  • 48
5
votes
1 answer

Spark mllib predicting weird number or NaN

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those…
5
votes
1 answer

Neural Network Mini Batch Gradient Descent

I am working with a multi-layer neural network. I intend to do mini-batch gradient descent. Suppose I have mini-batches of 100 over 1 million data points. I don't understand the part where I have to update the weights of the whole network. When I do…
Sasha
  • 492
  • 2
  • 6
  • 21
5
votes
1 answer

Multi variable gradient descent

I am learning gradient descent for calculating coefficients. Below is what I am doing: #!/usr/bin/Python import numpy as np # m denotes the number of examples here, not the number of features def gradientDescent(x, y, theta, alpha, m,…
5
votes
2 answers

Programming Logistic regression with Stochastic gradient descent in R

I’m trying to program logistic regression with stochastic gradient descent in R. For example, I have followed Andrew Ng's example named “ex2data1.txt”. The algorithm works properly, but the theta estimates are not exactly…
user3488416
  • 51
  • 1
  • 3
5
votes
1 answer

Rescaling after feature scaling, linear regression

Seems like a basic question, but I need to use feature scaling (take each feature value, subtract the mean then divide by the standard deviation) in my implementation of linear regression with gradient descent. After I'm finished, I'd like the…
5
votes
1 answer

Multi variable gradient descent in matlab

I'm doing gradient descent in MATLAB for multiple variables, and the code is not producing the thetas I expected, which I got from the normal equation I implemented: theta = 1.0e+05 * 3.4041 1.1063 -0.0665. And with…
Pedro.Alonso
  • 1,007
  • 3
  • 20
  • 41
4
votes
1 answer

Why do we multiply learning rate by gradient accumulation steps in PyTorch?

Loss functions in pytorch use "mean" reduction. So it means that the model gradient will have roughly the same magnitude given any batch size. It makes sense that you want to scale the learning rate up when you increase batch size because your…
off99555
  • 3,797
  • 3
  • 37
  • 49