
I am building an MLP in Python with sklearn.neural_network's MLPRegressor.

I have a grid search:

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

param_grid = {'hidden_layer_sizes': [(100, 100), (50, 50, 50), (100,)],
              ....
              'solver': ['adam', 'sgd']}

# the estimator must be an instance, and cv must be passed by keyword
grid = GridSearchCV(MLPRegressor(), param_grid, cv=cv)
grid.fit(x_train, y_train)
...

What I find really strange: if I delete 'solver' from the param_grid, so that adam is selected as the default solver, everything runs perfectly fine.

However, I want to use sgd as the solver. As soon as I put it in the param_grid and don't change anything else, I get the error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

raised at the grid.fit line.

I checked my input: no NaN, no infinity, and normal values scaled between 0 and 1.

Why is that?
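
(A common cause of this error with sgd is a learning rate that is too large for the problem: the weights diverge during training, the model's predictions become NaN, and the scorer then raises the ValueError above even though the input data itself is clean. Below is a minimal sketch of a grid that also searches over the sgd step size, since smaller steps are less likely to diverge; learning_rate_init, learning_rate and max_iter are real MLPRegressor parameters, but the value ranges are only illustrative.)

from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

# Illustrative grid: tune the sgd step size alongside the architecture.
param_grid = {'hidden_layer_sizes': [(100, 100), (50, 50, 50), (100,)],
              'solver': ['sgd'],
              'learning_rate_init': [1e-4, 1e-3, 1e-2],   # eta in the sgd update
              'learning_rate': ['constant', 'adaptive'],  # 'adaptive' shrinks eta when the loss stalls
              'max_iter': [1000]}                         # upper bound on epochs

grid = GridSearchCV(MLPRegressor(random_state=0), param_grid, cv=5)
grid.fit(x_train, y_train)

(If every sgd configuration in the grid still diverges, the search fails the same way, so it can help to first find a stable learning_rate_init on a single model; see the sketch after the comments below.)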

  • Have you checked the optimizer parameters? There's no guarantee the optimizer will always converge for different learning rates. – rrrttt Feb 12 '20 at 22:13
  • Hi, what exactly do you mean by the optimizer parameters? Even if I put only the solver and nothing else in the param grid, I get that error. I just don't know what to look for – Studentpython Feb 13 '20 at 09:54
  • The update rule for sgd is the following: `w <- w - eta*gradient(w)`; this process is repeated for N epochs, and eta and N are the hyperparameters of this optimizer. Sometimes you need to change them to reach the global minimum of your loss function (see the sketch after these comments for how eta and N map onto MLPRegressor). – rrrttt Feb 13 '20 at 10:41
  • Thanks for the explanation. I tried changing the number of epochs (N), with no success, and I don't know how to change eta with MLPRegressor. I read in the scikit-learn docs that "SGDRegressor is well suited for regression problems with a large number of training samples (> 10,000), for other problems we recommend Ridge, Lasso, or ElasticNet". My training sample has only a few hundred points. Could this be the reason? – Studentpython Feb 13 '20 at 12:52
  • Stochastic gradient descent trains on batches, meaning the gradients are evaluated on small partitions of the data set. It's quite good when you are dealing with complex loss functions with tons of local minima and a large amount of data. If you have a small data set, sgd can bounce around forever without finding the global minimum, which is why scikit-learn advises you to use a simpler learning model (where the gradients are evaluated on the whole data set). – rrrttt Feb 13 '20 at 19:22
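
(Tying the comments back to MLPRegressor: eta corresponds to the learning_rate_init parameter and N to max_iter. A quick way to see whether a given sgd configuration diverges, before running the whole grid search, is to fit a single model and inspect its loss curve; this sketch assumes the x_train/y_train from the question, and the hyperparameter values are illustrative.)

import numpy as np
from sklearn.neural_network import MLPRegressor

# Fit one sgd model and inspect the training loss per epoch. A diverging run
# shows growing (or nan/inf) losses, and NaN predictions from such a model are
# exactly what makes the scorer raise "Input contains NaN..." in grid search.
mlp = MLPRegressor(solver='sgd', learning_rate_init=1e-3, max_iter=500,
                   random_state=0)
mlp.fit(x_train, y_train)

print(mlp.loss_curve_[-5:])                   # last few training losses
print(np.isnan(mlp.predict(x_train)).any())   # True would explain the ValueError

(If the losses explode, lowering learning_rate_init or setting learning_rate='adaptive' is usually the first thing to try; with only a few hundred samples, as the last comment notes, a simpler full-batch model can also be more stable.)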

0 Answers