
I have X_train and y_train as two numpy.ndarrays of shape (32561, 108) and (32561,) respectively.

I am getting a MemoryError every time I call fit on my GaussianProcessClassifier.

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> X_train.shape
(32561, 108)
>>> y_train.shape
(32561,)
>>> gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
>>> gp_opt.fit(X_train,y_train)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 613, in fit
    self.base_estimator_.fit(X, y)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 209, in fit
    self.kernel_.bounds)]
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 427, in _constrained_optimization
    fmin_l_bfgs_b(obj_func, initial_theta, bounds=bounds)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 199, in fmin_l_bfgs_b
    **opts)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 335, in _minimize_lbfgsb
    f, g = func_and_grad(x)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/lbfgsb.py", line 285, in func_and_grad
    f = fun(x, *args)
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
    return function(*(wrapper_args + args))
  File "/home/retsim/anaconda2/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 63, in __call__
    fg = self.fun(x, *args)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 201, in obj_func
    theta, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/gpc.py", line 338, in log_marginal_likelihood
    K, K_gradient = kernel(self.X_train_, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 753, in __call__
    K1, K1_gradient = self.k1(X, Y, eval_gradient=True)
  File "/home/retsim/.local/lib/python2.7/site-packages/sklearn/gaussian_process/kernels.py", line 1002, in __call__
    K = self.constant_value * np.ones((X.shape[0], Y.shape[0]))
  File "/home/retsim/.local/lib/python2.7/site-packages/numpy/core/numeric.py", line 188, in ones
    a = empty(shape, dtype, order)
MemoryError
>>> 

Why am I getting this error, and how can I fix it?

yalpsid eman

2 Answers


According to the Scikit-Learn documentation, the estimator GaussianProcessClassifier (as well as GaussianProcessRegressor) has a parameter copy_X_train, which is set to True by default:

class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1)

The description for the parameter copy_X_train notes that:

If True, a persistent copy of the training data is stored in the object. Otherwise, just a reference to the training data is stored, which might cause predictions to change if the data is modified externally.

I tried fitting the estimator on a training dataset of similar size (in observations and features) to the OP's, on a PC with 32 GB of RAM. With copy_X_train set to True, 'a persistent copy of the training data' was possibly eating up my RAM, resulting in a MemoryError. Setting this parameter to False fixed the issue.
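As a minimal sketch (using the OP's kernel; the only change is the copy_X_train argument):

>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> # Store only a reference to the training data instead of a persistent copy
>>> gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0), copy_X_train=False)
>>> gp_opt.fit(X_train, y_train)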

Scikit-Learn's description notes that with this setting 'just a reference to the training data is stored, which might cause predictions to change if the data is modified externally'. My interpretation of this statement is:

Instead of storing a persistent copy of the whole training dataset (an n x d matrix of n observations and d features) in the fitted estimator, only a reference to that dataset is stored, which avoids the extra memory usage. As long as the dataset stays intact externally (i.e., outside the fitted estimator), it can be reliably fetched whenever a prediction has to be made; modifying it externally, however, affects the predictions, as the sketch below illustrates.
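A hypothetical demonstration of that caveat (X_test is assumed to exist; whether predictions actually change can depend on dtype and memory layout, since fit may still copy the data when it has to convert the input):

>>> gp_opt = GaussianProcessClassifier(copy_X_train=False)
>>> gp_opt.fit(X_train, y_train)
>>> pred_before = gp_opt.predict(X_test)
>>> X_train[:] = 0                       # in-place external modification
>>> pred_after = gp_opt.predict(X_test)  # may now differ from pred_before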

There may be better interpretations and theoretical explanations.

gau

On line 400 of gpc.py, the implementation of the classifier you're using, a matrix of shape (N, N) is created, where N is the number of observations. So the code is trying to allocate a matrix of shape (32561, 32561). That will obviously cause problems, since that matrix has over a billion elements.
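For scale, a quick back-of-the-envelope check (np.ones allocates float64 by default, i.e. 8 bytes per element):

>>> n = 32561
>>> n * n             # elements in one (N, N) matrix
1060218721
>>> n * n * 8 / 1e9   # ~8.5 GB for a single matrix, before any gradients
8.481749768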

As to why it's doing this, I don't really know scikit-learn's implementation, but in general, Gaussian processes require a covariance (kernel) matrix over every pair of training points, so memory grows quadratically with the number of observations. That's why they don't scale well to large datasets. (Separately, the docs note that GPs also lose efficiency in high-dimensional spaces, meaning anything more than a few dozen features.)

My only recommendation for how to fix it is to work in batches. Scikit-learn may have utilities to generate batches for you, or you can do it manually, as in the sketch below.
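The crudest manual version is to fit on a random subsample, which caps the size of the kernel matrix (a sketch; n_sub is an arbitrary illustration, so tune it to your available RAM):

>>> import numpy as np
>>> from sklearn.gaussian_process import GaussianProcessClassifier
>>> from sklearn.gaussian_process.kernels import RBF
>>> n_sub = 2000   # kernel matrix becomes (2000, 2000), ~32 MB
>>> idx = np.random.RandomState(0).choice(X_train.shape[0], n_sub, replace=False)
>>> gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0))
>>> gp_opt.fit(X_train[idx], y_train[idx])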

bnaecker
  • +1. If you already know this and have a lot of RAM, so you expect it to work, then double-check that you are running 64-bit Python. 32-bit Python on a 64-bit OS will probably only be able to access 2GB of RAM. – Stev Mar 28 '18 at 08:46
  • The link doesn't work anymore. `gpc.py` was moved to [`_gpc.py`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/gaussian_process/_gpc.py) – jonas Oct 17 '20 at 10:59