0

I am using the latest version of PYMC3 / Theano to compute the MAP estimate for the following simple logistic regression model:

w ~ N(0, 1)
y ~ Bin(σ(Xw))

My features are strictly categorical, and I am using one-hot encoding to transform the design matrix into a sparse (CSR) matrix of 0's and 1's. The sparse matrix occupies roughly 15MB, but I am seeing memory spikes up to 4GB. Incidentally, this is the size of a dense matrix with the same dimensions.

Below is a simplified version of my code,

from __future__ import division, print_function

import numpy as np
import pymc3 as pm
from scipy import sparse as sp
from scipy.optimize import fmin_l_bfgs_b
from scipy.special import expit
from theano import sparse as S
np.random.seed(0)

# properties of sparse design matrix (taken from the real data)
N = 100000  # number of samples
M = 5000    # number of dimensions
D = 0.002   # matrix density

# fake data
mu0, sd0 = 0.0, 1.0
w = np.random.normal(mu0, sd0, M)
X = sp.random(N, M, density=D, format='csr', data_rvs=np.ones)
y = np.random.binomial(1, expit(X.dot(w)), N)

# estimate memory usage
size = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes + y.nbytes
print('{:.2f} MB of data'.format(size / 1024 ** 2))

# model definition
with pm.Model() as model:
    w = pm.Normal('w', mu0, sd=sd0, shape=M)

    p = pm.sigmoid(S.dot(X, w.reshape((-1, 1)))).flatten()
    pm.Bernoulli('y', p, observed=y)

    print(pm.find_MAP(vars=[w], fmin=fmin_l_bfgs_b))

which produces the following memory profile: memory profile

Even if the L-BFGS optimizer computed the full dense Hessian (which it doesn't), that would take up only 190MB. I suspect that somewhere along the way, X is automatically converted into a Numpy array to implement some Theano op.

Has anyone successfully worked with sparse matrices in PYMC3? Any ideas where exactly the problem could lie?

Igor Raush
  • 15,080
  • 1
  • 34
  • 55
  • 1
    I have no experience with sparse-matrices and PyMC3. But using you model I can reproduce the memory error and I also have trouble sampling from it with NUTS or ADVI. Metropolis runs, but seems unstable (but maybe this is due to the way the problem is formulated). May be you should open an issue on [github](https://github.com/pymc-devs/pymc3). – aloctavodia Jun 03 '16 at 12:19
  • @aloctavodia Thanks for confirming. Will file an issue later today. – Igor Raush Jun 03 '16 at 16:52
  • I could not find your issue. Did you raise it? :) – Roelant Dec 29 '21 at 10:00

0 Answers0