I am using the latest version of PyMC3 / Theano to compute the MAP estimate for the following simple logistic regression model:
w ~ N(0, 1)
y ~ Bernoulli(σ(Xw))
My features are strictly categorical, and I am using one-hot encoding to transform the design matrix into a sparse (CSR) matrix of 0's and 1's. The sparse matrix occupies roughly 15 MB, yet I am seeing memory spikes of up to 4 GB, which, incidentally, is about the size of a dense float64 matrix with the same dimensions.
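For context, the encoding happens before the model is built. A minimal sketch of that step is below; the one_hot_csr helper and the two example columns are made up for illustration, and my real data has more categorical columns. In the simplified reproduction further down I skip this and just generate a random sparse matrix with the same density.
import numpy as np
from scipy import sparse as sp

def one_hot_csr(codes, n_categories):
    # one-hot encode an integer-coded categorical column as an (n x k) CSR matrix
    n = len(codes)
    return sp.csr_matrix((np.ones(n), (np.arange(n), codes)),
                         shape=(n, n_categories))

# e.g. two categorical columns, stacked horizontally into the design matrix
col_a = np.random.randint(0, 10, size=1000)   # feature with 10 categories
col_b = np.random.randint(0, 7, size=1000)    # feature with 7 categories
X_onehot = sp.hstack([one_hot_csr(col_a, 10), one_hot_csr(col_b, 7)], format='csr')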
Below is a simplified version of my code:
from __future__ import division, print_function
import numpy as np
import pymc3 as pm
from scipy import sparse as sp
from scipy.optimize import fmin_l_bfgs_b
from scipy.special import expit
from theano import sparse as S
np.random.seed(0)
# properties of sparse design matrix (taken from the real data)
N = 100000 # number of samples
M = 5000 # number of dimensions
D = 0.002 # matrix density
# fake data
mu0, sd0 = 0.0, 1.0
w = np.random.normal(mu0, sd0, M)
X = sp.random(N, M, density=D, format='csr', data_rvs=np.ones)
y = np.random.binomial(1, expit(X.dot(w)), N)
# estimate memory usage
size = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes + y.nbytes
print('{:.2f} MB of data'.format(size / 1024 ** 2))
# model definition
with pm.Model() as model:
    w = pm.Normal('w', mu0, sd=sd0, shape=M)
    p = pm.math.sigmoid(S.dot(X, w.reshape((-1, 1)))).flatten()
    pm.Bernoulli('y', p, observed=y)
    print(pm.find_MAP(vars=[w], fmin=fmin_l_bfgs_b))
which produces the following memory profile:
[memory profile plot: usage repeatedly spiking to roughly 4 GB]
Even if the L-BFGS optimizer computed the full dense Hessian (which it doesn't), that would take up only about 190 MB. I suspect that somewhere along the way X is automatically converted into a dense NumPy array to implement some Theano op.
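For reference, the back-of-the-envelope numbers behind those figures (assuming 8-byte floats, with N and M as defined in the listing above):
print('{:.1f} MB'.format(M * M * 8 / 1024 ** 2))   # full dense Hessian: ~190.7 MB
print('{:.1f} GB'.format(N * M * 8 / 1024 ** 3))   # dense N x M design matrix: ~3.7 GB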
Has anyone successfully worked with sparse matrices in PyMC3? Any ideas where exactly the problem could lie?
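In case it helps with locating the conversion, this is the kind of check I had in mind (just a sketch; I have not confirmed it catches the culprit): compile the model's log-probability and inspect the optimized graph for DenseFromSparse ops.
import theano
logp_fn = theano.function(model.vars, model.logpt)   # model from the listing above
theano.printing.debugprint(logp_fn)   # scan the optimized graph for DenseFromSparse ops applied to X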