This question is related to the wonderful pymc3 module. I want to save an MCMC result to disk so that, when new data comes in later, I can use sample_ppc without having to train again. Here is some code that does exactly that, borrowed from PyMC3's documentation on posterior predictive checks:
from theano import shared
import theano.tensor as t
import numpy as np
import pymc3 as pm


def learn():
    def invlogit(x):
        # numpy version, used to simulate the training outcomes
        return np.exp(x) / (1 + np.exp(x))

    def tinvlogit(x):
        # theano version, used inside the model graph
        return t.exp(x) / (1 + t.exp(x))

    coeff = 1
    predictors = np.random.normal(size=int(1e6))  # size must be an int
    predictors_shared = shared(predictors)
    outcomes = np.random.binomial(1, invlogit(coeff * predictors))

    with pm.Model() as model:
        coeff = pm.Normal('coeff', mu=0, sd=1)
        p = tinvlogit(coeff * predictors_shared)
        o = pm.Bernoulli('o', p, observed=outcomes)
        trace = pm.sample(5000, n_init=5000)

    # reduce the shared variable memory requirement
    predictors_shared.set_value(np.zeros(1))
    return {'trace': trace, 'model': model, 'predictors_shared': predictors_shared}


def predict(trace, model, predictors_shared):
    # swap the out-of-sample predictors into the shared variable
    predictors_oos = np.random.normal(size=50)
    predictors_shared.set_value(predictors_oos)
    return pm.sample_ppc(trace, model=model, samples=500)
First we learn:
import pickle

learned_result = learn()
with open('some/file.pkl', 'wb') as f:
    pickle.dump(learned_result, f)
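For reference, this is how I compare the sizes involved (the path is the same placeholder as above); the "relative sizes" I mention below come from a check like this:

import os
print(os.path.getsize('some/file.pkl'))  # total bytes on disk
# pickle each piece separately to see which one dominates
for key, value in learned_result.items():
    print(key, len(pickle.dumps(value)))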
Then we unpickle and make predictions for new data:
with open('some/file.pkl', 'rb') as f:
    learned_result = pickle.load(f)
ppc = predict(**learned_result)
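As a sanity check on what predict returns: sample_ppc gives back a dict keyed by the observed variable's name ('o' here), so I inspect the out-of-sample predictions like this (shapes assume the samples=500 and 50 new points above):

print(ppc['o'].shape)         # (500, 50): one row per posterior predictive draw
print(ppc['o'].mean(axis=0))  # averaged over draws: predicted P(outcome=1) per new point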
And this works great except for a storage problem -- the pickled learned_result is huge. The killer is model. Judging by the relative sizes, I think the model object is internally storing the entire training dataset. Is there a way to delete the internally stored data from the model object? Will my sample_ppc still work if I do this? Is there some theoretical reason why model has to keep the entire training dataset in memory in order to do a posterior predictive check? Thanks in advance for any help.
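For what it's worth, the workaround I have in mind (untested; it assumes pymc3 accepts a theano shared variable for observed=, the way it does for the predictors) is to make outcomes a shared variable as well and shrink it before pickling, mirroring what learn() above already does with predictors_shared:

outcomes_shared = shared(outcomes)

with pm.Model() as model:
    coeff = pm.Normal('coeff', mu=0, sd=1)
    p = tinvlogit(coeff * predictors_shared)
    o = pm.Bernoulli('o', p, observed=outcomes_shared)
    trace = pm.sample(5000, n_init=5000)

# shrink both shared variables before pickling; the dtype must match the
# original array or theano's set_value will complain
predictors_shared.set_value(np.zeros(1))
outcomes_shared.set_value(np.zeros(1, dtype=outcomes.dtype))

Whether sample_ppc still behaves correctly after the observed values have been swapped out like this is exactly what I am unsure about.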