This question is about the wonderful pymc3 module. I want to save an MCMC result to disk so that later, when new data comes in, I can use sample_ppc without having to train again. Here is some code that does exactly that, borrowed from PyMC3's documentation on posterior predictive checks:

from theano import shared
import numpy as np
import pymc3 as pm

def learn():
    def invlogit(x):
        return np.exp(x) / (1 + np.exp(x))

    coeff = 1
    predictors = np.random.normal(size=int(1e6))  # size must be an integer, not the float 1e6

    predictors_shared = shared(predictors)
    outcomes = np.random.binomial(1, invlogit(coeff * predictors))

    def tinvlogit(x):
        import theano.tensor as t
        return t.exp(x) / (1 + t.exp(x))

    with pm.Model() as model:
        coeff = pm.Normal('coeff', mu=0, sd=1)
        p = tinvlogit(coeff * predictors_shared)
        o = pm.Bernoulli('o', p, observed=outcomes)
        trace = pm.sample(5000, n_init=5000)

    # reduce the shared variable memory requirement
    predictors_shared.set_value(np.zeros(1))

    return {'trace': trace, 'model': model, 'predictors_shared': predictors_shared}

def predict(trace, model, predictors_shared):
    predictors_oos = np.random.normal(size=50)
    predictors_shared.set_value(predictors_oos)
    return pm.sample_ppc(trace, model=model, samples=500)

First we learn:

import pickle
learned_result = learn()
with open('some/file.pkl', 'wb') as f:
    pickle.dump(learned_result, f)

Then we unpickle and make predictions for new data:

with open('some/file.pkl', 'rb') as f:
    learned_result = pickle.load(f)  # pickle.load takes only the file object
ppc = predict(**learned_result)

This works great except for a storage problem: the pickled learned_result is huge. The killer is model. Judging by the relative sizes of the pickled pieces, I think the model internally stores the entire training dataset. Is there a way to delete the internally stored data from the model object? Will sample_ppc still work if I do this? Is there some theoretical reason why model has to keep the entire training dataset in memory in order to do a posterior predictive check? Thanks in advance for any help.
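
The size comparison mentioned above can be reproduced with a short check. This is a minimal sketch using only the standard library; the helper name pickled_sizes and the stand-in objects are hypothetical, and in practice the dict returned by learn() would be passed in instead.

```python
import pickle


def pickled_sizes(components):
    """Hypothetical helper: bytes needed to pickle each entry on its own."""
    return {name: len(pickle.dumps(obj)) for name, obj in components.items()}


# Stand-in objects; in the question this would be the dict returned by learn().
example = {"small": [0.0] * 10, "large": bytes(10**6)}
sizes = pickled_sizes(example)

# Print the entries from largest to smallest.
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
    print(name, size)
```

Note that pickling each entry separately counts any object shared between entries more than once, so the per-entry sizes can add up to more than the size of the combined pickle.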

Charles F

1 Answer

Save the trace, not the model. You can then rebuild the same model and reuse the saved trace without having to run the sampler again.

    In that case, can you give me a snippet of code that shows how `pm.sample_ppc` can be made aware of the preserved `trace`? – Charles F Jun 28 '18 at 16:29
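
To illustrate why the training data is not needed at prediction time, here is a minimal NumPy-only sketch. It is not PyMC3's own mechanism. It assumes the posterior draws of coeff have been pulled out of the trace as a plain array (in PyMC3, trace['coeff'] returns one). The stand-in draws, the file name coeff_draws.npy, and the helper posterior_predictive are all hypothetical.

```python
import os
import tempfile

import numpy as np


def invlogit(x):
    """Logistic function, as in the question."""
    return 1.0 / (1.0 + np.exp(-x))


def posterior_predictive(coeff_draws, predictors_new, samples=500, seed=0):
    """Hypothetical helper: simulate outcomes for new predictors.

    Uses only the saved posterior draws, never the training data.
    """
    rng = np.random.RandomState(seed)
    # Pick `samples` posterior draws at random (with replacement).
    picked = rng.choice(coeff_draws, size=samples, replace=True)
    # One row of success probabilities per picked draw.
    p = invlogit(picked[:, None] * predictors_new[None, :])
    # One Bernoulli outcome per (draw, predictor) pair.
    return rng.binomial(1, p)


# Stand-in for trace['coeff']; real draws would come from pm.sample.
coeff_draws = np.random.RandomState(1).normal(loc=1.0, scale=0.01, size=5000)

# Persist only the draws: roughly 40 kB here, whatever the training-set size.
path = os.path.join(tempfile.mkdtemp(), "coeff_draws.npy")
np.save(path, coeff_draws)

# Later, in a fresh process: reload the draws and predict for new data.
loaded = np.load(path)
predictors_oos = np.random.RandomState(2).normal(size=50)
ppc = posterior_predictive(loaded, predictors_oos, samples=500)
print(ppc.shape)  # (500, 50)
```

The same idea carries over to PyMC3 itself: rebuild the model inside a `with pm.Model()` block using a small placeholder for the shared predictors, then pass the reloaded trace to `pm.sample_ppc(trace, model=model)`. Some PyMC3 releases also provide `pm.save_trace` and `pm.load_trace` for persisting the trace; check whether your installed version has them before relying on them.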