
I have been trying to (crudely) train and save a GPflow SVGP model on a toy dataset, largely following this notebook example.

Upon saving the model using pickle (I appreciate this is not recommended, but I don't believe it is the main issue here), I discovered some unusual and, I presume, unintended behaviour: if we do not call gpflow.utilities.freeze(model) before trying to pickle model, we get an error. If we do call gpflow.utilities.freeze(model) (discarding the returned frozen model), then model can be pickled without error.

To reproduce

Minimal, reproducible example

import numpy as np
import gpflow
import tensorflow as tf
import pickle
rng = np.random.RandomState(123)

N = 10000  # Number of training observations
X = rng.rand(N, 1)
Y = rng.randn(N, 1)
data = (X, Y)

n_inducing_vars = 100
Z = X[:n_inducing_vars]
minibatch_size = 100
n_iterations = 100

#Define model object
model = gpflow.models.SVGP(
    gpflow.kernels.Matern12(),
    gpflow.likelihoods.Bernoulli(),
    inducing_variable=Z,
    num_data=N,
)

#Create minibatch object
data_minibatch = (
    tf.data.Dataset.from_tensor_slices(data)
    .prefetch(N)
    .repeat()
    .shuffle(N)
    .batch(minibatch_size)
)
data_minibatch_it = iter(data_minibatch)
model_objective = model.training_loss_closure(data_minibatch_it)

#Define optimiser
optimizer = tf.keras.optimizers.Adam(0.001)
#Optimise both variational parameters and kernel hyperparameters.
for step in range(n_iterations):
    optimizer.minimize(model_objective, var_list=model.trainable_variables)

freeze = False
if not freeze:
    # pickle doesn't work
    pickle.dump(model, open('test1', 'wb'))
else:
    # if following code is executed, pickle works fine
    _ = gpflow.utilities.freeze(model)  # ignore return value
    pickle.dump(model, open('test1', 'wb'))

Stack trace, or error message

TypeError                                 Traceback (most recent call last)
<ipython-input-6-3d5f537ca994> in <module>
----> 1 pickle.dump(model, open('test1', 'wb'))

TypeError: can't pickle HashableWeakRef objects

Expected behaviour

I am not saying that I expected the pickle to work in the first case, as I know it isn't the recommended way of saving TensorFlow-related objects in general. However, I certainly wouldn't expect it to fail in the first case but succeed in the second. From looking at the codebase, I don't believe gpflow.utilities.freeze(model) should be mutating model, which it appears to be doing.

System information

  • Tested with GPflow versions 2.0.0 ... 2.0.4
  • TensorFlow version: 2.1.0, tensorflow_probability 0.9.0
  • Python version: Python 3.6.9

I would guess that calling freeze on model is somehow converting model itself into a "frozen" model, which then has the "constant" properties (https://gpflow.readthedocs.io/en/master/notebooks/intro_to_gpflow2.html#TensorFlow-saved_model) that enable it to be pickled.

Any clarity on this matter would be very much appreciated.

Note I posted this as an issue on the gpflow github (https://github.com/GPflow/GPflow/issues/1493), but it was decided that this issue should be broadcast here to the wider gpflow community.


2 Answers


This behaviour applies to any code/model that uses tensorflow_probability's bijectors, and is not restricted to the SVGP model. In GPflow, bijectors are used to constrain parameters, e.g. to ensure that kernel variances and lengthscales are always positive.

The underlying explanation is that tensorflow_probability's bijectors keep a cache of the tensors they have operated on, which allows them, for example, to recover the exact original tensor object in the snippet below:

import tensorflow as tf
import tensorflow_probability as tfp
bij = tfp.bijectors.Exp()
x = tf.constant(1.2345)
y = bij.forward(x)
assert bij.inverse(y) is x  # actual object identity, not just numerical equivalence

These caches, however, use HashableWeakRef objects, which can't be pickled - or even copied (using the Python stdlib's copy.deepcopy function).
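You can see this directly by continuing the snippet above (an illustrative sketch; the exact error wording varies across Python/TFP versions):

import copy

# bij has been used above, so its cache now contains HashableWeakRef keys
try:
    copy.deepcopy(bij)
except TypeError as e:
    print(e)  # something like: can't pickle HashableWeakRef objects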

The caches only get populated when you actually run tensors through the bijector - if you just create the model and don't optimise it, you can pickle (or copy) it just fine. But of course that's not very useful in general.

To work around this issue and allow copying even of "used" (e.g. trained) models, we have gpflow.utilities.reset_cache_bijectors(), which is called by gpflow.utilities.deepcopy() to make the copy possible. gpflow.utilities.freeze() in turn needs to deepcopy so that it can give you a frozen copy instead of freezing the model in-place, which explains the minor side effect you observed.
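For example, on the trained model from your question (a sketch; gpflow.utilities.deepcopy resets the caches for you):

model_copy = gpflow.utilities.deepcopy(model)  # works even after training
# whereas a plain copy.deepcopy(model) fails with the HashableWeakRef error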

So it is not the freeze itself that is required to enable you to pickle the model; it is sufficient to call reset_cache_bijectors(model) before pickling, replacing the code in your example with:

if not freeze:
    gpflow.utilities.reset_cache_bijectors(model)  # with this added call, pickle *does* work
    pickle.dump(model, open('test1', 'wb'))

Ultimately, this is an issue that can only be fixed "properly" upstream, by tensorflow_probability in their own code. More details can be found in this pull request to tensorflow_probability by awav that aims to address it.

On a side note, as pointed out by markvdw, you might find it easier to save all of the model's parameter values using gpflow.utilities.read_values() (which returns a dict mapping parameter paths to values). You can store that dict in any way you like, and re-load it by first re-creating the model object and then assigning the parameters with gpflow.utilities.multiple_assign().
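A minimal sketch of that approach, reusing the setup from your question (the file name is just a placeholder):

# save: read_values returns a plain dict {parameter path: numpy value}, safe to pickle
params = gpflow.utilities.read_values(model)
with open('svgp_params.pkl', 'wb') as f:
    pickle.dump(params, f)

# load: re-create an identical (untrained) model, then restore the values
new_model = gpflow.models.SVGP(
    gpflow.kernels.Matern12(),
    gpflow.likelihoods.Bernoulli(),
    inducing_variable=Z,
    num_data=N,
)
with open('svgp_params.pkl', 'rb') as f:
    params = pickle.load(f)
gpflow.utilities.multiple_assign(new_model, params)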

– STJ

Let's see what happens in the lines below:

if not freeze:
    # pickle doesn't work
    pickle.dump(model, open('test1', 'wb'))  # Line 1
else:
    # if following code is executed, pickle works fine
    gpflow.utilities.freeze(model)           # Line 2
    pickle.dump(model, open('test1', 'wb'))  # Line 3

In Line 1, the trained model contains Parameter instances that hold TensorFlow Probability bijectors as transformers from the constrained to the unconstrained space and back. A TFP bijector caches all of its forward and inverse computations. The cache is implemented as a map whose keys are the tensor inputs of the forward and inverse functions and whose values are the returned tensors. Unfortunately, tensors (like np.ndarray objects) are not hashable, so TFP wraps them in HashableWeakRef objects for this purpose. The error message "TypeError: can't pickle HashableWeakRef objects" is misleading: it actually means that Python cannot make a copy of such an instance, because it is only a weak reference to another object rather than the object itself. As a consequence, these objects cannot be pickled.
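The weak-reference part is not specific to TFP; an ordinary weakref.ref cannot be pickled either, which is where the error ultimately comes from:

import pickle
import weakref

class Foo:
    pass

obj = Foo()
ref = weakref.ref(obj)
pickle.dumps(ref)  # raises TypeError: can't pickle weakref objects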

In Line 2, the freeze function consists of two steps: the first clears the bijectors' caches, and the second is a copy.deepcopy. The magic behind freeze is that it removes these references. Yes, it modifies the existing object, but it affects neither eager computations nor tf.function-compiled functions. The cache cleaning is what makes deepcopy possible.
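Schematically, and only as a simplified sketch (not the actual GPflow source), freeze behaves like this:

import copy

def freeze_sketch(module):
    # step 1: clear the bijector caches *on the original module* - this is the side effect
    gpflow.utilities.reset_cache_bijectors(module)
    # step 2: deepcopy is now possible, since no HashableWeakRef keys remain
    frozen = copy.deepcopy(module)
    # (the real freeze additionally converts Parameters to constants before returning)
    return frozen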

Line 3 then works because the object no longer holds the references that cannot be copied.

This issue has a long history of reports and attempted fixes in GPflow and TFP: TFP#547, TFP#944, GPflow#1479, GPflow#1293, GPflow#1338.

This is a proposed fix in TensorFlow Probability: TFP#947.

– Artem Artemev
  • if I understand correctly, once this issue is solved in TFP, the `weakref` objects associated with the caches will be deep copyable, so that `bijectors` containing caches can be `pickled`/`deepcopied`? Consequently, referring back to my example, a `gpflow` model could be pickled without having to call `gpflow.utilities.reset_cache_bijectors` (as in the proposed temporary solution)? Finally this means that this operation doesn't need to be performed in `gpflow.utilities.deepcopy` either, meaning that `gpflow.freeze` will no longer mutate model in the original example? – KamKam Jun 19 '20 at 10:45