
I'm using the following code to save checkpoints while a Google Cloud Build run trains my model:

    cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath="gs://mybucket/checkpoints",
                                                     verbose=0,
                                                     save_weights_only=True,
                                                     monitor='val_loss',
                                                     mode='min',
                                                     save_best_only=True)

I'm getting no errors in my build logs, but the only thing in the bucket after each run is a tf_cloud_train_tar file containing the source directory contents.

I'm using callbacks = [cp_callback] in model.fit.
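
Roughly, the training call looks like this (dataset names and epoch count are placeholders):

    # placeholder training call; only the callbacks argument matters here
    model.fit(train_dataset,
              validation_data=val_dataset,
              epochs=10,
              callbacks=[cp_callback])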

Adam Pelah
  • Please see [this](https://stackoverflow.com/questions/45585104/save-keras-modelcheckpoints-in-google-cloud-bucket) SO question, I think it could be of help. Basically, save the model locally and then write it to GCS. It is the same approach proposed in this Keras [issue](https://github.com/keras-team/keras/issues/7935). – jccampanero Dec 30 '20 at 23:42
  • I actually don't even need or want it on GCS, I'd rather it be written locally, but when running it using Google Cloud Build through TensorFlow Cloud it doesn't seem to save locally either. – Adam Pelah Dec 31 '20 at 00:46
  • Are you setting the path correctly? It should be something like this `'/home/jupyter/checkpoint/best_model_{epoch}.h5',` – yudhiesh Dec 31 '20 at 02:44
  • @yudhiesh Yes I am. On [this guide](https://blog.tensorflow.org/2020/08/train-your-tensorflow-model-on-google.html) they say that checkpoints can be used as long as the storage destination is in the google bucket. I've tried a path to the bucket and a local one, nothing is being stored. – Adam Pelah Dec 31 '20 at 15:39

2 Answers


I was having this problem for several reasons:

  • The dataset was not in the storage bucket, so the code had no access to it.
  • Using a generator for a dataset whose files are missing creates an infinite loop, but no crash.

I switched to AI Platform, sourced my data from the GCS bucket, and the problem was fixed.
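
Reading the data directly from the bucket can look roughly like this (the gs:// path and the TFRecord format are assumptions, not the original setup):

    import tensorflow as tf

    # list training files directly from the bucket (path is a placeholder)
    train_files = tf.io.gfile.glob("gs://mybucket/data/train-*.tfrecord")
    dataset = tf.data.TFRecordDataset(train_files).batch(32)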

Adam Pelah

Leaving this here for anyone who may be having the same problem.

I was also having the same problem while training my model on AI Platform. No matter what I did, the ModelCheckpoint callback was not able to save directly to GCS.

I was able to solve it by creating a custom callback. We can create a callback to do anything we want, at multiple points during an epoch, by inheriting the Callback class from the tensorflow.keras.callbacks module and overriding the required methods.

I made the ModelCheckpoint callback write to a local directory and created a custom callback to copy those checkpoint files to the GCS bucket.
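
A rough sketch of that idea (directory names and paths are placeholders, and this is not the exact code from the repo linked below):

    import os
    import tensorflow as tf

    class CopyCheckpointsToGCS(tf.keras.callbacks.Callback):
        """Copies locally saved checkpoint files to a GCS bucket at the end of each epoch."""

        def __init__(self, local_dir, gcs_dir):
            super().__init__()
            self.local_dir = local_dir
            self.gcs_dir = gcs_dir

        def on_epoch_end(self, epoch, logs=None):
            # mirror every checkpoint file written so far to the bucket
            for fname in tf.io.gfile.listdir(self.local_dir):
                tf.io.gfile.copy(os.path.join(self.local_dir, fname),
                                 os.path.join(self.gcs_dir, fname),
                                 overwrite=True)

    # ModelCheckpoint writes locally; the custom callback mirrors the files to GCS
    os.makedirs("checkpoints", exist_ok=True)
    cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath="checkpoints/best_model.h5",
                                                     save_weights_only=True,
                                                     monitor='val_loss',
                                                     mode='min',
                                                     save_best_only=True)
    gcs_callback = CopyCheckpointsToGCS("checkpoints", "gs://mybucket/checkpoints")
    # model.fit(..., callbacks=[cp_callback, gcs_callback])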

The implementation is available in my github repo here -> https://github.com/Subrahmanyajoshi/Cancer-Detection-using-GCP/blob/07845c1f0c86b727e5ce043a3db4d4cb0e5ed1df/detectors/tf_gcp/trainer/callbacks.py#L10