20

I'm working on training an LSTM network on Google Cloud Machine Learning Engine, using Keras with the TensorFlow backend. After some adjustments to gcloud and my Python script, I managed to deploy my model and run a successful training job.

I then tried to make my model save checkpoints after every epoch using the Keras ModelCheckpoint callback. Running a local training job with Google Cloud works perfectly as expected: the weights are stored at the specified path after each epoch. But when I run the same job online on Google Cloud Machine Learning Engine, weights.hdf5 does not get written to my Google Cloud Storage bucket. Instead I get the following error:

...
File "h5f.pyx", line 71, in h5py.h5f.open (h5py/h5f.c:1797)
IOError: Unable to open file (Unable to open file: name = 
'gs://.../weights.hdf5', errno = 2, error message = 'no such file or
directory', flags = 0, o_flags = 0)

I investigated this issue and it turned out that there is no problem with the bucket itself, as the Keras TensorBoard callback works fine and writes the expected output to the same bucket. I also made sure that h5py gets included by listing it in the setup.py located at:

├── setup.py
└── trainer
    ├── __init__.py
    ├── ...

The actual include in setup.py is shown below:

# setup.py
from setuptools import setup, find_packages

setup(name='kerasLSTM',
      version='0.1',
      packages=find_packages(),
      author='Kevin Katzke',
      install_requires=['keras','h5py','simplejson'],
      zip_safe=False)

I guess the problem comes down to the fact that GCS cannot be accessed with Python's built-in open() for I/O, since TensorFlow instead provides a custom implementation:

import tensorflow as tf
from tensorflow.python.lib.io import file_io

with file_io.FileIO("gs://...", 'w') as f:
    f.write("Hi!")

I checked how the Keras ModelCheckpoint callback implements the actual file writing, and it turns out that it uses h5py.File() for I/O:

with h5py.File(filepath, mode='w') as f:
    f.attrs['keras_version'] = str(keras_version).encode('utf8')
    f.attrs['backend'] = K.backend().encode('utf8')
    f.attrs['model_config'] = json.dumps({
        'class_name': model.__class__.__name__,
        'config': model.get_config()
    }, default=get_json_type).encode('utf8')

And as the h5py package is a Pythonic interface to the HDF5 binary data format, h5py.File() seems to call underlying HDF5 functionality written in C, as far as I can tell: source, documentation.

How can I fix this and make the ModelCheckpoint callback write to my GCS bucket? Is there a way to "monkey patch" h5py to override how an HDF5 file is opened, so that it uses GCS's file_io.FileIO()?

Kevin Katzke
    This may not apply to CloudML but one thing you may want to explore is the GCSFUSE utility. I don't know if you can use it in the context of CloudML, but I normally use when running TF on regular Google Cloud VMs that run Ubuntu. Gcsfuse lets you map a local directory on the Ubuntu VM to a Google Cloud Bucket, so to Python the cloud bucket starts to look like a regular dir. Again, not sure if you can use it with CloudML but think about it.. – VS_FF Aug 09 '17 at 08:56
  • Thanks @VS_FF I will investigate your suggestion and give you feedback on this. – Kevin Katzke Aug 15 '17 at 07:38
  • Leaving this here for anyone who is still having the same problem. I was able to solve (well, work around) this problem by creating a custom callback to copy checkpoints into the GCS bucket after every epoch. I have already answered this on another question on Stack Overflow. Please find it here -> https://stackoverflow.com/a/69226186/15319462 – Subrahmanya Joshi Sep 17 '21 at 15:53

8 Answers

17

I might be a bit late on this, but for the sake of future visitors I'll describe the whole process of how to adapt code that previously ran locally to be Google ML Engine-aware from the I/O point of view.

  1. Python's standard open(file_name, mode) does not work with buckets (gs://...../file_name). You need to from tensorflow.python.lib.io import file_io and change all calls from open(file_name, mode) to file_io.FileIO(file_name, mode=mode) (note the named mode parameter). The interface of the opened handle is the same.
  2. Keras and/or other libraries mostly use the standard open(file_name, mode) internally. That means calls like trained_model.save(file_path) into third-party libraries will fail to store the result to the bucket. The only way to retrieve a model after the job has finished successfully is to store it locally and then move it to the bucket.

The code below is quite inefficient, because it loads the whole model at once and then dumps it to the bucket, but it worked for me for relatively small models:

model.save(file_path)

with file_io.FileIO(file_path, mode='rb') as input_f:
    with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as output_f:
        output_f.write(input_f.read())

The mode must be set to binary for both reading and writing.

When the file is relatively big, it makes sense to read and write it in chunks to decrease memory consumption.
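For example, a minimal sketch of a chunked copy (the helper name and chunk size are mine, not from any library):

import os
from tensorflow.python.lib.io import file_io

def copy_to_bucket_chunked(local_path, job_dir, chunk_size=1024 * 1024):
    # Copy local_path into job_dir (a gs:// path) one chunk at a time,
    # so the whole file never has to fit in memory.
    with file_io.FileIO(local_path, mode='rb') as src:
        with file_io.FileIO(os.path.join(job_dir, local_path), mode='wb+') as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)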

  3. Before running a real task, I would advise running a stub that simply saves a file to the remote bucket.

This implementation, temporarily put in place of the real train_model call, should do:

import argparse
import os

from tensorflow.python.lib.io import file_io

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument(
        '--job-dir',
        help='GCS location with read/write access',
        required=True
    )

    args = parser.parse_args()
    arguments = args.__dict__
    job_dir = arguments.pop('job_dir')

    with file_io.FileIO(os.path.join(job_dir, "test.txt"), mode='wb+') as of:
        of.write("Test passed.")

After a successful execution you should see the file test.txt with the content "Test passed." in your bucket.

Yulia
  • Using the FileIO method that you described, I extended the `keras.callbacks.ModelCheckpoint` callback to save checkpoints in GCS. Based on Tensorflow 2.3. https://gist.github.com/seahrh/19c8779e159da35bcdc696245a2b24f6 – ruhong Oct 08 '20 at 18:38
  • Could we instead extend save_model to write directly on GCS to avoid writing checkpoint locally then saving/uploading to GCS? – Patrick Jan 19 '21 at 15:21
6

The issue can be solved with the following piece of code:

# Save Keras ModelCheckpoints locally
model.save('model.h5')

# Copy model.h5 over to Google Cloud Storage (job_dir is the gs:// path of the training job)
with file_io.FileIO('model.h5', mode='r') as input_f:
    with file_io.FileIO(os.path.join(job_dir, 'model.h5'), mode='w+') as output_f:
        output_f.write(input_f.read())
        print("Saved model.h5 to GCS")

The model.h5 is saved on the local filesystem and then copied over to GCS. As Jochen pointed out, there is currently no easy built-in support for writing HDF5 model checkpoints to GCS. With this hack it is possible to write the data until an easier solution is provided.

Kevin Katzke
4

I faced a similar problem and the solution above didn't work for me. The file must be read and written in binary form, otherwise this error will be thrown:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

So the code will be

import os
from tensorflow.python.lib.io import file_io

def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb+') as output_f:
            output_f.write(input_f.read())
Manash Mandal
4

Here is my code that I wrote to save the model after each epoch.

import os
import warnings

import numpy as np
from keras.callbacks import ModelCheckpoint
from tensorflow.python.lib.io import file_io

class ModelCheckpointGC(ModelCheckpoint):
"""Taken from and modified:
https://github.com/keras-team/keras/blob/tf-keras/keras/callbacks.py
"""

def on_epoch_end(self, epoch, logs=None):
    logs = logs or {}
    self.epochs_since_last_save += 1
    if self.epochs_since_last_save >= self.period:
        self.epochs_since_last_save = 0
        filepath = self.filepath.format(epoch=epoch, **logs)
        if self.save_best_only:
            current = logs.get(self.monitor)
            if current is None:
                warnings.warn('Can save best model only with %s available, '
                              'skipping.' % (self.monitor), RuntimeWarning)
            else:
                if self.monitor_op(current, self.best):
                    if self.verbose > 0:
                        print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                              ' saving model to %s'
                              % (epoch, self.monitor, self.best,
                                 current, filepath))
                    self.best = current
                    if self.save_weights_only:
                        self.model.save_weights(filepath, overwrite=True)
                    else:
                        if is_development():
                            self.model.save(filepath, overwrite=True)
                        else:
                            self.model.save(filepath.split(
                                "/")[-1])
                            with file_io.FileIO(filepath.split(
                                    "/")[-1], mode='rb') as input_f:
                                with file_io.FileIO(filepath, mode='wb+') as output_f:
                                    output_f.write(input_f.read())
                else:
                    if self.verbose > 0:
                        print('Epoch %05d: %s did not improve' %
                              (epoch, self.monitor))
        else:
            if self.verbose > 0:
                print('Epoch %05d: saving model to %s' % (epoch, filepath))
            if self.save_weights_only:
                self.model.save_weights(filepath, overwrite=True)
            else:
                if is_development():
                    self.model.save(filepath, overwrite=True)
                else:
                    self.model.save(filepath.split(
                        "/")[-1])
                    with file_io.FileIO(filepath.split(
                            "/")[-1], mode='rb') as input_f:
                        with file_io.FileIO(filepath, mode='wb+') as output_f:
                            output_f.write(input_f.read())

There is a function is_development() that checks whether the code is running in the local or the gcloud environment. In the local environment I set the variable LOCAL_ENV=1:

def is_development():
    """check if the environment is local or in the gcloud
    created the local variable in bash profile
    export LOCAL_ENV=1

    Returns:
        [boolean] -- True if local env
    """
    try:
        if os.environ['LOCAL_ENV'] == '1':
            return True
        else:
            return False
    except KeyError:
        return False

Then you can use it:

ModelCheckpointGC(
    'gs://your_bucket/models/model.h5',
    monitor='loss',
    verbose=1,
    save_best_only=True,
    mode='min')

I hope that helps someone and saves some time.

Igor Markelov
3

A hacky workaround is to save to the local filesystem, then copy using the TF file_io API. I added an example to the Keras example in the GoogleCloudPlatform ML samples.

Basically it checks whether the target directory is a GCS path ("gs://") and, if so, forces h5py to write to the local filesystem first, then copies the file to GCS using the TF file_io API. See for example: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/keras/trainer/task.py#L146
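In case the link moves, here is a rough sketch of the pattern the sample uses (the function and file names here are mine, not the sample's):

import os
from tensorflow.python.lib.io import file_io

CHECKPOINT_FILE = 'checkpoint.hdf5'  # illustrative name

def save_model_for_job_dir(model, job_dir):
    # h5py can only write to a local path, so when the target is a GCS
    # bucket we save locally first and then copy the file over.
    if job_dir.startswith('gs://'):
        model.save(CHECKPOINT_FILE)
        copy_file_to_gcs(job_dir, CHECKPOINT_FILE)
    else:
        model.save(os.path.join(job_dir, CHECKPOINT_FILE))

def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb+') as output_f:
            output_f.write(input_f.read())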

Jochen
  • Thanks Jochen, the code in the Pull Request to GoogleCloudPlatform indeed solved the issue. If you edited your answer to include a description of how the hack works along with a complete working code example I will mark it as accepted. – Kevin Katzke Sep 17 '17 at 19:31
  • @Jochen, should `mode='w+'` be `mode='wb+'` in line 138 as @Manash pointed out? – Maosi Chen Jun 11 '19 at 22:56
2

For me the easiest way is to use gsutil.

model.save('model.h5')
!gsutil -m cp model.h5 gs://name-of-cloud-storage/model.h5
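The ! prefix is notebook (IPython/Colab) syntax. In a plain training script you could shell out to gsutil instead, assuming the gsutil CLI is available and authenticated on the machine running the job (a sketch, not part of the original answer):

import subprocess

model.save('model.h5')
# Requires the gsutil CLI on the machine running the job
subprocess.check_call(
    ['gsutil', '-m', 'cp', 'model.h5', 'gs://name-of-cloud-storage/model.h5'])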
Aurélien
1

I am not sure why this is not mentioned already, but there is a solution where you don't need to add a copy function in your code.

Install gcsfuse following these steps:

export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse

Then mount your bucket locally:

mkdir bucket
gcsfuse <cloud_bucket_name> bucket

and then use the local directory bucket/ as the logdir of your model.
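As a minimal sketch (model, x_train and y_train stand in for your own objects), a ModelCheckpoint pointed at the mounted directory then behaves like any other local path:

from keras.callbacks import ModelCheckpoint

# The mounted bucket/ directory looks like a normal local folder to Keras
checkpoint = ModelCheckpoint('bucket/weights.{epoch:02d}.hdf5')
model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint])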

Syncing between the cloud bucket and the local directory is handled for you, and your code can stay clean.

Hope it helps :)

gdaras
0
tf.keras.models.save_model(model, filepath, save_format="tf")

save_format: Either 'tf' or 'h5', indicating whether to save the model to Tensorflow SavedModel or HDF5. Defaults to 'tf' in TF 2.X, and 'h5' in TF 1.X.
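Because the SavedModel writer goes through TensorFlow's own file system layer, which understands gs:// paths, the destination can be the bucket directly. A minimal sketch (the bucket path is illustrative and model is assumed to be a compiled tf.keras model):

import tensorflow as tf

# Writes a SavedModel directory straight to the bucket; no h5py involved
tf.keras.models.save_model(model, 'gs://your-bucket/models/my_model', save_format='tf')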

fiona