Parallel Keras model training using Python multiprocessing

Question

I am training multiple Keras MLP models simultaneously on a 64-core CPU workstation. To do this, I use a Python multiprocessing pool so that each CPU core trains one model. For each model I use EarlyStopping and ModelCheckpoint callbacks, defined as follows:

es = EarlyStopping(monitor='val_mse', mode='min', verbose=VERBOSE_ALL, patience=10)
mc = ModelCheckpoint('best_model.h5', monitor='val_mse', mode='min', verbose=VERBOSE_ALL, save_best_only=True)

With a single model, training runs without any problems. When I use the multiprocessing pool, however, I run into issues with the callbacks. An HDF5 model-saving error comes up:

Traceback (most recent call last):
  File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1029, in _save_model
    self.model.save(filepath, overwrite=True)
  File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1008, in save
    signatures, options)
  File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\saving\save.py", line 112, in save_model
    model, filepath, overwrite, include_optimizer)
  File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\saving\hdf5_format.py", line 92, in save_model_to_hdf5
    f = h5py.File(filepath, mode='w')
  File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\h5py\_hl\files.py", line 394, in __init__
    swmr=swmr)
  File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\h5py\_hl\files.py", line 176, in make_fid
    fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
  File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py\h5f.pyx", line 105, in h5py.h5f.create
OSError: Unable to create file (file signature not found)

This error occurs sporadically, and I can catch the exception and repeat the model training. But is there a way to avoid the issue altogether, for example by setting flags or using a different checkpoint file format?

TensorFlow version: 2.1.0

Keras version: 2.3.1

Imports:

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint

Make sure that every time you create an `h5` file, you close it with `f.close()` before opening or creating another one. The issue can also sometimes be due to using too many workers; try reducing the number of workers, which may fix it. See this thread for suggestions from other users: https://github.com/keras-team/keras/issues/11101 (comment posted Dec 03 '20 at 09:54)
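One further possibility, offered as a guess rather than a confirmed diagnosis: the callback in the question writes to the fixed name 'best_model.h5', so if all pool workers share a working directory they may be truncating and rewriting the same file at the same time. A minimal sketch of giving each worker its own checkpoint file follows. The names `checkpoint_path`, `train_one`, the `checkpoints` directory, and the commented-out `build_model()` and data variables are hypothetical and not from the question.

```python
import os


def checkpoint_path(model_id, directory="checkpoints"):
    """Return a checkpoint filename unique to this model and worker process."""
    os.makedirs(directory, exist_ok=True)
    filename = f"best_model_{model_id}_pid{os.getpid()}.h5"
    return os.path.join(directory, filename)


def train_one(model_id):
    """Train one model in a pool worker, saving to its own checkpoint file."""
    # Imported inside the worker so each process initialises TensorFlow itself.
    from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

    path = checkpoint_path(model_id)
    es = EarlyStopping(monitor="val_mse", mode="min", patience=10)
    mc = ModelCheckpoint(path, monitor="val_mse", mode="min",
                         save_best_only=True)

    # Placeholders for the asker's own model and data (hypothetical names):
    # model = build_model()
    # model.fit(x, y, validation_data=(x_val, y_val), callbacks=[es, mc])
    return path
```

With this in place, the pool call would look something like `pool.map(train_one, range(n_models))`, and each returned path points at a file that no other worker writes to. If the error still appears with distinct filenames, the cause lies elsewhere and the linked thread is the better lead.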
