
I am trying to transfer-learn a pretrained MobileNet model on a c5.large instance (AWS).

I first train (burn in) the last dense layer for a couple of epochs (I tried between 5 and 20; it does not seem to matter much).

After the burn-in period, I want to train the full model. However, training stops after a couple of epochs without an error.

Earlier I tried without the burn-in period, and that worked "fine-ish": it would typically crash the server after ~50 epochs (which is why I added the clipnorm, which did help a bit).

Any ideas on how to debug this are welcome.

Console Output:

Total params: 3,239,114
Trainable params: 3,217,226
Non-trainable params: 21,888
_________________________________________________________________
Epoch 6/25

 1/46 [..............................] - ETA: 9:22 - loss: 0.2123
 2/46 [>.............................] - ETA: 7:46 - loss: 0.2028ubuntu@ip-XXX:~$ ls

Training Code:

# standard imports; project-specific helpers (_mobilenet, _DataGenerator,
# TensorBoardBatch, load_json, pretrained, options) come from the project itself
import datetime
import os

from keras.callbacks import ModelCheckpoint
from keras.layers import Dense, Dropout
from keras.models import Model
from keras.optimizers import Adam
from sklearn.model_selection import train_test_split

base_model = _mobilenet.MobileNet(
    input_shape=(224, 224, 3), include_top=False, pooling="avg"
)
if not options.mobile_net_weights:
    pretrained_weights = os.path.join(
        os.path.dirname(pretrained.__file__), "weights_mobilenet_aesthetic_0.07.hdf5"
    )
    base_model.load_weights(pretrained_weights, by_name=True)
# add dropout and dense layer
x = Dropout(0.6)(base_model.output)
x = Dense(units=classes, activation=last_activation)(x)

pretrained_model = Model(base_model.inputs, x)

# start training only dense layers
for layer in base_model.layers:
    layer.trainable = False

pretrained_model.compile(loss=loss, optimizer=Adam(lr=0.001, decay=0, clipnorm=1.0))

pretrained_model.summary()

# add path equal to image_id
labels = [dict(item, **{"path": item["image_id"]}) for item in load_json(labels_path)]
training, validation = train_test_split(labels, test_size=0.05, shuffle=True)

train_data_gen = _DataGenerator(
    training,
    batch_size=options.batch_size,
    base_dir=options.image_path,
    n_classes=classes,
    basenet_preprocess=_mobilenet.preprocess_input,
)

validation_data_gen = _DataGenerator(
    validation,
    batch_size=options.batch_size,
    base_dir=options.image_path,
    n_classes=classes,
    basenet_preprocess=_mobilenet.preprocess_input,
    training=False,
)

train_job_dir = f"train_jobs/{datetime.datetime.now().isoformat()}"
train_job_dir = os.path.join(options.results_path, train_job_dir)

tensorboard = TensorBoardBatch(log_dir=os.path.join(train_job_dir, "logs"))

model_save_name = "weights_{epoch:02d}_{val_loss:.3f}.hdf5"
model_file_path = os.path.join(train_job_dir, "weights", model_save_name)

if not os.path.exists(os.path.join(train_job_dir, "weights")):
    os.makedirs(os.path.join(train_job_dir, "weights"))
model_checkpointer = ModelCheckpoint(
    filepath=model_file_path,
    monitor="val_loss",
    verbose=1,
    save_best_only=True,
    save_weights_only=True,
)

pretrained_model.fit_generator(
    train_data_gen,
    # integer division: steps_per_epoch should be an int, not a float
    steps_per_epoch=len(training) // options.batch_size // 10,
    epochs=5,
    verbose=1,
    callbacks=[tensorboard, model_checkpointer],
    validation_data=validation_data_gen,
    validation_steps=len(validation) // options.batch_size,
)


# start training all layers
for layer in base_model.layers:
    layer.trainable = True

pretrained_model.compile(
    loss=loss, optimizer=Adam(lr=0.0001, decay=0.000023, clipnorm=1.0)
)

pretrained_model.summary()

pretrained_model.fit_generator(
    train_data_gen,
    # integer division here as well
    steps_per_epoch=len(training) // options.batch_size // 10,
    epochs=25,
    initial_epoch=5,
    verbose=1,
    callbacks=[tensorboard, model_checkpointer],
    validation_data=validation_data_gen,
    validation_steps=len(validation) // options.batch_size,
)

Update and followup

The original problem seems to have been caused by too little available memory on the machine. I do have a somewhat unrelated, yet related, question though: when trying to use GPU acceleration I have been banging my head against the wall, as I can't seem to get it working.

Is there any good (logically structured and easy-to-follow) information out there on how one would:

  • Use Docker on a local machine (to build a GPU-enabled image)
  • Install all the relevant (nvidia-)drivers on the GPU instance (what an insane version chaos)
  • Run the Docker container (nvidia-docker2, nvidia-docker, or --runtime=nvidia ?? )
  • What the hell is CUDA and why do I need it?
  • Some sources that I found suggested running CUDA in Docker, why?

When it seemed like I had gotten some of it working (i.e. set up drivers, some version) and had managed to build a GPU-enabled (i.e. tensorflow-gpu) Docker image, I got this error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:413: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=10.0 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=410,driver<411 --pid=2113 /var/lib/docker/overlay2/4bf49d2555c40278b3249f73bf3d33484181f51b374b77b69a474fc39e37441b/merged]\\nnvidia-container-cli: requirement error: unsatisfied condition: driver >= 410\\n\\"\"": unknown.

    If I had to guess, I would say that you ran out of memory. Could you try to allocate more? – rvinas Jul 31 '19 at 16:24
  • ya, that suspicion has been creeping up on me also. Sometimes I get std::bad_alloc or so errors. I didn't really want to ramp up a different instance tbh. – Fabian Bosler Jul 31 '19 at 17:00
  • Does reducing the batch size help? – rvinas Jul 31 '19 at 19:21
  • It's actually already fairly low, but I went with the larger memory size now. Let's see how that goes – Fabian Bosler Jul 31 '19 at 19:36
  • Maybe something with `Tensorboard` and lacking disk memory after saving data (doubt it's MobileNet as it's a tiny model)? It looks as though it crashes on the start of the epoch, maybe that's related? – Szymon Maszke Jul 31 '19 at 22:38
  • It turns out it was a memory issue indeed (moved to a larger instance). Somebody can write up an answer if they want. I'll add a bonus question though :) – Fabian Bosler Aug 01 '19 at 06:09
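A note for anyone hitting the same silent stop: when the Linux OOM killer terminates a process, Python gets no chance to print a traceback, which matches dropping straight back to the shell prompt mid-epoch (and the occasional std::bad_alloc). A minimal check with standard Linux tools, nothing project-specific assumed:

# did the kernel OOM-kill the training process? (may need sudo)
dmesg | grep -iE "out of memory|oom-killer|killed process"

# watch host memory from another shell while training runs
watch -n 5 free -h

If the OOM killer shows up in dmesg, a smaller batch size or a larger instance (which ended up being the fix here) is the usual remedy.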

3 Answers


Let me give you a simple solution to your HUGE problem (now that you have solved the memory issue; though I think training only 3 million params on a large instance should not give you problems):

Install Conda.

So, what's happening here is that the CUDA version in your Docker image is not compatible with your NVIDIA drivers, or vice versa. Installing CUDA is a painful process (I think many people can relate here). But you can install CUDA-compatible versions of TensorFlow and PyTorch easily using conda.

Here is a personal set of commands that I use every time I set up a cloud instance:

For Python 2.x:

wget https://repo.anaconda.com/miniconda/Miniconda2-latest-Linux-x86_64.sh

bash Miniconda2-latest-Linux-x86_64.sh

(if conda is not found, start a new shell with 'bash', then run 'conda --version' to check)

conda install numpy
conda install tensorflow-gpu

For Python 3:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda create -n TFGPU -c defaults tensorflow-gpu
conda activate TFGPU
conda install pytorch torchvision cudatoolkit=9.0 -c pytorch
conda install jupyter
conda install keras

You can verify the setup from the console:

$ python3
>>> import tensorflow as tf
>>> sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

This should probably fix all your errors.

Otherwise, if there are NVIDIA driver problems, you can install CUDA (which bundles a matching driver and nvidia-smi) manually:

#!/bin/bash
echo "Checking for CUDA and installing."
# Check for CUDA and try to install.
if ! dpkg-query -W cuda-9-0; then
  # The 16.04 installer works with 16.10.
  curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  dpkg -i ./cuda-repo-ubuntu1604_9.0.176-1_amd64.deb
  apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
  apt-get update
  apt-get install cuda-9-0 -y
fi
# Enable persistence mode
nvidia-smi -pm 1

nvidia-container-cli: requirement error: unsatisfied condition: driver >= 410

The CUDA/Driver/GPU compatibility matrix is available here: https://github.com/NVIDIA/nvidia-docker/wiki/CUDA#requirements
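As a first check against that matrix, print the driver version actually installed on the host (assuming the driver, and hence nvidia-smi, is present):

# show the installed NVIDIA driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

A CUDA 10.0 image requires driver >= 410, which is exactly the condition the container failed on above.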

Regarding your questions:

Docker on a local machine (to build a GPU-accelerated enabled image)

I usually install docker-ce (Community edition) on my Ubuntu machine. The instructions here are straightforward: https://docs.docker.com/install/linux/docker-ce/ubuntu/
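For a quick test box, Docker's convenience script is a shortcut to the repository setup described in those docs (fine for dev machines; use the repository route for production):

# convenience install
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# sanity check
sudo docker run hello-world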

Install all the relevant (nvidia-)drivers on the GPU instance (what an insane version chaos)

It's better to install nvidia-drivers and CUDA in one go by downloading the installer for your OS from here. This way you wouldn't encounter CUDA-driver mismatch issues. https://developer.nvidia.com/cuda-downloads (e.g. on Ubuntu, sudo apt-get install cuda)
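A rough sketch of that route on Ubuntu, assuming the CUDA apt repository from the downloads page has already been configured:

# installs the CUDA toolkit together with a matching NVIDIA driver
sudo apt-get update
sudo apt-get install cuda

# reboot, then confirm the driver and GPU are visible
nvidia-smi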

Run the Docker container (nvidia-docker2, nvidia-docker or --runtime==nvidia ?? )

nvidia-docker2 supersedes nvidia-docker. It maps the GPU device (/dev/nvidiaX) into the Docker container and sets the runtime to nvidia. --runtime=nvidia is required only when you invoke plain docker yourself (the nvidia-docker wrapper sets it for you). If you're using Docker version 19.03 or later, nvidia-docker2 is not required at all, as mentioned in the Quickstart section here: https://github.com/NVIDIA/nvidia-docker/. An example of each invocation is sketched below.
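To make the options concrete, here is roughly the same GPU smoke test under each scheme (the CUDA image tag is just an example):

# legacy nvidia-docker (v1) wrapper
nvidia-docker run --rm nvidia/cuda:9.0-base nvidia-smi

# nvidia-docker2: plain docker with the nvidia runtime selected
docker run --rm --runtime=nvidia nvidia/cuda:9.0-base nvidia-smi

# Docker 19.03+: native --gpus flag, no nvidia-docker2 needed
docker run --rm --gpus all nvidia/cuda:9.0-base nvidia-smi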

What the hell is Cuda and why do I need it?

The CUDA Toolkit is the parallel programming toolkit created by NVIDIA for GPU programming. Deep learning frameworks like TensorFlow and PyTorch use it internally to run your code (model training) on the GPU.

https://devblogs.nvidia.com/even-easier-introduction-cuda/
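In practice you rarely write CUDA code yourself for deep learning; what matters is that your framework build is linked against it. A quick check using the TF 1.x API seen elsewhere in this thread:

# was this TensorFlow build compiled against CUDA, and can it see a GPU?
python3 -c "import tensorflow as tf; print(tf.test.is_built_with_cuda()); print(tf.test.is_gpu_available())"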

Some sources that I found suggested to run Cuda in Docker, why?

As mentioned in the previous answer, since the DL frameworks execute within the Docker container, the CUDA toolkit also needs to be present inside the container so the frameworks can use it. The block diagram available at this link is very helpful for visualizing this: https://github.com/NVIDIA/nvidia-docker/ https://cloud.githubusercontent.com/assets/3028125/12213714/5b208976-b632-11e5-8406-38d379ec46aa.png . The GPU driver sits on top of the host OS, while the Docker container hosts the CUDA toolkit and the applications (model training/inference code written in a DL framework like TensorFlow or PyTorch), as the sketch below illustrates.
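That split is easy to observe on a host that only has the driver installed (the image tag is again just an example):

# the CUDA runtime libraries are baked into the image...
docker run --rm --gpus all nvidia/cuda:10.0-runtime sh -c "ldconfig -p | grep -i libcudart"

# ...whereas nvidia-smi talks to the host's kernel driver
docker run --rm --gpus all nvidia/cuda:10.0-runtime nvidia-smi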

  • Thank you for addressing my rant 1-by-1 :) A couple of follow-ups: Building the image locally, is it a problem that I don't have the required hardware locally (and can the image still be built, although it can't be run)? Could you elaborate a little more on "If you're using Docker version 19.03 or later then nvidia-docker2 is not required as mentioned in the Quickstart section here: https://github.com/NVIDIA/nvidia-docker/"? It says nvidia-docker2 is deprecated... so what is what here? Love the block diagram, that really helps. – Fabian Bosler Aug 02 '19 at 11:53
  • :). Yes, 'docker build' could be done on a machine without GPU. nvidia-docker2 is not required because GPU devices can be directly specified in docker run command like this: $ docker run --gpus 2 nvidia/cuda:9.0-base nvidia-smi – Manoj Mohan Aug 02 '19 at 12:28
  • Thx! You seem to have a solid grasp of the topic. Would you mind extending your answer above with regard to best practices (especially where there are multiple options)? – Fabian Bosler Aug 02 '19 at 12:32

What is the size of the dataset you are retraining with transfer learning? I had the same problem on that instance; reducing the batch size solved my problem.
