
Project Details

I am running the open-source code of a GAN-based research paper named "Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition"
source code: here
The dependencies include:

  • Python 2.7
  • TensorFlow 1.4.0

I pulled a Docker image of TensorFlow 1.4.0 with Python 2.7 onto my GPU virtual machine (connected over SSH) with this command:

docker pull tensorflow/tensorflow:1.4.0-gpu

I am running

bash rsrgan/run_gan_rnn_placeholder.sh

according to the README of the source code.

Issue Details

Everything is working: the model is training and the loss is decreasing. The only issue is that after some iterations the terminal shows no more output; the GPU still shows the PID, no memory is freed, and sometimes GPU utilization drops to 0%. Training on the VM's GPU and on its CPU behaves the same way. It is not a memory issue, because the model's GPU memory usage is 5,400 MB out of 11,000 MB, and the CPU RAM is also very large.

nvidia-smi Output

When I ran 21 iterations on my local computer (1st-gen i5, 4 GB RAM), with each iteration taking 0.09 hours, all iterations completed. But whenever I run it over SSH inside Docker, the issue happens again and again, on both GPU and CPU. Keep in mind that the issue happens inside Docker on a machine connected over SSH, and the SSH connection does not disconnect very often.

Exact Numbers

If an iteration takes 1.5 hours, the issue happens after two to three iterations; if a single iteration takes 0.06 hours, the issue happens exactly after 14 of 25 iterations.

Ahwar

1 Answer


Perform operations inside the Docker container

The first thing you can try is to build the Docker image and then enter the Docker container by passing the -it flag and /bin/bash in your docker run command.
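For example, a minimal sketch of entering the image pulled above interactively (the --runtime=nvidia flag assumes nvidia-docker is set up on the VM):

$ docker run --runtime=nvidia -it tensorflow/tensorflow:1.4.0-gpu /bin/bash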

Clone the repository inside the container, and while building the image you should also copy your training data from your local machine into the container. Run the training there and commit the changes so that you don't need to repeat these steps in future runs; after you exit the container, all changes are lost if they are not committed.
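For example, if the training data is already on the VM, one way to get it into a running container is docker cp (both paths below are placeholders):

$ docker cp /path/to/training_data <container-id>:/root/training_data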

You can find the reference for docker commit here.

$ docker commit <container-id> <image-name:tag>

While training is going on, check the GPU and CPU utilization of the VM and see if everything is working as expected.
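For example, from a second SSH session you can monitor the GPU and the container continuously:

$ watch -n 5 nvidia-smi

$ docker stats <container-id>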

Use an Anaconda environment on your VM

Anaconda is a great package manager. You can install Anaconda, create a virtual environment, and run your code inside that environment.

$ wget <url_of_anaconda.sh>

$ bash <path_to_sh>

$ source anaconda3/bin/activate or source anaconda2/bin/activate

$ conda create -n <env_name> python=2.7

$ conda activate <env_name>

Install all the dependencies via conda (recommended) or pip.
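For example, a sketch assuming the paper's listed TensorFlow version and that a matching wheel is still available for your platform (TensorFlow 1.4 also expects the corresponding CUDA/cuDNN libraries on the machine):

$ pip install tensorflow-gpu==1.4.0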

Run your code.

Q1: GAN training with TensorFlow 1.4 inside Docker stops without any output

Although Docker provides OS-level virtualization, inside Docker we sometimes face issues running processes that run with ease on the host system. So to debug the issue, you should go inside the container and perform the steps above.

Q2: Training stops without releasing memory on a VM connected over SSH

Yes, this is an issue I have also faced earlier. The best way to release the memory is to stop the Docker container. You can find more resource allocation options here.
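For example, stopping (and optionally removing) the container releases the GPU memory held by the hung process (the container id is a placeholder):

$ docker stop <container-id>

$ docker rm <container-id>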

Also, earlier versions of TensorFlow had issues with allocating and clearing memory properly. You can find some references here and here. These issues have been fixed in recent versions of TensorFlow.


Additionally, check for Nvidia bug reports

Step 1: Install nvidia-utils via the following command. You can find the driver version in the nvidia-smi output (also mentioned in the question).

$ sudo apt install nvidia-utils-<driver-version>

Step 2: Run the nvidia-bug-report.sh script

$ sudo /usr/bin/nvidia-bug-report.sh

The log file will be generated in your current working directory with the name nvidia-bug-report.log.gz. You can also find the installer log at /var/log/nvidia-installer.log.
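For example, to inspect the generated report and the installer log:

$ zcat nvidia-bug-report.log.gz | less

$ less /var/log/nvidia-installer.log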

You can find additional information about Nvidia logs at these links:

Hope this helps.

Rishab P
  • Thanks for giving an answer, but I had already done all these steps before posting the question. You suggested running Docker; I have already run the Docker image **tensorflow/tensorflow:1.4.0-gpu** with docker run --runtime=nvidia, and inside Docker the training stops at a specific iteration, leaving the memory consumption unchanged and the GPU utilization at 0%, as shown in the picture in the question – Ahwar Apr 05 '20 at 07:41
  • Hi! I thought you had made a Dockerfile to perform all the steps and were then running it. But actually you are running the bash script after going inside the container manually, right? Also, have you tried running the script in a conda environment? – Rishab P Apr 05 '20 at 08:13
  • I am doing this project for a client. They say I must run this code inside a Docker container with TensorFlow 1.4 on an Nvidia GPU; I can't run it with Anaconda or outside the Docker container. – Ahwar Apr 05 '20 at 11:11
  • Have you tried running the bash script manually after entering the container, to check for error logs if any? Also, how are you debugging the code? – Rishab P Apr 05 '20 at 13:57
  • The code is working as expected and iterations are happening, but it gets stuck after some iterations. If it is an Nvidia issue, is there a way I can see any log files for the TensorFlow training or the GPU? – Ahwar Apr 06 '20 at 10:26
  • You can get some help from this [link](https://unix.stackexchange.com/questions/252590/how-to-log-gpu-load). – Rishab P Apr 06 '20 at 10:44
  • Thanks a lot, I have learnt many things from your comments. The links above are very useful, but they only log the usage and memory consumption. What I meant was: if my TF model training stops at some point, is there a way to see TF or Nvidia log files to find out what caused the error, or what was happening when it stopped? – Ahwar Apr 06 '20 at 14:27
  • Hi! I have added a few details about Nvidia's log files in the answer. Hope you will find them useful. Feel free to drop a comment below. – Rishab P Apr 06 '20 at 16:19