I initially started training on the free tier of Google Colab. When I started another training run, I upgraded to Colab Pro for better efficiency. I was offered an A100 GPU, so I simply ran the old code and tried to start the training. It then showed an error regarding the PyTorch version. I tried to fix it, and that error seems to be gone. However, although the training process now starts as expected, it suddenly stops without any specific error message. I have checked the system resources and they seem to be sufficient for the training process. I have also provided the log in a previous message.

What could be the cause of the issue, and how can I fix it to resume training? Any help or suggestions would be greatly appreciated!

Initially, when I used the old code, the error said the PyTorch version was incorrect, so I changed the install commands from

    !pip install torch==1.8.1 torchvision==0.9.1
    !git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
    !pip install ninja

to

    !pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
    !git clone https://github.com/NVlabs/stylegan2-ada-pytorch.git
    !pip install ninja
    !apt install imagemagick
    !pip uninstall setuptools
    !pip install setuptools==59.5.0

After that, the training seemed to start, but it just stopped after tick 0. Here is the last part of the log:

    tick 0 kimg 0.0 time 1m 13s sec/tick 6.2 sec/kimg 771.74 maintenance 66.4 cpumem 6.29 gpumem 34.71 augment 0.000
    Evaluating metrics...
    /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
      return forward_call(*input, **kwargs)
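In case it helps with diagnosis, here is a small check that could be run in the same Colab session to confirm that the installed PyTorch build actually ships kernels for the A100. My understanding is that the A100 reports compute capability 8.0 (sm_80), which needs a CUDA 11 build, and that the default torch==1.8.1 wheel is built against CUDA 10.2. This is only a sketch: the helper names `parse_arch` and `build_supports` are made up for illustration, and the check ignores PTX (`compute_XY`) entries, which may be JIT-compiled for newer GPUs.

```python
def parse_arch(arch):
    """Turn an arch string such as 'sm_80' into a (major, minor) tuple."""
    digits = arch.split("_", 1)[1]
    return int(digits[:-1]), int(digits[-1])


def build_supports(capability, arch_list):
    """Return True if the build has native kernels usable on this GPU.

    Binary kernels compiled for sm_XY run on devices with the same major
    version X and a minor version >= Y. PTX entries are skipped here.
    """
    major, minor = capability
    for arch in arch_list:
        if not arch.startswith("sm_"):
            continue  # skip 'compute_XY' (PTX) entries in this simple check
        a_major, a_minor = parse_arch(arch)
        if a_major == major and a_minor <= minor:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch", torch.__version__, "built for CUDA", torch.version.cuda)
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            archs = torch.cuda.get_arch_list()
            print(torch.cuda.get_device_name(0), "capability", cap)
            print("compiled archs:", archs)
            print("native kernels for this GPU:", build_supports(cap, archs))
        else:
            print("CUDA is not available to PyTorch")
```

If the last line prints `False`, the wheel and the GPU do not match. If it prints `True`, the version fix is probably fine and the stop after tick 0 has some other cause.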

Bryan Y
