1

I am facing a peculiar problem, where I run a job that performs deep learning model training under tensorflow and that job dies prematurely without any apparent warning. There is not syntactical error in the code, but the job running it dies after half hour from the start of the job. The syslog does not show anything pertaining to the problem I am facing, but I do see large gaps in syslog timestamp around the time my job fails.

I am connected to a Ubuntu 18.04 LTS server through ssh. Even when I logout or stay connected to the server, my job dies after 30-40 mins. The one thing that I see consistently in the syslog around the time my job fails is the Airflow_Temperature_Cel warning.

I'd appreciate any help regarding this freak issue I am facing.

user3317287
  • 429
  • 1
  • 5
  • 14

0 Answers0