
I have been training a convolutional neural network I developed using TensorFlow on Google Cloud ML Engine for a while, but it's not working. My job runs successfully until it reaches the point in my code where I run the training ops, eval ops, etc. on my CNN model, and then it just stays stuck there for some unknown reason.

This is how I'm submitting the job:

gcloud ml-engine jobs submit training "test_job9" --job-dir gs://project/output --runtime-version 1.2 --module-name trainer.train_cnn_cloud --package-path ./trainer --region us-east1 --scale-tier BASIC_GPU -- --trainpath gs://project/data/train.tfrecords --evalpath gs://project/data/valid.tfrecords

Here's how my CPU and memory utilization look while the job is running:

[screenshot: CPU and memory utilization during the running job]

Here's how it looked for a past job after I canceled it:

[screenshot: CPU and memory utilization for the canceled job]

Here are the logs from the past job:

[screenshot: logs from the past job]

As you can see, all the input ops are placed on the CPU, while the variables used in my CNN model, which I explicitly assigned to the GPU in my code, are indeed placed on the GPU.
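For context, the device placement in my model code looks roughly like this (a simplified sketch; the actual variable names, shapes, and input pipeline in my model differ):

    import tensorflow as tf

    # Model variables are created under an explicit GPU device scope,
    # so they show up on /gpu:0 in the device-placement logs.
    with tf.device('/gpu:0'):
        conv1_weights = tf.get_variable(
            'conv1/weights', shape=[5, 5, 1, 32],
            initializer=tf.truncated_normal_initializer(stddev=0.1))
        conv1_biases = tf.get_variable(
            'conv1/biases', shape=[32],
            initializer=tf.zeros_initializer())

    # The input-reading ops are left on the default device, which ends up
    # being the CPU on the BASIC_GPU machine.
    inputs = tf.placeholder(tf.float32, shape=[None, 28, 28, 1], name='inputs')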

You can see where it gets stuck in the code based on the logging I used:

 logging.info("\nStart training (save checkpoints in %s)\n" % self.config['jobdir'])
                for _ in range(self.config['numepochs']):
                    sess.run(train_iterator.initializer)
                    while True:
                        try:
                            start_time = time.time()
                            train_input_vals, train_label_vals = sess.run([train_features['input'], train_labels])
                            logging.info('Training vals retrieved')
                            feed_dict = {m._inputs:train_input_vals,
                                         m._labels:train_label_vals,
                                         m.batch_size:self.config['train_batch_size'],
                                         m.dropout_keep:self.config['dropout']}
                            _, loss_value, eval_ops, predictions, current_lr, curlabels= sess.run([m._train_op, m._total_loss,
                                                                                                   m._eval_op, m._predictions,
                                                                                                   m._learning_rate,
                                                                                                   m._labels_class1],
                                                                                                  feed_dict)
                            logging.info('loss retrieved')
                            global_step += 1

It gets stuck right after retrieving the inputs, just before the sess.run that executes the train and eval ops.

The code runs successfully on my laptop. Of note, I ran the code on my laptop with Python 3.6 on Windows 10, while Cloud ML Engine uses Python 2.7 on Ubuntu. I had an error because of that in a past run, but I think adding __future__ imports fixed it.
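For completeness, these are the kind of __future__ imports I mean, at the top of each module (the standard Python 2/3 compatibility imports; the exact set in my code may differ):

    from __future__ import absolute_import
    from __future__ import division
    from __future__ import print_function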

Thanks a lot for looking into this!
