I have been training a convolutional neural network I developed with TensorFlow on Google Cloud ML Engine
for a while, but it's not working. My job runs successfully until it reaches the point in my code where I run the training ops, eval ops, etc. on my CNN model, and then it just stays stuck there for some unknown reason.
This is how I'm submitting the job:
gcloud ml-engine jobs submit training "test_job9" \
    --job-dir gs://project/output \
    --runtime-version 1.2 \
    --module-name trainer.train_cnn_cloud \
    --package-path ./trainer \
    --region us-east1 \
    --scale-tier BASIC_GPU \
    -- \
    --trainpath gs://project/data/train.tfrecords \
    --evalpath gs://project/data/valid.tfrecords
Here's how my CPU and memory utilization look while the job is running:
Here's how it looked for a past job after I cancelled it:
Here are the logs from that past job:
As you can see, it places all input ops on the CPU, while the variables used in my CNN model, which I explicitly assigned to the GPU in my code, end up on the GPU as expected.
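For context, the placement pattern in my model code looks roughly like this (a minimal sketch with made-up variable names, not my actual layers):

import tensorflow as tf

# Input pipeline ops are left on the CPU; model variables are explicitly
# pinned to the GPU (variable names here are illustrative only).
with tf.device('/gpu:0'):
    conv1_weights = tf.get_variable(
        'conv1_weights', shape=[5, 5, 1, 32],
        initializer=tf.truncated_normal_initializer(stddev=0.1))
    conv1_biases = tf.get_variable(
        'conv1_biases', shape=[32], initializer=tf.zeros_initializer())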
You can see where it gets stuck in the code based on the logging I used:
logging.info("\nStart training (save checkpoints in %s)\n" % self.config['jobdir'])
for _ in range(self.config['numepochs']):
sess.run(train_iterator.initializer)
while True:
try:
start_time = time.time()
train_input_vals, train_label_vals = sess.run([train_features['input'], train_labels])
logging.info('Training vals retrieved')
feed_dict = {m._inputs:train_input_vals,
m._labels:train_label_vals,
m.batch_size:self.config['train_batch_size'],
m.dropout_keep:self.config['dropout']}
_, loss_value, eval_ops, predictions, current_lr, curlabels= sess.run([m._train_op, m._total_loss,
m._eval_op, m._predictions,
m._learning_rate,
m._labels_class1],
feed_dict)
logging.info('loss retrieved')
global_step += 1
It gets stuck right after retrieving the inputs, just before the sess.run call that executes the train and eval ops.
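For completeness, the train_iterator / train_features / train_labels objects referenced above are created roughly like this (a simplified sketch; the parsing function and feature shapes are placeholders, not my exact code):

# Simplified sketch of the input pipeline (TF 1.2 contrib Dataset API);
# the parse function and feature shapes below are placeholders.
def _parse_example(serialized):
    parsed = tf.parse_single_example(
        serialized,
        features={'input': tf.FixedLenFeature([784], tf.float32),
                  'label': tf.FixedLenFeature([], tf.int64)})
    return {'input': parsed['input']}, parsed['label']

train_dataset = tf.contrib.data.TFRecordDataset(self.config['trainpath'])
train_dataset = train_dataset.map(_parse_example)
train_dataset = train_dataset.batch(self.config['train_batch_size'])
train_iterator = train_dataset.make_initializable_iterator()
train_features, train_labels = train_iterator.get_next()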
The code runs successfully on my laptop. Of note, I used Python 3.6 to run the code on my laptop, which runs Windows 10, while gcloud ml-engine uses Python 2.7 and Ubuntu. I had an error because of that in a past run, but I think the use of __future__ imports fixed it.
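I believe I added the standard set of __future__ imports at the top of my modules, along these lines:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function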
Thanks a lot for looking into this!