INFO:tensorflow:Timed-out waiting for a checkpoint

Question

I'm running on a MacBook Pro with an old GPU (NVIDIA GeForce 9600M GT 512 MB) with CUDA 4.5 on OS X 10.11.6. (TensorFlow requires CUDA 7.5 or greater to use a GPU.)

I got this error training a Magenta model in TensorFlow:

INFO:tensorflow:Timed-out waiting for a checkpoint.

Here's my command and output.

$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" --num_training_steps=20000 --eval
INFO: Found 1 target...
Target //magenta/models/melody_rnn:melody_rnn_train up-to-date:
  bazel-bin/magenta/models/melody_rnn/melody_rnn_train
INFO: Elapsed time: 0.561s, Critical Path: 0.09s

INFO: Running command line: bazel-bin/magenta/models/melody_rnn/melody_rnn_train '--config=attention_rnn' '--run_dir=/tmp/melody_rnn/logdir/run1' '--sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord' '--hparams=batch_size=10,rnn_layer_sizes=[64,64]' '--num_training_steps=20000' --eval
INFO:tensorflow:hparams = {'rnn_layer_sizes': [64, 64], 'attn_length': 40, 'dropout_keep_prob': 0.5, 'batch_size': 10, 'clip_norm': 3, 'learning_rate': 0.001}
INFO:tensorflow:[<tf.Tensor 'ParseSingleSequenceExample/ParseSingleSequenceExample:0' shape=(?, 74) dtype=float32>, <tf.Tensor 'ParseSingleSequenceExample/ParseSingleSequenceExample:1' shape=(?,) dtype=int64>, <tf.Tensor 'strided_slice:0' shape=() dtype=int32>]
INFO:tensorflow:Train dir: /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Eval dir: /tmp/melody_rnn/logdir/run1/eval
INFO:tensorflow:Counting records in /Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord.
INFO:tensorflow:Total records: 46
INFO:tensorflow:Waiting for new checkpoint at /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Timed-out waiting for a checkpoint.
David-Laxers-MacBook-Pro:magenta davidlaxer$

What is the cause of this error?

I also tried adjusting the timeout:

$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" **--save_summaries_secs=10000 --save_interval_secs=10000** --num_training_steps=20000 --eval

I removed the --eval directive and it started to train the model:

$ ls -l /tmp/melody_rnn/logdir/run1/train/
total 11032
-rw-r--r--  1 davidlaxer  wheel      149 Jul 20 16:04 checkpoint
-rw-r--r--  1 davidlaxer  wheel  2438765 Jul 20 16:04 events.out.tfevents.1500591842.David-Laxers-MacBook-Pro.local
-rw-r--r--  1 davidlaxer  wheel  1300637 Jul 20 16:04 graph.pbtxt
-rw-r--r--  1 davidlaxer  wheel  1226008 Jul 20 16:04 model.ckpt-0.data-00000-of-00001
-rw-r--r--  1 davidlaxer  wheel     1727 Jul 20 16:04 model.ckpt-0.index
-rw-r--r--  1 davidlaxer  wheel   667410 Jul 20 16:04 model.ckpt-0.meta


$ bazel run //magenta/models/melody_rnn:melody_rnn_train -- --config=attention_rnn --run_dir=/tmp/melody_rnn/logdir/run1 --sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord --hparams="batch_size=10,rnn_layer_sizes=[64,64]" **--save_summaries_secs=10000 --save_interval_secs=10000** --num_training_steps=20000 
Killed non-responsive server process (pid=65119)
.
INFO: Found 1 target...
Target //magenta/models/melody_rnn:melody_rnn_train up-to-date:
  bazel-bin/magenta/models/melody_rnn/melody_rnn_train
INFO: Elapsed time: 9.400s, Critical Path: 0.65s

INFO: Running command line: bazel-bin/magenta/models/melody_rnn/melody_rnn_train '--config=attention_rnn' '--run_dir=/tmp/melody_rnn/logdir/run1' '--sequence_example_file=/Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord' '--hparams=batch_size=10,rnn_layer_sizes=[64,64]' '**--save_summaries_secs=10000' '--save_interval_secs=10000**' '--num_training_steps=20000'
INFO:tensorflow:hparams = {'rnn_layer_sizes': [64, 64], 'attn_length': 40, 'dropout_keep_prob': 0.5, 'batch_size': 10, 'clip_norm': 3, 'learning_rate': 0.001}
INFO:tensorflow:Counting records in /Users/davidlaxer/magenta/magenta/testdata/notesequences.tfrecord.
INFO:tensorflow:Total records: 46
INFO:tensorflow:[<tf.Tensor 'random_shuffle_queue_Dequeue:0' shape=(?, 74) dtype=float32>, <tf.Tensor 'random_shuffle_queue_Dequeue:1' shape=(?,) dtype=int64>, <tf.Tensor 'random_shuffle_queue_Dequeue:2' shape=() dtype=int32>]
INFO:tensorflow:Train dir: /tmp/melody_rnn/logdir/run1/train
INFO:tensorflow:Starting training loop...
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Error reported to Coordinator: <type 'exceptions.UnicodeDecodeError'>, 'utf8' codec can't decode byte 0xe0 in position 132: invalid continuation byte
INFO:tensorflow:Saving checkpoints for 0 into /tmp/melody_rnn/logdir/run1/train/model.ckpt.
Traceback (most recent call last):
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 112, in <module>
    console_entry_point()
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 108, in console_entry_point
    tf.app.run(main)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/melody_rnn/melody_rnn_train.py", line 104, in main
    checkpoints_to_keep=FLAGS.num_checkpoints)
  File "/private/var/tmp/_bazel_davidlaxer/182280691ad889ad33cd20c0640dc2b1/execroot/magenta/bazel-out/local-opt/bin/magenta/models/melody_rnn/melody_rnn_train.runfiles/__main__/magenta/models/shared/events_rnn_train.py", line 71, in run_training
    save_summaries_steps=summary_frequency)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/contrib/training/python/training/training.py", line 530, in train
    loss = session.run(train_op)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 521, in __exit__
    self._close_internal(exception_type)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 556, in _close_internal
    self._sess.close()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 791, in close
    self._sess.close()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 888, in close
    ignore_live_threads=True)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/training/queue_runner_impl.py", line 238, in _run
    enqueue_callable()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1063, in _single_operation_run
    target_list_as_strings, status, None)
  File "/Users/davidlaxer/anaconda/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 465, in raise_exception_on_not_ok_status
    compat.as_text(pywrap_tensorflow.TF_Message(status)),
  File "/Users/davidlaxer/anaconda/lib/python2.7/site-packages/tensorflow/python/util/compat.py", line 84, in as_text
    return bytes_or_text.decode(encoding)
  File "/Users/davidlaxer/anaconda/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 132: invalid continuation byte
ERROR: Non-zero return code '1' from command: Process exited with status 1.

score 2 · Accepted Answer · answered Jul 20 '17 at 17:26

When --eval is specified, you are running the evaluation job, not training. The eval job waits for checkpoints under the run_dir (here /tmp/melody_rnn/logdir/run1/train); if no checkpoint appears before it times out, it simply exits.
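
To illustrate the behaviour described above, here is a minimal sketch of a wait-with-timeout loop. It is not TensorFlow's or Magenta's actual code: the function name wait_for_checkpoint and its parameters are hypothetical, and the only detail taken from the question is that the train job writes a file named "checkpoint" into the train directory.

```python
import os
import time


def wait_for_checkpoint(train_dir, timeout_secs, poll_secs=0.1):
    """Poll train_dir until a 'checkpoint' index file appears or time runs out.

    Hypothetical helper for illustration only. Returns the path to the
    index file, or None if the timeout expires first.
    """
    index_path = os.path.join(train_dir, "checkpoint")
    deadline = time.time() + timeout_secs
    while True:
        if os.path.exists(index_path):
            return index_path  # a train job has saved at least once
        if time.time() >= deadline:
            return None  # the "timed out waiting for a checkpoint" case
        time.sleep(poll_secs)
```

With an empty train directory and no train job running, a loop like this can only end in the timeout branch, which matches the log in the question.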

yuefengz


Thank you! It started to train. Does this mean my sample file, magenta/magenta/testdata/notesequences.tfrecord, has a problem? UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 132: invalid continuation byte – dbl001 Jul 20 '17 at 17:33
Your data has no problem. The eval job expects checkpoints produced by the train job. You have to run the train job first, or at the same time as the eval job. – yuefengz Jul 20 '17 at 22:26
My question was not clear. Please review the latest edit to the question to see the UnicodeDecodeError. – dbl001 Jul 20 '17 at 23:08
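
On the UnicodeDecodeError: the traceback in the question shows it being raised inside compat.as_text while decoding the text of a TensorFlow status message, so the underlying error message itself contained a byte that is not valid UTF-8. The sketch below only reproduces that kind of decode failure and shows one way to still read such a message. It does not identify where the 0xe0 byte came from; the helper name decode_status_message and the sample bytes are made up for illustration.

```python
def decode_status_message(raw, encoding="utf-8"):
    """Decode bytes strictly; on failure, decode again with replacement chars.

    Hypothetical helper. Returns (text, error), where error is None when
    the strict decode succeeded.
    """
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as err:
        # Keep the readable part of the message instead of losing it.
        return raw.decode(encoding, errors="replace"), err


# Made-up message bytes: 0xe0 starts a multi-byte UTF-8 sequence, but the
# next byte ('.') is not a continuation byte, so strict decoding fails.
raw = b"Error while reading caf\xe0.tfrecord"
text, err = decode_status_message(raw)
```

Here err.reason is the same "invalid continuation byte" reported in the traceback, and text still contains the rest of the message, which is usually what you need to see the real underlying error.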
