I trained an object detection model with TensorFlow (MobileNetV3 feature extractor) on around 1000 images (each image contains multiple objects), with 80 classes and 1024x1024 pixels per image. I trained for 50000 steps (the log below reports global step 50000). When training finished, this was the output:

INFO:tensorflow:global step 50000: loss = 0.2869 (1.634 sec/step)
I0511 21:57:51.769988 140317106508416 learning.py:512] global step 50000: loss = 0.2869 (1.634 sec/step)
INFO:tensorflow:Stopping Training.
I0511 21:57:51.777392 140317106508416 learning.py:769] Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
I0511 21:57:51.777826 140317106508416 learning.py:777] Finished training! Saving model to disk.
INFO:tensorflow:Recording summary at step 50000.
I0511 21:57:52.218328 140281886275328 supervisor.py:1050] Recording summary at step 50000.
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
  warnings.warn("Attempting to use a closed FileWriter. "
Traceback (most recent call last):
  File "train.py", line 186, in <module>
    tf.app.run()
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "train.py", line 182, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/legacy/trainer.py", line 415, in train
    saver=saver)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tf_slim/learning.py", line 782, in train
    ignore_live_threads=ignore_live_threads)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/training/supervisor.py", line 839, in stop
    ignore_live_threads=ignore_live_threads)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 397, in join
    " ".join(stragglers))
RuntimeError: Coordinator stopped with threads still running: QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany

I want to evaluate the trained model. When I run the eval.py script, it fails with this error:

INFO:tensorflow:Restoring parameters from ./training-3/model.ckpt-50000
I0512 06:50:41.675391 139735538098816 saver.py:1284] Restoring parameters from ./training-3/model.ckpt-50000
2023-05-12 06:50:46.295349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2023-05-12 06:50:47.141975: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.146036: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.149937: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.154513: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.212965: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-05-12 06:50:47.225309: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2023-05-12 06:50:47.233639: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
INFO:tensorflow:# success: 0
I0512 06:50:47.259221 139735538098816 eval_util.py:378] # success: 0
INFO:tensorflow:# skipped: 0
I0512 06:50:47.259403 139735538098816 eval_util.py:379] # skipped: 0
W0512 06:50:47.259688 139735538098816 object_detection_evaluation.py:1286] The following classes have no ground truth examples: [  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
 109 110 111 112 113 114 115 116 117 118]
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/utils/metrics.py:145: RuntimeWarning: invalid value encountered in true_divide
  num_images_correctly_detected_per_class / num_gt_imgs_per_class)
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/utils/object_detection_evaluation.py:1337: RuntimeWarning: Mean of empty slice
  mean_ap = np.nanmean(self.average_precision_per_class)
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/utils/object_detection_evaluation.py:1338: RuntimeWarning: Mean of empty slice
  mean_corloc = np.nanmean(self.corloc_per_class)
Traceback (most recent call last):
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
  (0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node FeatureExtractor/MobilenetV3/Conv/Conv2D}}]]
     [[Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Reshape_88/_1017]]
  (1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node FeatureExtractor/MobilenetV3/Conv/Conv2D}}]]
0 successful operations.
0 derived errors ignored.

It is followed by a second traceback:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "eval.py", line 142, in <module>
    tf.app.run()
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)

I can't include the remaining lines of this traceback because the post would be flagged as spam.

Does anyone know what causes this and how to fix it?

Additional information: I am using TensorFlow 1.15 (GPU build) and Python 3.7.16.
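
One thing that may be worth ruling out (a sketch under an assumption, not a confirmed fix): CUBLAS_STATUS_NOT_INITIALIZED and "Could not create cudnn handle" are commonly reported when the GPU has little or no free memory, for example while another process such as a previous training run or a notebook kernel is still holding it. The logs above do not confirm that this is the cause here. Assuming it is, setting the TF_FORCE_GPU_ALLOW_GROWTH environment variable before TensorFlow is imported asks TensorFlow to allocate GPU memory on demand instead of reserving most of it up front. The GPU index "0" below is a placeholder for this machine.

```python
import os

# Assumption: the cuBLAS/cuDNN handle errors are caused by a lack of free
# GPU memory. The logs in this post do not prove that.

# Must be set before TensorFlow is imported, because TensorFlow reads it
# when it initializes the GPU device.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Optional: restrict the process to a single GPU.
# "0" is a placeholder index, not a value taken from the logs.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print("TF_FORCE_GPU_ALLOW_GROWTH =", os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
print("CUDA_VISIBLE_DEVICES =", os.environ["CUDA_VISIBLE_DEVICES"])
```

These lines would go at the very top of eval.py, before the TensorFlow import. Checking nvidia-smi for a leftover train.py process that still occupies the GPU is a complementary check.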
