I trained an object detection model with a MobileNetV3 backbone in TensorFlow, using around 1000 images (each image contains multiple objects) of 1024x1024 pixels and 80 classes. I trained for 50000 steps (the log below counts global steps, not epochs). When training finished, this was the output:
INFO:tensorflow:global step 50000: loss = 0.2869 (1.634 sec/step)
I0511 21:57:51.769988 140317106508416 learning.py:512] global step 50000: loss = 0.2869 (1.634 sec/step)
INFO:tensorflow:Stopping Training.
I0511 21:57:51.777392 140317106508416 learning.py:769] Stopping Training.
INFO:tensorflow:Finished training! Saving model to disk.
I0511 21:57:51.777826 140317106508416 learning.py:777] Finished training! Saving model to disk.
INFO:tensorflow:Recording summary at step 50000.
I0511 21:57:52.218328 140281886275328 supervisor.py:1050] Recording summary at step 50000.
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/summary/writer/writer.py:386: UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
warnings.warn("Attempting to use a closed FileWriter. "
Traceback (most recent call last):
File "train.py", line 186, in <module>
tf.app.run()
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "train.py", line 182, in main
graph_hook_fn=graph_rewriter_fn)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/legacy/trainer.py", line 415, in train
saver=saver)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tf_slim/learning.py", line 782, in train
ignore_live_threads=ignore_live_threads)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/training/supervisor.py", line 839, in stop
ignore_live_threads=ignore_live_threads)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/training/coordinator.py", line 397, in join
" ".join(stragglers))
RuntimeError: Coordinator stopped with threads still running: QueueRunnerThread-dummy_queue-sync_token_q_EnqueueMany
Now I want to evaluate the trained model. When I run the eval.py script, I get the following errors:
INFO:tensorflow:Restoring parameters from ./training-3/model.ckpt-50000
I0512 06:50:41.675391 139735538098816 saver.py:1284] Restoring parameters from ./training-3/model.ckpt-50000
2023-05-12 06:50:46.295349: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2023-05-12 06:50:47.141975: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.146036: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.149937: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.154513: E tensorflow/stream_executor/cuda/cuda_blas.cc:238] failed to create cublas handle: CUBLAS_STATUS_NOT_INITIALIZED
2023-05-12 06:50:47.212965: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-05-12 06:50:47.225309: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2023-05-12 06:50:47.233639: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
INFO:tensorflow:# success: 0
I0512 06:50:47.259221 139735538098816 eval_util.py:378] # success: 0
INFO:tensorflow:# skipped: 0
I0512 06:50:47.259403 139735538098816 eval_util.py:379] # skipped: 0
W0512 06:50:47.259688 139735538098816 object_detection_evaluation.py:1286] The following classes have no ground truth examples: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108
109 110 111 112 113 114 115 116 117 118]
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/utils/metrics.py:145: RuntimeWarning: invalid value encountered in true_divide
num_images_correctly_detected_per_class / num_gt_imgs_per_class)
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/utils/object_detection_evaluation.py:1337: RuntimeWarning: Mean of empty slice
mean_ap = np.nanmean(self.average_precision_per_class)
/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/object_detection/utils/object_detection_evaluation.py:1338: RuntimeWarning: Mean of empty slice
mean_corloc = np.nanmean(self.corloc_per_class)
Traceback (most recent call last):
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: 2 root error(s) found.
(0) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/MobilenetV3/Conv/Conv2D}}]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/Reshape_88/_1017]]
(1) Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node FeatureExtractor/MobilenetV3/Conv/Conv2D}}]]
0 successful operations.
0 derived errors ignored.
A second traceback follows the first one:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval.py", line 142, in <module>
tf.app.run()
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/jupyter-20523033/.conda/envs/tf115/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
I can't post the remaining lines of this traceback because the post gets flagged as spam.
Does anyone know what causes this problem and how to solve it?
Additional information: I am using TensorFlow 1.15 (GPU build) and Python 3.7.16.
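For context, "failed to create cublas handle" and "Could not create cudnn handle" are often reported when the GPU has no free memory left, for example because the training process or a notebook kernel still holds it. That is only a guess for this case. A minimal sketch of two quick checks is below: the helper name eval_environment is hypothetical, and it assumes this TensorFlow build honors the TF_FORCE_GPU_ALLOW_GROWTH environment variable. Setting CUDA_VISIBLE_DEVICES to -1 hides the GPU, so the evaluation runs on the CPU.

```python
import os


def eval_environment(use_gpu=True, base_env=None):
    """Return a copy of the environment to launch the evaluation script with.

    Hypothetical helper for illustration, not part of TensorFlow.
    """
    env = dict(os.environ if base_env is None else base_env)
    if use_gpu:
        # Assumption: this TensorFlow build reads the variable and then
        # allocates GPU memory on demand instead of reserving it all at once.
        env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    else:
        # Hide every GPU from CUDA so the evaluation falls back to the CPU.
        env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env


if __name__ == "__main__":
    # Show only the settings this helper adds, for both variants.
    for use_gpu in (True, False):
        added = eval_environment(use_gpu=use_gpu, base_env={})
        print("use_gpu=%s -> %s" % (use_gpu, added))
```

The returned dictionary can be passed as the env argument of subprocess.run when launching eval.py with its usual arguments. If the evaluation succeeds on the CPU but fails on the GPU, that points to GPU memory or the CUDA/cuDNN setup rather than to the checkpoint.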