I'm trying to reproduce the training of Mask R-CNN from the following repository: https://github.com/maxkferg/metal-defect-detection
The training code is the following:
# Training - Stage 1
print("Training network heads")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=40,
            layers='heads')

# Training - Stage 2
# Finetune layers from ResNet stage 4 and up
print("Fine tune Resnet stage 4 and up")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=120,
            layers='4+')

# # Training - Stage 3
# # Fine tune all layers
print("Fine tune all layers")
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE / 10,
            epochs=160,
            layers='all')
Stage 1 runs smoothly, but training fails as soon as Stage 2 starts, with the following output:
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 123 Chunks of size 2048 totalling 246.0KiB
2020-08-17 15:53:10.685456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2816 totalling 2.8KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 3072 totalling 18.0KiB
2020-08-17 15:53:10.686456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 387 Chunks of size 4096 totalling 1.51MiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6144 totalling 6.0KiB
2020-08-17 15:53:10.687456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 6656 totalling 6.5KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 60 Chunks of size 8192 totalling 480.0KiB
2020-08-17 15:53:10.688456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9216 totalling 18.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 12288 totalling 144.0KiB
2020-08-17 15:53:10.689456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 16384 totalling 32.0KiB
2020-08-17 15:53:10.690456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 21248 totalling 20.8KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 24064 totalling 23.5KiB
2020-08-17 15:53:10.691456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 24576 totalling 120.0KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37632 totalling 36.8KiB
2020-08-17 15:53:10.692456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 40960 totalling 40.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 49152 totalling 192.0KiB
2020-08-17 15:53:10.693456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 65536 totalling 384.0KiB
2020-08-17 15:53:10.694456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 81920 totalling 80.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 90624 totalling 88.5KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 131072 totalling 128.0KiB
2020-08-17 15:53:10.695456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 3 Chunks of size 147456 totalling 432.0KiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 12 Chunks of size 262144 totalling 3.00MiB
2020-08-17 15:53:10.696456: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 327680 totalling 320.0KiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 11 Chunks of size 524288 totalling 5.50MiB
2020-08-17 15:53:10.697457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 4 Chunks of size 589824 totalling 2.25MiB
2020-08-17 15:53:10.698457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 194 Chunks of size 1048576 totalling 194.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 17 Chunks of size 2097152 totalling 34.00MiB
2020-08-17 15:53:10.699457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2211840 totalling 2.11MiB
2020-08-17 15:53:10.700457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 146 Chunks of size 2359296 totalling 328.50MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2360320 totalling 2.25MiB
2020-08-17 15:53:10.701457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2621440 totalling 2.50MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 2698496 totalling 2.57MiB
2020-08-17 15:53:10.702457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 3670016 totalling 3.50MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 31 Chunks of size 4194304 totalling 124.00MiB
2020-08-17 15:53:10.703457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 6 Chunks of size 4718592 totalling 27.00MiB
2020-08-17 15:53:10.704457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 5 Chunks of size 8388608 totalling 40.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 25 Chunks of size 9437184 totalling 225.00MiB
2020-08-17 15:53:10.705457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 2 Chunks of size 9438208 totalling 18.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 9441280 totalling 9.00MiB
2020-08-17 15:53:10.706457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 16138752 totalling 15.39MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 18874368 totalling 18.00MiB
2020-08-17 15:53:10.707457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 1 Chunks of size 37748736 totalling 36.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:680] 7 Chunks of size 51380224 totalling 343.00MiB
2020-08-17 15:53:10.708457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:684] Sum Total of in-use chunks: 1.41GiB
2020-08-17 15:53:10.709457: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:686] Stats: Limit: 1613615104 InUse: 1510723072 MaxInUse: 1510723072 NumAllocs: 3860 MaxAllocSize: 119947776
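To make the final Stats line easier to read, here is a small sketch that converts its byte counts into larger units and computes how much memory was still free under the allocator limit. The variable and function names are my own, and I am assuming that MaxAllocSize is the largest single allocation made so far; this is only my reading of the numbers, not a confirmed diagnosis.

```python
# Byte counts copied verbatim from the "Stats:" line of the log above.
LIMIT = 1613615104           # Limit: memory the allocator is allowed to use
IN_USE = 1510723072          # InUse: memory held when the log was printed
MAX_ALLOC_SIZE = 119947776   # MaxAllocSize: assumed to be the largest single allocation

MIB = 1024 ** 2
GIB = 1024 ** 3


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / MIB


def to_gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / GIB


# Memory still available under the limit at the time of the log.
headroom = LIMIT - IN_USE

print(f"limit:              {to_gib(LIMIT):.2f} GiB")
print(f"in use:             {to_gib(IN_USE):.2f} GiB ({IN_USE / LIMIT:.1%} of limit)")
print(f"headroom:           {to_mib(headroom):.2f} MiB")
print(f"largest allocation: {to_mib(MAX_ALLOC_SIZE):.2f} MiB")
print(f"largest allocation fits in headroom: {MAX_ALLOC_SIZE <= headroom}")
```

If I read this correctly, only about 98 MiB were left under the limit, while the largest allocation seen so far was about 114 MiB, so another request of that size could not be satisfied.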
The training is running on a Quadro K420 with 2 GB of memory. Is this only a problem of insufficient GPU memory, or am I missing something? Is there a way to train the model with my equipment?