I'm trying to run a training session using Keras on an Amazon EC2 free-tier instance (1 GB of memory). The dataset is very small (1000 images of 80 x 80 pixels, 20 MB in total), but the process gets killed after model.fit() has run for about 2 epochs (this varies; sometimes it keeps running for up to 15).
I'm trying to disable the OOM killer or find a workaround. Any suggestions? Below you'll find the error, the dmesg output, and a memory trace. The memory trace does not show any large figures, so I'm wondering why the script gets killed in the first place.
Error (reproducible on a 1 GB memory instance):
64/870 [=>............................] - ETA: 12s - loss: 0.4477 - accuracy: 0.8750
Traceback (most recent call last):
  File "image_classifier.py", line 990, in <module>
    clf.predict_folder_k_cnn(folder_path='test_photos_2/', label='One', epochs=50)
  File "image_classifier.py", line 951, in predict_folder_k_cnn
    model.fit(self.x_train, self.y_train, epochs=epochs, batch_size=batch_size, **(model_fit_args or {}))
  File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3510, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 572, in __call__
    return self._call_flat(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 671, in _call_flat
    outputs = self._inference_function.call(ctx, args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 445, in call
    ctx=ctx)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[64,80,80,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
  [[node gradients/max_pool/MaxPool_grad/MaxPoolGrad (defined at /home/ec2-user/.local/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
  [Op:__inference_keras_scratch_graph_1638]
Function call stack:
keras_scratch_graph
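As a rough sanity check on the error above, here is my own arithmetic for the size of the single tensor that failed to allocate. It assumes that "type float" means 32-bit floats (4 bytes per element); the variable names are mine and are not part of image_classifier.py.

```python
from functools import reduce
from operator import mul

# Shape reported in the ResourceExhaustedError: [batch, height, width, channels]
shape = (64, 80, 80, 32)
BYTES_PER_FLOAT32 = 4  # assumption: "type float" is float32

n_elements = reduce(mul, shape, 1)
n_bytes = n_elements * BYTES_PER_FLOAT32
n_mib = n_bytes / (1024 ** 2)

print("elements:", n_elements)
print("size: %.1f MiB" % n_mib)
```

So one intermediate tensor for a batch of 64 takes about 50 MiB, which is larger than the whole 20 MB dataset on disk. The backward pass presumably needs several buffers of a similar size at the same time.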
dmesg output:
t:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:16kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 504.825883] lowmem_reserve[]: 0 932 932 932
[ 504.829525] Node 0 DMA32 free:44316kB min:44316kB low:55392kB high:66468kB active_anon:892184kB inactive_anon:256kB active_file:24kB inactive_file:0kB unevictable:0kB writepending:0kB present:1032192kB managed:991368kB mlocked:0kB kernel_stack:1952kB pagetables:7124kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
[ 504.851094] lowmem_reserve[]: 0 0 0 0
[ 504.854427] Node 0 DMA: 10*4kB (UME) 11*8kB (UME) 13*16kB (UME) 15*32kB (UE) 9*64kB (UE) 8*128kB (UME) 6*256kB (UME) 1*512kB (E) 0*1024kB 0*2048kB 0*4096kB = 4464kB
[ 504.865932] Node 0 DMA32: 1101*4kB (UE) 781*8kB (UE) 458*16kB (UE) 317*32kB (UE) 121*64kB (UME) 46*128kB (UME) 6*256kB (U) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 44316kB
[ 504.877626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 504.884964] 103 total pagecache pages
[ 504.888296] 0 pages in swap cache
[ 504.891399] Swap cache stats: add 0, delete 0, find 0/0
[ 504.895970] Free swap = 0kB
[ 504.898881] Total swap = 0kB
[ 504.901907] 262045 pages RAM
[ 504.904737] 0 pages HighMem/MovableOnly
[ 504.908299] 10227 pages reserved
[ 504.911383] 0 pages hwpoisoned
[ 504.914445] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
[ 504.921393] [ 1931] 0 1931 10278 97 28 3 0 0 systemd-journal
[ 504.928934] [ 1961] 0 1961 29191 67 28 4 0 0 lvmetad
[ 504.936328] [ 2655] 0 2655 16041 149 30 3 0 -1000 auditd
[ 504.943150] [ 2683] 81 2683 15123 118 35 3 0 -900 dbus-daemon
[ 504.950385] [ 2686] 32 2686 18423 178 38 3 0 0 rpcbind
[ 504.957604] [ 2690] 999 2690 3152 41 12 3 0 0 lsmd
[ 504.964760] [ 2691] 0 2691 3274 28 12 3 0 0 rngd
[ 504.972138] [ 2693] 0 2693 7117 89 19 3 0 0 systemd-logind
[ 504.979632] [ 2700] 997 2700 30649 135 33 3 0 0 chronyd
[ 504.987111] [ 2716] 0 2716 24457 163 35 3 0 0 gssproxy
[ 504.994331] [ 2920] 0 2920 25156 514 48 3 0 0 dhclient
[ 505.001383] [ 2961] 0 2961 25156 510 48 3 0 0 dhclient
[ 505.008709] [ 3105] 0 3105 22545 262 44 3 0 0 master
[ 505.015992] [ 3109] 89 3109 22567 253 44 3 0 0 pickup
[ 505.022854] [ 3110] 89 3110 22586 256 46 3 0 0 qmgr
[ 505.029730] [ 3157] 0 3157 117174 442 30 6 0 0 amazon-ssm-agen
[ 505.037492] [ 3159] 0 3159 54140 270 41 3 0 0 rsyslogd
[ 505.044641] [ 3199] 0 3199 30322 32 12 3 0 0 agetty
[ 505.051767] [ 3200] 0 3200 2634 33 11 3 0 0 agetty
[ 505.059124] [ 3333] 0 3333 38138 334 76 3 0 0 sshd
[ 505.066299] [ 3371] 0 3371 1065 26 8 3 0 0 acpid
[ 505.073401] [ 3414] 1000 3414 38175 390 73 3 0 0 sshd
[ 505.082220] [ 3415] 1000 3415 31219 269 16 3 0 0 bash
[ 505.089459] [ 3564] 0 3564 11355 132 24 3 0 -1000 systemd-udevd
[ 505.097212] [ 4261] 0 4261 28182 254 59 4 0 -1000 sshd
[ 505.103965] [ 4396] 0 4396 33767 158 21 4 0 0 crond
[ 505.110852] [ 4421] 0 4421 6968 50 19 3 0 0 atd
[ 505.118310] [22988] 1000 22988 33586 64 21 3 0 0 screen
[ 505.125710] [22989] 1000 22989 33621 128 19 3 0 0 screen
[ 505.132826] [22990] 1000 22990 31215 270 16 3 0 0 bash
[ 505.140153] [23011] 1000 23011 568549 219738 812 5 0 0 python3
[ 505.147922] Out of memory: Kill process 23011 (python3) score 875 or sacrifice child
[ 505.154309] Killed process 23011 (python3) total-vm:2274196kB, anon-rss:878952kB, file-rss:0kB, shmem-rss:0kB
[ 505.195909] oom_reaper: reaped process 23011 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
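To put the dmesg figures in perspective, the page counts can be converted to kilobytes. This is only a sketch of my own reading of the log, assuming the usual 4 kB page size (which matches the 4kB blocks in the buddy allocator lines); the numbers are copied from the output above.

```python
PAGE_KB = 4  # assumption: 4 kB pages

python3_rss_pages = 219738   # rss column for pid 23011 (python3)
total_ram_pages = 262045     # "262045 pages RAM"
dma32_free_kb = 44316        # "Node 0 DMA32 free:44316kB"

python3_rss_kb = python3_rss_pages * PAGE_KB
total_ram_kb = total_ram_pages * PAGE_KB
share = python3_rss_kb / total_ram_kb

print("python3 rss: %d kB (%.0f MiB)" % (python3_rss_kb, python3_rss_kb / 1024))
print("total RAM:   %d kB (%.0f MiB)" % (total_ram_kb, total_ram_kb / 1024))
print("share of RAM used by python3: %.0f%%" % (share * 100))
print("free in DMA32 zone: %.1f MiB" % (dma32_free_kb / 1024))
```

The computed rss of 878952 kB matches the anon-rss value in the "Killed process" line, so python3 alone held most of the instance's memory, with no swap configured (Total swap = 0kB) and less than 50 MiB free when it was killed.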
[ec2-user@ip-172-31-95-14 ~]$ python3 image_classifier.py
Memory trace (1 epoch):
image_classifier.py:263: size=159 MiB, count=3, average=53.1 MiB
/home/ec2-user/.local/lib/python3.7/site-packages/tables/atom.py:1224: size=20.1 MiB, count=3715, average=5675 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/lines.py:380: size=2597 KiB, count=1205, average=2207 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:147: size=2546 KiB, count=26034, average=100 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:179: size=1783 KiB, count=18009, average=101 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:93: size=1487 KiB, count=16326, average=93 B
<frozen importlib._bootstrap_external>:525: size=1171 KiB, count=10792, average=111 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/artist.py:75: size=1170 KiB, count=2859, average=419 B
/usr/lib64/python3.7/contextlib.py:82: size=791 KiB, count=5773, average=140 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:131: size=608 KiB, count=22225, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/cbook/__init__.py:795: size=565 KiB, count=61, average=9483 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:136: size=552 KiB, count=20184, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:365: size=498 KiB, count=2520, average=202 B
/usr/lib64/python3.7/abc.py:143: size=462 KiB, count=3773, average=125 B
<__array_function__ internals>:6: size=342 KiB, count=6058, average=58 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:180: size=317 KiB, count=6156, average=53 B
/home/ec2-user/.local/lib/python3.7/site-packages/numpy/core/_asarray.py:85: size=294 KiB, count=3926, average=77 B
/home/ec2-user/.local/lib/python3.7/site-packages/cycler.py:227: size=278 KiB, count=3253, average=87 B
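The format of the trace above matches Python's tracemalloc module. Assuming that is what produced it, a minimal sketch of how such a listing is collected looks like the following; the helper name top_allocations and the placeholder workload are hypothetical and stand in for the real training code. As far as I understand, tracemalloc only sees memory allocated through Python's own allocator hooks, so buffers allocated natively inside TensorFlow would probably not appear in it, which may be why the trace looks so small compared with the dmesg figures.

```python
import tracemalloc


def top_allocations(limit=5):
    """Return the `limit` largest allocation sites currently traced."""
    snapshot = tracemalloc.take_snapshot()
    # Statistics grouped by source line, sorted from largest to smallest
    return snapshot.statistics("lineno")[:limit]


tracemalloc.start()

# Placeholder workload standing in for model.fit(...)
data = [bytes(1024) for _ in range(1000)]

stats = top_allocations()
for stat in stats:
    print(stat)

tracemalloc.stop()
```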