
I'm trying to run a training session with Keras on an Amazon EC2 free-tier instance (1 GB of memory). The dataset is very small: 1000 80x80 images, about 20 MB in total. The process nevertheless gets killed after model.fit() has run for about 2 epochs (it varies; sometimes it keeps running for up to 15). I'm trying to disable the OOM killer or find some other workaround... any suggestions? Below you'll find the memory trace, which doesn't show any serious figures, so I'm wondering why the script gets killed in the first place.
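The only code-level workaround I can think of so far is shrinking the batch size, since the tensor that triggers the OOM ([64, 80, 80, 32]) scales linearly with it. Here is a minimal, self-contained sketch of the idea; the toy model, the random data and the 3-channel input are only placeholders, my real architecture lives in image_classifier.py:

import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Stand-in for the real data: 1000 80x80 images (~20 MB in the real set).
x_train = np.random.rand(1000, 80, 80, 3).astype("float32")
y_train = np.random.randint(0, 2, size=(1000,))

# Tiny CNN that only illustrates the batch-size effect, not my actual model.
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(80, 80, 3)),
    MaxPooling2D(),
    Flatten(),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The OOM tensor is [batch, 80, 80, 32], so dropping batch_size from 64 to 8
# shrinks that activation (and its gradient buffers) roughly 8x.
model.fit(x_train, y_train, epochs=2, batch_size=8)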

Error (reproducible on a 1 GB memory instance):

 64/870 [=>............................] - ETA: 12s - loss: 0.4477 - accuracy: 0.8750Traceback (most recent call last):
  File "image_classifier.py", line 990, in <module>
    clf.predict_folder_k_cnn(folder_path='test_photos_2/', label='One', epochs=50)
  File "image_classifier.py", line 951, in predict_folder_k_cnn
    model.fit(self.x_train, self.y_train, epochs=epochs, batch_size=batch_size, **(model_fit_args or {}))
  File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training.py", line 1239, in fit
    validation_freq=validation_freq)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/keras/engine/training_arrays.py", line 196, in fit_loop
    outs = fit_function(ins_batch)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/keras/backend.py", line 3510, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 572, in __call__
    return self._call_flat(args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 671, in _call_flat
    outputs = self._inference_function.call(ctx, args)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 445, in call
    ctx=ctx)
  File "/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[64,80,80,32] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node gradients/max_pool/MaxPool_grad/MaxPoolGrad (defined at /home/ec2-user/.local/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_keras_scratch_graph_1638]

Function call stack:
keras_scratch_graph

dmesg output:

t:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:16kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  504.825883] lowmem_reserve[]: 0 932 932 932
[  504.829525] Node 0 DMA32 free:44316kB min:44316kB low:55392kB high:66468kB active_anon:892184kB inactive_anon:256kB active_file:24kB inactive_file:0kB unevictable:0kB writepending:0kB present:1032192kB managed:991368kB mlocked:0kB kernel_stack:1952kB pagetables:7124kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
[  504.851094] lowmem_reserve[]: 0 0 0 0
[  504.854427] Node 0 DMA: 10*4kB (UME) 11*8kB (UME) 13*16kB (UME) 15*32kB (UE) 9*64kB (UE) 8*128kB (UME) 6*256kB (UME) 1*512kB (E) 0*1024kB 0*2048kB 0*4096kB = 4464kB
[  504.865932] Node 0 DMA32: 1101*4kB (UE) 781*8kB (UE) 458*16kB (UE) 317*32kB (UE) 121*64kB (UME) 46*128kB (UME) 6*256kB (U) 0*512kB 1*1024kB (M) 0*2048kB 0*4096kB = 44316kB
[  504.877626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  504.884964] 103 total pagecache pages
[  504.888296] 0 pages in swap cache
[  504.891399] Swap cache stats: add 0, delete 0, find 0/0
[  504.895970] Free swap  = 0kB
[  504.898881] Total swap = 0kB
[  504.901907] 262045 pages RAM
[  504.904737] 0 pages HighMem/MovableOnly
[  504.908299] 10227 pages reserved
[  504.911383] 0 pages hwpoisoned
[  504.914445] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  504.921393] [ 1931]     0  1931    10278       97      28       3        0             0 systemd-journal
[  504.928934] [ 1961]     0  1961    29191       67      28       4        0             0 lvmetad
[  504.936328] [ 2655]     0  2655    16041      149      30       3        0         -1000 auditd
[  504.943150] [ 2683]    81  2683    15123      118      35       3        0          -900 dbus-daemon
[  504.950385] [ 2686]    32  2686    18423      178      38       3        0             0 rpcbind
[  504.957604] [ 2690]   999  2690     3152       41      12       3        0             0 lsmd
[  504.964760] [ 2691]     0  2691     3274       28      12       3        0             0 rngd
[  504.972138] [ 2693]     0  2693     7117       89      19       3        0             0 systemd-logind
[  504.979632] [ 2700]   997  2700    30649      135      33       3        0             0 chronyd
[  504.987111] [ 2716]     0  2716    24457      163      35       3        0             0 gssproxy
[  504.994331] [ 2920]     0  2920    25156      514      48       3        0             0 dhclient
[  505.001383] [ 2961]     0  2961    25156      510      48       3        0             0 dhclient
[  505.008709] [ 3105]     0  3105    22545      262      44       3        0             0 master
[  505.015992] [ 3109]    89  3109    22567      253      44       3        0             0 pickup
[  505.022854] [ 3110]    89  3110    22586      256      46       3        0             0 qmgr
[  505.029730] [ 3157]     0  3157   117174      442      30       6        0             0 amazon-ssm-agen
[  505.037492] [ 3159]     0  3159    54140      270      41       3        0             0 rsyslogd
[  505.044641] [ 3199]     0  3199    30322       32      12       3        0             0 agetty
[  505.051767] [ 3200]     0  3200     2634       33      11       3        0             0 agetty
[  505.059124] [ 3333]     0  3333    38138      334      76       3        0             0 sshd
[  505.066299] [ 3371]     0  3371     1065       26       8       3        0             0 acpid
[  505.073401] [ 3414]  1000  3414    38175      390      73       3        0             0 sshd
[  505.082220] [ 3415]  1000  3415    31219      269      16       3        0             0 bash
[  505.089459] [ 3564]     0  3564    11355      132      24       3        0         -1000 systemd-udevd
[  505.097212] [ 4261]     0  4261    28182      254      59       4        0         -1000 sshd
[  505.103965] [ 4396]     0  4396    33767      158      21       4        0             0 crond
[  505.110852] [ 4421]     0  4421     6968       50      19       3        0             0 atd
[  505.118310] [22988]  1000 22988    33586       64      21       3        0             0 screen
[  505.125710] [22989]  1000 22989    33621      128      19       3        0             0 screen
[  505.132826] [22990]  1000 22990    31215      270      16       3        0             0 bash
[  505.140153] [23011]  1000 23011   568549   219738     812       5        0             0 python3
[  505.147922] Out of memory: Kill process 23011 (python3) score 875 or sacrifice child
[  505.154309] Killed process 23011 (python3) total-vm:2274196kB, anon-rss:878952kB, file-rss:0kB, shmem-rss:0kB
[  505.195909] oom_reaper: reaped process 23011 (python3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ec2-user@ip-172-31-95-14 ~]$ python3 image_classifier.py 

Memory trace (1 epoch):

image_classifier.py:263: size=159 MiB, count=3, average=53.1 MiB
/home/ec2-user/.local/lib/python3.7/site-packages/tables/atom.py:1224: size=20.1 MiB, count=3715, average=5675 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/lines.py:380: size=2597 KiB, count=1205, average=2207 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:147: size=2546 KiB, count=26034, average=100 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:179: size=1783 KiB, count=18009, average=101 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:93: size=1487 KiB, count=16326, average=93 B
<frozen importlib._bootstrap_external>:525: size=1171 KiB, count=10792, average=111 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/artist.py:75: size=1170 KiB, count=2859, average=419 B
/usr/lib64/python3.7/contextlib.py:82: size=791 KiB, count=5773, average=140 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:131: size=608 KiB, count=22225, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/cbook/__init__.py:795: size=565 KiB, count=61, average=9483 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/util/tf_stack.py:136: size=552 KiB, count=20184, average=28 B
/home/ec2-user/.local/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:365: size=498 KiB, count=2520, average=202 B
/usr/lib64/python3.7/abc.py:143: size=462 KiB, count=3773, average=125 B
<__array_function__ internals>:6: size=342 KiB, count=6058, average=58 B
/home/ec2-user/.local/lib/python3.7/site-packages/matplotlib/transforms.py:180: size=317 KiB, count=6156, average=53 B
/home/ec2-user/.local/lib/python3.7/site-packages/numpy/core/_asarray.py:85: size=294 KiB, count=3926, average=77 B
/home/ec2-user/.local/lib/python3.7/site-packages/cycler.py:227: size=278 KiB, count=3253, average=87 B
  • Try adding a swap file, but if you are out of memory (OOM) there is not much you can do other than upgrading the instance. – Selcuk Feb 12 '20 at 03:56
  • @Selcuk can you show me how to add a swap file? –  Feb 12 '20 at 04:00
  • It depends on the distribution; you may have better luck if you ask this question on https://unix.stackexchange.com/ – Selcuk Feb 12 '20 at 05:23
  • I'm voting to close this question as off-topic because it belongs on https://unix.stackexchange.com/ – Selcuk Feb 12 '20 at 05:23
  • @Selcuk yeah, that's very helpful, especially since everyone who's experienced with Unix is definitely experienced with Python; completely solves the problem, brilliant! – Feb 12 '20 at 05:30
  • Your problem has nothing to do with Python. You are trying to run memory-intensive software on a tiny EC2 instance, which is not a programming question. Adding a swap file/partition to a Unix installation is something the people on that site are experienced with. – Selcuk Feb 12 '20 at 07:13
  • @Selcuk I already added a swap file and it did the trick (rough sketch of the commands just below these comments), thanks anyway. – Feb 12 '20 at 07:40
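Update: as the comments say, adding a swap file is what stopped the OOM killer from killing the process. Roughly the commands for an Amazon Linux instance; the 2 GB size and the /swapfile path are placeholder choices, not requirements:

# create and enable a 2 GB swap file
sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h                        # the Swap line should now show ~2 GB
# make it persist across reboots
echo '/swapfile swap swap defaults 0 0' | sudo tee -a /etc/fstab

Swap is of course much slower than RAM, so this keeps the process alive rather than making it fast; shrinking the batch size (sketch above) is still worth doing.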

0 Answers