
My Java application, running under Java 14 on a single-CPU ARMv7 (32-bit) device, is occasionally crashing after running under load for a number of hours, and it always fails in ThreadLocalAllocBuffer::resize():

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0xb6cd515e, pid=1725, tid=1733
    #
    # JRE version: OpenJDK Runtime Environment (14.0+36) (build 14+36)
    # Java VM: OpenJDK Client VM (14+36, mixed mode, serial gc, linux-arm)
    # Problematic frame:
    # V  [libjvm.so+0x48015e]  ThreadLocalAllocBuffer::resize()+0x85
    #
    # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # If you would like to submit a bug report, please visit:
    #   https://bugreport.java.com/bugreport/crash.jsp
    #

--------------- S U M M A R Y ------------

Command Line: -Duser.home=/mnt/app/share/log -Djdk.lang.Process.launchMechanism=vfork -Xms150m -Xmx900m -Dcom.mchange.v2.log.MLog=com.mchange.v2.log.jdk14logging.Jdk14MLog -Dorg.jboss.logging.provider=jdk -Djava.util.logging.config.class=com.jthink.songkong.logging.StandardLogging --add-opens=java.base/java.lang=ALL-UNNAMED lib/songkong-6.9.jar -r

Host: Marvell PJ4Bv7 Processor rev 1 (v7l), 1 cores, 1G, Buildroot 2014.11-rc1
Time: Fri Apr 24 19:36:54 2020 BST elapsed time: 37456 seconds (0d 10h 24m 16s)


    --------------- T H R E A D ---------------

    Current thread (0xb6582a30): VMThread "VM Thread" [stack: 0x7b716000,0x7b796000] [id=3625] _threads_hazard_ptr=0x7742f140

    Stack: [0x7b716000,0x7b796000], sp=0x7b7946b0, free space=505k
    Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
    V [libjvm.so+0x48015e] ThreadLocalAllocBuffer::resize()+0x85

    [error occurred during error reporting (printing native stack), id 0xb, SIGSEGV (0xb) at pc=0xb6b4ccae]

Now this must surely be a bug in the JVM, but as it's not one of the standard Java platforms and I don't have a simple test case, I can't see it getting fixed any time soon, so I am trying to work around it. It's also worth noting that it crashed in ThreadLocalAllocBuffer::accumulate_statistics_before_gc() when I used Java 11, which is why I moved to Java 14 to try and resolve the issue.

As the issue is with TLABs, one solution is to disable TLABs entirely with -XX:-UseTLAB, but that makes the code run slower on an already slow machine.

So I think another solution is to disable resizing with -XX:-ResizeTLAB, but then I need to work out a suitable size and specify it using -XX:TLABSize=N. But I am not sure what N actually represents or what a suitable size to set would be.

I tried setting -XX:TLABSize=1000000, which seems to me to be quite large?
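One way to get a feel for the right order of magnitude (picking up a suggestion from the comments below) is to dump the JVM's own TLAB-related defaults, which at least shows MinTLABSize for this particular build. A minimal check, assuming a POSIX shell:

    java -XX:+PrintFlagsFinal -version | grep -i tlab

My working assumption (I have not found it clearly documented) is that TLABSize is given in bytes, with 0 meaning the size is chosen ergonomically.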

I have some logging set with

 -Xlog:tlab*=debug,tlab*=trace:file=gc.log:time:filecount=7,filesize=8M

but I don't really understand the output.

[2020-05-19T15:43:43.836+0100] ThreadLocalAllocBuffer::compute_size(132) returns 250132
[2020-05-19T15:43:43.837+0100] TLAB: fill thread: 0x0026d548 [id: 871] desired_size: 976KB slow allocs: 0  refill waste: 15624B alloc: 0.25725     1606KB refills: 1 waste  0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.853+0100] ThreadLocalAllocBuffer::compute_size(6) returns 250006
[2020-05-19T15:43:43.854+0100] TLAB: fill thread: 0xb669be48 [id: 32635] desired_size: 976KB slow allocs: 0  refill waste: 15624B alloc: 0.00002        0KB refills: 1 waste  0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.910+0100] ThreadLocalAllocBuffer::compute_size(4) returns 250004
[2020-05-19T15:43:43.911+0100] TLAB: fill thread: 0x76c1d6f8 [id: 917] desired_size: 976KB slow allocs: 0  refill waste: 15624B alloc: 0.91261     8085KB refills: 1 waste  0.0% gc: 0B slow: 0B fast: 0B
[2020-05-19T15:43:43.962+0100] ThreadLocalAllocBuffer::compute_size(2052) returns 252052
[2020-05-19T15:43:43.962+0100] TLAB: fill thread: 0x76e06f10 [id: 534] desired_size: 976KB slow allocs: 4  refill waste: 15688B alloc: 0.13977     1612KB refills: 2 waste  0.2% gc: 0B slow: 4520B fast: 0B
[2020-05-19T15:43:43.982+0100] ThreadLocalAllocBuffer::compute_size(28878) returns 278878
[2020-05-19T15:43:43.983+0100] TLAB: fill thread: 0x76e06f10 [id: 534] desired_size: 976KB slow allocs: 4  refill waste: 15624B alloc: 0.13977     1764KB refills: 3 waste  0.3% gc: 0B slow: 10424B fast: 0B
[2020-05-19T15:43:44.023+0100] ThreadLocalAllocBuffer::compute_size(4) returns 250004
[2020-05-19T15:43:44.023+0100] TLAB: fill thread: 0x7991df20 [id: 32696] desired_size: 976KB slow allocs: 0  refill waste: 15624B alloc: 0.00132       19KB refills: 1 waste  0.0% gc: 0B slow: 0B fast: 0B
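My tentative reading of these numbers (an assumption on my part; the units are not stated in the log): compute_size appears to work in heap words, which are 4 bytes each on this 32-bit VM, so:

    1000000 bytes (-XX:TLABSize=1000000) / 4 bytes per word = 250000 words   (≈ 976KB, the reported desired_size)
    compute_size(132) = 250000 + 132 = 250132 words                          (desired size plus the allocation being requested)

If that reading is right, the TLABSize setting is being honoured and is exactly what produces the 976KB desired_size.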

Update

I reran with the -XX:+HeapDumpOnOutOfMemoryError option added, and this time it showed:

java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid1600.hprof ...

but then the dump itself failed with

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0xb6a81b9a, pid=1600, tid=1606
#
# JRE version: OpenJDK Runtime Environment (14.0+36) (build 14+36)
# Java VM: OpenJDK Client VM (14+36, mixed mode, serial gc, linux-arm)
# Problematic frame:
# V  [libjvm.so+0x22eb9a]  DumperSupport::dump_field_value(DumpWriter*, char, oopDesc*, int)+0x91
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/system/config/Apps/SongKong/songkong/hs_err_pid1600.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp

I am not clear whether the dump failed because of ulimit or something else, but although java_pid1600.hprof was created, it was empty.
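One thing worth doing for the next run is to enable core dumps in the shell that launches the application, as the crash banner itself suggests, so that something is captured even if the hprof write fails again:

    ulimit -c unlimited    # must be run in the same shell, before starting java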

I was also monitoring the process with jstat -gc and jstat -gcutil. I paste the end of the output here; to me it does not look like there was a particular memory problem before the crash, although I am only sampling every 5 seconds, so maybe that is the issue?

[root@N1-0247 bin]# ./jstat -gc 1600 5s

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
........
30720.0 30720.0  0.0    0.0   245760.0 236647.2  614400.0   494429.2  50136.0 49436.9  0.0    0.0     5084 3042.643  155   745.523   -          - 3788.166
30720.0 30720.0  0.0   28806.1 245760.0 244460.2  614400.0   506541.7  50136.0 49436.9  0.0    0.0     5085 3043.887  156   745.523   -          - 3789.410
30720.0 30720.0 28760.4  0.0   245760.0 245760.0  614400.0   514809.7  50136.0 49437.2  0.0    0.0     5086 3044.895  157   751.204   -          - 3796.098
30720.0 30720.0  0.0   231.1  245760.0 234781.8  614400.0   514809.7  50136.0 49437.2  0.0    0.0     5087 3044.895  157   755.042   -          - 3799.936
30720.0 30720.0  0.0    0.0   245760.0 190385.5  614400.0   519650.7  50136.0 49449.6  0.0    0.0     5087 3045.905  159   758.890   -          - 3804.795
30720.0 30720.0  0.0    0.0   245760.0 190385.5  614400.0   519650.7  50136.0 49449.6  0.0    0.0     5087 3045.905  159   758.890   -          - 3804.795

[root@N1-0247 bin]# ./jstat -gcutil 1600 5s
     S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT    CGC    CGCT     GCT
..............
     99.70   0.00 100.00  75.54  98.56      -   5080 3037.321   150  724.674     -        - 3761.995
      0.00  29.93  99.30  75.55  98.56      -   5081 3038.403   151  728.584     -        - 3766.987
      0.00 100.00  99.30  75.94  98.56      -   5081 3039.405   152  728.584     -        - 3767.989
    100.00   0.00  99.14  76.14  98.56      -   5082 3040.366   153  734.088     -        - 3774.454
      0.00  96.58  99.87  78.50  98.57      -   5083 3041.366   154  737.960     -        - 3779.325
     56.99   0.00 100.00  78.50  98.58      -   5084 3041.366   154  741.880     -        - 3783.246
      0.00   0.00  96.29  80.47  98.61      -   5084 3042.643   155  745.523     -        - 3788.166
      0.00  93.77  99.47  82.44  98.61      -   5085 3043.887   156  745.523     -        - 3789.410
     93.62   0.00 100.00  83.79  98.61      -   5086 3044.895   157  751.204     -        - 3796.098
      0.00   0.76  95.53  83.79  98.61      -   5087 3044.895   157  755.042     -        - 3799.936
      0.00   0.00  77.47  84.58  98.63      -   5087 3045.905   159  758.890     -        - 3804.795
      0.00   0.00  77.47  84.58  98.63      -   5087 3045.905   159  758.890     -        - 3804.795
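If the 5-second sampling really is too coarse, a finer interval with a timestamp column (so the samples can be lined up with the time recorded in hs_err_pid*.log) might show more, for example:

    ./jstat -gc -t 1600 1s >> jstat-gc.log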

Update: latest run

I configured GC logging, and I get many

Pause Young (Allocation Failure)

entries. Does this indicate I need to make the Eden space larger?

[2020-05-29T14:00:22.668+0100] GC(44) Pause Young (GCLocker Initiated GC)
[2020-05-29T14:00:22.739+0100] GC(44) DefNew: 43230K(46208K)->4507K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2142K(5120K)->4507K(5120K)
[2020-05-29T14:00:22.739+0100] GC(44) Tenured: 50532K(102400K)->50532K(102400K)
[2020-05-29T14:00:22.740+0100] GC(44) Metaspace: 40054K(40536K)->40054K(40536K)
[2020-05-29T14:00:22.740+0100] GC(44) Pause Young (GCLocker Initiated GC) 91M->53M(145M) 72.532ms
[2020-05-29T14:00:22.741+0100] GC(44) User=0.07s Sys=0.00s Real=0.07s
[2020-05-29T14:00:25.196+0100] GC(45) Pause Young (Allocation Failure)
[2020-05-29T14:00:25.306+0100] GC(45) DefNew: 45595K(46208K)->2150K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 4507K(5120K)->2150K(5120K)
[2020-05-29T14:00:25.306+0100] GC(45) Tenured: 50532K(102400K)->53861K(102400K)
[2020-05-29T14:00:25.307+0100] GC(45) Metaspace: 40177K(40664K)->40177K(40664K)
[2020-05-29T14:00:25.307+0100] GC(45) Pause Young (Allocation Failure) 93M->54M(145M) 111.252ms
[2020-05-29T14:00:25.308+0100] GC(45) User=0.08s Sys=0.02s Real=0.11s
[2020-05-29T14:00:29.248+0100] GC(46) Pause Young (Allocation Failure)
[2020-05-29T14:00:29.404+0100] GC(46) DefNew: 43238K(46208K)->4318K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2150K(5120K)->4318K(5120K)
[2020-05-29T14:00:29.405+0100] GC(46) Tenured: 53861K(102400K)->53861K(102400K)
[2020-05-29T14:00:29.405+0100] GC(46) Metaspace: 40319K(40792K)->40319K(40792K)
[2020-05-29T14:00:29.406+0100] GC(46) Pause Young (Allocation Failure) 94M->56M(145M) 157.614ms
[2020-05-29T14:00:29.406+0100] GC(46) User=0.07s Sys=0.00s Real=0.16s
[2020-05-29T14:00:36.466+0100] GC(47) Pause Young (Allocation Failure)
[2020-05-29T14:00:36.661+0100] GC(47) DefNew: 45406K(46208K)->5120K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 4318K(5120K)->5120K(5120K)
[2020-05-29T14:00:36.662+0100] GC(47) Tenured: 53861K(102400K)->55125K(102400K)
[2020-05-29T14:00:36.662+0100] GC(47) Metaspace: 40397K(40920K)->40397K(40920K)
[2020-05-29T14:00:36.663+0100] GC(47) Pause Young (Allocation Failure) 96M->58M(145M) 196.531ms
[2020-05-29T14:00:36.663+0100] GC(47) User=0.09s Sys=0.01s Real=0.19s
[2020-05-29T14:00:40.523+0100] GC(48) Pause Young (Allocation Failure)
[2020-05-29T14:00:40.653+0100] GC(48) DefNew: 44274K(46208K)->2300K(46208K) Eden: 39154K(41088K)->0K(41088K) From: 5120K(5120K)->2300K(5120K)
[2020-05-29T14:00:40.653+0100] GC(48) Tenured: 55125K(102400K)->59965K(102400K)
[2020-05-29T14:00:40.654+0100] GC(48) Metaspace: 40530K(41048K)->40530K(41048K)
[2020-05-29T14:00:40.654+0100] GC(48) Pause Young (Allocation Failure) 97M->60M(145M) 131.365ms
[2020-05-29T14:00:40.655+0100] GC(48) User=0.11s Sys=0.01s Real=0.14s
[2020-05-29T14:00:43.936+0100] GC(49) Pause Young (Allocation Failure)
[2020-05-29T14:00:44.100+0100] GC(49) DefNew: 43388K(46208K)->5120K(46208K) Eden: 41088K(41088K)->0K(41088K) From: 2300K(5120K)->5120K(5120K)
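Coming back to the Eden question above: if a larger young generation does turn out to help, my understanding is that it can be sized explicitly for the serial collector, for example (64m is only an illustrative value, not a recommendation):

    -Xmn64m                              # fixed young generation size
    -XX:NewSize=64m -XX:MaxNewSize=64m   # the same thing spelled out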

Update: GC analysis done by gceasy

Okay, so this is useful: I uploaded the log to gceasy.org and it clearly shows that shortly before the crash the heap size was significantly higher and approaching the 900MB limit, even after a number of full GCs, so I think it basically ran out of heap space.

[gceasy.org heap usage chart]

What is a little frustrating is that I have the

-XX:+HeapDumpOnOutOfMemoryError

option enabled, but when it crashes it reports an issue trying to create the dump file, so I cannot get one.

When I process the same files on Windows with the same heap size setting it succeeds without failure. I'm going to run again with GC logging enabled and see whether memory reaches similar levels even if it doesn't actually fall over.
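The GC logging line for that run, modelled on the TLAB logging line above (the decorators and rotation settings are just my choice), would be something like:

    -Xlog:gc*:file=gc.log:time,uptime:filecount=7,filesize=8M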

I ran again (this run builds on changes made in the previous run and doesn't show the start of the run). To me the memory usage is higher but looks quite normal (sawtooth pattern), with no particular difference before the crash.

[gceasy.org heap usage chart from the later run]

Update

For the last run I reduced the max heap from 900MB to 600MB, and I also monitored with vmstat. You can see clearly below where the application crashed, but it doesn't seem we were approaching particularly low memory at that point.

        procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
     3  0      0  57072   7812 1174128    0    0  5360     0  211  558 96  4  0  0  0
     1  0      0  55220   7812 1176184    0    0  2048     0  203  467 79 21  0  0  0
     3  0      0  61296   7812 1169096    0    0  2036    44  193  520 96  4  0  0  0
     2  0      0  59808   7812 1171144    0    0  2048    32  212  522 96  4  0  0  0
     1  0      0  59436   7812 1171144    0    0     0     0  180  307 83 17  0  0  0
     1  0      0  59436   7812 1171144    0    0     0     0  179  173 100  0  0  0  0
     1  0      0  59436   7812 1171128    0    0     0     0  179  184 100  0  0  0  0
     2  1      0  51764   7816 1158452    0    0  4124    52  190  490 80 20  0  0  0
     3  0      0  63428   7612 1146388    0    0 20472    48  251  533 86 14  0  0  0
     2  0      0  63428   7616 1146412    0    0     4     0  196  508 99  1  0  0  0
     2  0      0  84136   7616 1146400    0    0     0     0  186  461 84 16  0  0  0
     2  0      0  61436   7608 1148960    0    0 24601     0  325  727 77 23  0  0  0
     4  0      0  60196   7648 1150204    0    0  1160    76  232  611 98  2  0  0  0
     4  0      0  59204   7656 1151052    0    0    52   376  305  570 80 20  0  0  0
     3  0      0  59204   7656 1151052    0    0     0     0  378  433 96  4  0  0  0
     1  0      0 762248   7768 1151420    0    0   106     0  253  660 74 26  0  0  0
     0  0      0 859272   8188 1151892    0    0   417     0  302  550  9 26 64  1  0
     0  0      0 859272   8188 1151892    0    0     0     0  111  132  0  0 100  0  0
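For any further runs, a crude timestamped memory log would make it easier to line the numbers up with the exact crash time (assuming a POSIX shell and the usual free command on the box):

    while true; do date; free; sleep 5; done >> mem.log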
Paul Taylor
  • An ancient blog on the calculation of the dynamic TLAB size per thread might give some pointers to how the usual sizes are calculated: https://blogs.oracle.com/jonthecollector/the-real-thing and here: https://dzone.com/articles/thread-local-allocation-buffers in the comments the suggestion to use Java Flight Recorder for monitoring the effectiveness of your TLAB size from a performance perspective – JohannesB May 27 '20 at 20:50
  • It's very ambitious to try to work around a bug in software just by setting some options, when it's not really clear whether the bug truly is in that function or the function just showed the symptoms. The last line is a big alarming sign: `error occurred during error reporting (printing native stack), id 0xb, SIGSEGV (0xb) at pc=0xb6b4ccae]` as it indicates that not just the function made an illegal action but rather, the JVM seems to be in an illegal state, producing SIGSEGVs at entirely unrelated places. Which suggests that you are looking at symptoms only. – Holger May 28 '20 at 10:49
  • But besides that, you can use `-XX:+PrintFlagsFinal` to see the value for `MinTLABSize` in your setup, which gives you a hint about the order of magnitude for the `TLABSize` option. – Holger May 28 '20 at 10:51
  • @Holger I know, but it happens at a seemingly random time after running for a long period, and JVMs shouldn't SIGSEGV however bad the code is, so really how am I meant to track what is causing it? – Paul Taylor May 28 '20 at 11:13
  • I know that the cause is hard to track (if not even impossible), especially in software you didn't write. If I had better advice, I would have written it. You can use the options mentioned in [my other comment](https://stackoverflow.com/questions/61893915/java-how-do-i-decide-setting-a-suitable-tlabsize-for-java-application#comment109767355_61893915) and perhaps a fixed TLAB size truly helps, but I'm not very optimistic… – Holger May 28 '20 at 12:23
  • ok, weirdly I was monitoring the program's garbage collection/memory allocation with ./jstat -gc pid 5s and after 18 hours jstat itself failed with a segmentation fault (the application continued to work and I could rerun jstat) – Paul Taylor May 28 '20 at 15:53
  • @Holger I've pasted the output of the last failed run; now it failed trying to do the memory dump! – Paul Taylor May 29 '20 at 04:53
  • I'm concerned about the apparent need to specify `--add-opens=java.base/java.lang=ALL-UNNAMED`. Besides that, you should really also consider the other possibilities: 1) true hardware problem and 2) operating system issue. – Holger May 29 '20 at 06:42
  • @Holger that's just due to the move from Java 8 to Java 9; a couple of libs had problems with the module approach – Paul Taylor May 29 '20 at 09:05
  • …and these problems or the necessity to specify `--add-opens=java.base/java.lang=ALL-UNNAMED` imply that these libraries are accessing internals of the `java.lang` package. Internals that could have changed between the Java versions in a way that these accesses now compromise the stability of the JVM. We're talking about *the `java.lang` package*… – Holger May 29 '20 at 10:10
  • But it works okay on all other platforms with the same code. From memory it was a problem with Hibernate, but other issues prevented me updating Hibernate. I'm looking at removing Hibernate, I don't think it is benefiting me, but that is not a quick fix. – Paul Taylor May 29 '20 at 10:59
  • What else is running on the same host? You mentioned Hibernate - is the database also running on the same host? Your error log shows that you use the jdk.lang.Process.launchMechanism system property. - is your java process forking other processes? – rmunge Jun 03 '20 at 01:23
  • Nothing, hibernate is running against h2, so all within the single JVM. I use jdk.lang.Process.launchMechanism just because of an issue I had updating from Java 11 to Java 14 – Paul Taylor Jun 03 '20 at 06:44

2 Answers


I think you may already be on the wrong track:

It is more likely that your process has a general problem with allocating memory than that there are two different bugs in two different Java versions.

Have you already checked whether the process has enough memory? A segmentation fault can also occur when the process runs out of memory. I would also check the configuration of the swap file. Years ago I got inexplicable segfaults with Java 8 also somewhere in a resize or allocation method. In my case the size of the OS's swap file was set to zero.
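For reference, a quick way to check both on a typical Linux box (the same commands that come up in the comments below):

    sysctl vm.overcommit_memory    # 0 = heuristic overcommit, the usual default
    free -h                        # the Swap: line shows whether any swap is configured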

What error do you see at the top of the error log file? You only copied the information for a single thread.

UPDATE

You definitely do not have a problem with GC. If the GC were overloaded you would eventually get a java.lang.OutOfMemoryError with the message:

GC Overhead limit exceeded

GC tries to collect garbage but it also has CPU constraints. Concrete behavior depends on the actual GC implementation but usually garbage will accumulate (see your big OldGen) before the GC uses more CPU cycles. So an increased heap usage is completely normal as long as you do not get the mentioned OOM error.

The segmentation faults in the native code are an indicator that there's something wrong with accessing native memory. You even get segmentation faults when the JVM tries to generate a dump. This is an additional indicator for a general problem with accessing native memory.

What's still unanswered is whether you really have enough native memory for all the processes running on your host.

Linux's overcommitment of memory usually triggers the OOM killer. But there are situations where the OOM killer is not triggered (see the kernel documentation for details). In such cases it is possible that a process dies with a SIGSEGV. Like other native applications, the JVM makes use of mmap. The mmap man page also mentions that, depending on the parameters used, a SIGSEGV may occur upon a write if no physical memory is available.
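If the kernel did intervene at some point it usually leaves a trace in the kernel log; the exact wording varies between kernel versions, but something along these lines should find it:

    dmesg | grep -iE 'out of memory|oom-killer|killed process'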

rmunge
  • Well, the application has never crashed in any other environment (Windows, macOS, standard Linux) and never crashed when using Java 9. It is possible that it is running out of memory, but then an OutOfMemoryError should be thrown; the JVM should not just crash. It could be related to swap perhaps, not sure how I check this? – Paul Taylor May 21 '20 at 07:10
  • Since the OS may allow memory overcommit you can run into situations where you do not get an out of memory error. You haven't mentioned your OS, so I assume you're using Linux. Check which overcommit mode your OS is running in with this command: sysctl vm.overcommit_memory. Checking the swap file configuration depends on your OS. Your hardware description makes me assume that you're running on a Raspberry Pi. If you use Raspbian, the following command shows you the status of the swap file configuration: sudo service dphys-swapfile status – rmunge May 21 '20 at 08:19
  • Not using a Raspberry Pi, I am using Linux 3.3.4-ROMROOT-NT1; the command sysctl vm.overcommit_memory returns vm.overcommit_memory = 0 – Paul Taylor May 21 '20 at 09:15
  • Check swap size with free -h. If swap size is 0 then swapping is disabled. You could also temporarily disable overcommitment by changing vm.overcommit_memory to 2 and see whether you get an out of memory error instead of the segfault. – rmunge May 22 '20 at 05:24
  • ok, swap is disabled; free -h shows 1.2GB is being used. I have set the heap size to 900MB, so is 1.2GB the total memory actually being used by my app (heap + metaspace + thread stack space etc.) or the max that could be used if using all 900MB of heap? – Paul Taylor May 22 '20 at 08:53
  • Java allocates memory for the heap independent from actual usage. The heap is only the memory reserved for Java objects but the JVM requires additional memory: memory for each started thread, memory for JIT compilation (code cache), memory for native libraries accessed through JNI, off-heap memory, allocated by direct or mapped ByteBuffers and last but not least memory used by the garbage collector. – rmunge May 23 '20 at 10:22
  • How much physical memory do you have? – rmunge May 24 '20 at 10:11
  • I have 2GB of memory – Paul Taylor May 26 '20 at 06:07
  • If the Linux kernel were facing extreme memory shortage, shouldn't the OOM killer have kicked in? (Did you check the kernel logs with e.g. `dmesg`, or monitor the health of the system's memory usage with something like sar (system activity reporter)?) I'm no expert on this, but couldn't the hardware be faulty? Is it reproducible on other boards? You might want to consider running some memory testing, not that that offers any guarantees: https://unix.stackexchange.com/questions/38887/how-can-i-test-the-ram-for-data-corruption-on-an-arm-based-system – JohannesB May 26 '20 at 09:40
  • I have reproduced it on another machine to check this, so it's not a hardware issue. I'm trying to monitor with jstat but have had a few false starts – Paul Taylor May 27 '20 at 06:32
  • I would suggest monitoring free memory also while your program is running. Note that there may also be other processes running that eat memory. Additional processes may also be started temporarily, e.g. through cron jobs. If you don't have a monitoring tool at hand, you could use _free -s n_ to print free memory every n seconds. – rmunge May 30 '20 at 14:43
  • @PaulTaylor are you really sure that you have 2 GB of physical memory? Your first error log in your question contains the following line: _Marvell PJ4Bv7 Processor rev 1 (v7l), 1 cores, 1G, Buildroot 2014.11-rc1_ ... the 1G would be more an indicator for 1 GB of physical memory than 2 GB – rmunge May 30 '20 at 17:22
  • @PaulTaylor Can you please verify available amount of physical memory with the following command? _grep MemTotal /proc/meminfo_ – rmunge May 30 '20 at 17:29
  • @rmunge I'm sure, see the comments at https://stackoverflow.com/questions/60095093/java-9-0-1-openjdk-crashed-the-jvm-with-problem-with-adaptiveweightedaverage; a colleague looked into the code and found that the way it is calculated rounds it down. But anyway, the command you requested gives: MemTotal: 2074812 kB – Paul Taylor May 31 '20 at 06:41

Based on your jstat data and their explanation here: https://docs.oracle.com/en/java/javase/11/tools/jstat.html#GUID-5F72A7F9-5D5A-4486-8201-E1D1BA8ACCB5

I would not expect an OutOfMemoryError from heap space just yet, based on the slow and steady rate at which the Old Generation is filling up and the small size of the from and to spaces (not that I know whether your application might allocate a huge array anytime soon), unless:

  • initial heap size (-Xms) is smaller than the max (-Xmx) and
  • Linux has overcomitted virtual memory

If you do overcommit (and who doesn't), maybe you should keep an eye on Linux with vmstat 1, or gather data frequently with sar.
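For example (the intervals are arbitrary, and sar requires the sysstat package to be installed):

    vmstat 1        # one sample per second on the console
    sar -r 5 720    # memory utilisation every 5 seconds for an hour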

But I do wonder why you are not using Garbage Collection logging with -Xlog:gc*:stderr or to a file with -Xlog:gc*:file= and maybe analyze that with https://gceasy.io/ as it is very low overhead (unless writing to the logfile is slow) and very precise. For more information on the logging syntax see: https://openjdk.java.net/jeps/158 and https://openjdk.java.net/jeps/271

java -Xlog:gc*:stderr -jar yourapp.jar
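or, to keep rotating log files on disk instead of writing to stderr (the rotation options here mirror the TLAB logging line already used in the question):

    java -Xlog:gc*:file=gc.log:time,uptime:filecount=7,filesize=8M -jar yourapp.jar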

and analyze those logs with great ease with tools like gceasy.io. This should give similar information to jstat, and more, in real time (as far as I know).

JohannesB
  • The initial heap size (-Xms) is smaller than the max (-Xmx); do you think I should try setting them the same? There is no swap on the machine and I can't change that, but I set the heap to less than half the available memory and there is little else running, so it should be ok. I now have vmstat running. – Paul Taylor May 29 '20 at 09:09
  • Monitoring GC is not something I have had to bother with before, so all this is new to me, but I was thinking of logging GC so I will try that next. – Paul Taylor May 29 '20 at 09:11
  • Great that you are monitoring your virtual memory now, as that might reveal any issues with the operating system's virtual memory limits, because the Java heap might grow to -Xmx and besides that Java uses native memory for each thread (stack), metaspace, and anything implemented in native methods (JNI) like zip, direct ByteBuffer etc. for efficiency reasons – JohannesB May 29 '20 at 09:26
  • Please also verify the output of `ulimit -v` and `ulimit -s` and friends (`ulimit -a`) for the user that is running Java, to check that it does not limit memory usage; for some explanation see the bash manual page (I assume you are using bash as your shell, and ulimit is a shell builtin) and https://man7.org/linux/man-pages/man2/getrlimit.2.html - as these limits may also cause segmentation faults. – JohannesB May 29 '20 at 09:55
  • ulimit -v is unlimited, ulimit -s is 8192 – Paul Taylor May 29 '20 at 10:52
  • See the update to the question; should it be GCing the young generation so often? – Paul Taylor May 29 '20 at 13:04
  • It depends on the size and rate of allocation. Whether it is a problem depends on the performance requirements of your application, but you should see it as a trade-off where your application will be able to allocate memory very fast once GC has finished. Young GC is usually the least of your worries as it is a small (and hot) part of the heap and most objects are dead - but yes there is some overhead involved each time like waiting for all threads to have come to a safepoint (the time this takes and when this occurs is also something you can inspect with more logging) and scanning GC roots. – JohannesB May 29 '20 at 13:30
  • Well, thinking about it in simple terms: the app processes audio files in batches, one folder = one batch. The batches are queued up so only a few batches are processed at a time. For a batch, objects are created representing the files and other things, and things are written to and from the db. But once the batch has been processed almost none of its heap data is needed; the important changes are stored in the db and will only be needed for a final report once all the batches are processed. So my point is the app will be continually allocating/deallocating new objects for each batch. – Paul Taylor May 29 '20 at 14:10
  • But few objects need to exist for a long time, i.e. processing 50,000 songs on this machine may take 48 hours, but most of the data for a particular batch can be freed within a few minutes of being created. So there are many objects that may exist for a few minutes but few that exist for a few hours, so I'm just wondering if that means the young space should be proportionally larger? – Paul Taylor May 29 '20 at 14:12
  • This could be done with NewSize or NewRatio but should not be necessary to fix your original problem, but might improve throughput by a fraction of 3% (the current overhead of all young gc according to an analysis of your gc logs). See for more info: https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html I expect you eventually are going to ask what pesky objects keep surviving YoungGC and even FullGC and keep filling the old generation, maybe it is time to skip to that step and generate a heap dump eg. with (J)VisualVM and see if you find (m)any unexpected objects. – JohannesB May 29 '20 at 14:43
  • Are you sure you are not running out of Linux memory? Maybe lower the maximum heap size a bit to get a bit of safety margin, e.g. by 200M, unless you have definitive proof that you are nowhere near running out of memory (e.g. by watching: `while /bin/true; do grep MemAvailable /proc/meminfo; sleep 1; done`, as even dumping memory might cost extra memory which might not be available anymore) – JohannesB Jun 03 '20 at 18:25
  • Hi @JohannesB, not sure; I have no MemAvailable in /proc/meminfo, but if I use vmstat 1 is the 'free' column what I need to watch? – Paul Taylor Jun 03 '20 at 19:00
  • Sum(free+cache+buffers) should not get near zero. See also: https://www.linuxatemyram.com/ – JohannesB Jun 03 '20 at 19:25
  • So I ran again with a lower max heap (600M) and vmstat running, and it failed, although it took longer. vmstat didn't seem to indicate memory was particularly low; I've added the output to the question. – Paul Taylor Jun 05 '20 at 11:35
  • There are a few memory allocations that slowly increase over time. I think the solution is going to be to make code changes to reduce the memory needed, and hope that avoids the issue. – Paul Taylor Jun 05 '20 at 11:42
  • Sure, good luck. If you enjoy exploring more options, and as it is related to garbage collection bugs, I feel I should mention that you could try switching to a different, newer garbage collector (actively under development, feedback appreciated), like G1GC, ZGC (both sponsored by Oracle), or Shenandoah (sponsored by RedHat), although they may have some CPU/memory overhead as they are not primarily designed for throughput but for low latency and very large heaps. And they might be even less tested on your platform. – JohannesB Jun 05 '20 at 12:06