1

Recently we started to use OpenJDK JRE instead of Oracle's one. The environment consist of:

  1. Tomcat 8.0.29
  2. CentOS 6.7 (2.6.32-279.el6.imp6.numa.x86_64 #1 SMP Wed Mar 30 11:50:42 IDT 2016 x86_64 x86_64 x86_64 GNU/Linux)
  3. Java 8 update 71
  4. G1 Garbage Collection Algorithm

The server crashes periodically due to "fatal error has been detected by the Java Runtime Environment". I reported about it to java community (JDK-8144331), but they discarded it with "Generally we don't look into OpenJDK issues" claim.

I started to investigate by myself the root cause, I replaced the production RPMs with debug ones and installed the GDB & debuginfo packages. The next errors printed to catalina.out:

== CT verification failed: [0x00007fbc0d051000,0x00007fbc0d0517ff]
==   expecting value: 32
==   card 0x00007fbc0d051639 [0x00000000fdec7200,0x00000000fdec7400], val: 0
==   card 0x00007fbc0d05163a [0x00000000fdec7400,0x00000000fdec7600], val: 0
==   card 0x00007fbc0d05163b [0x00000000fdec7600,0x00000000fdec7800], val: 0
==   card 0x00007fbc0d05163c [0x00000000fdec7800,0x00000000fdec7a00], val: 0
==   card 0x00007fbc0d05163e [0x00000000fdec7c00,0x00000000fdec7e00], val: 0
==   card 0x00007fbc0d05163f [0x00000000fdec7e00,0x00000000fdec8000], val: 0
==   card 0x00007fbc0d051640 [0x00000000fdec8000,0x00000000fdec8200], val: 0
==   card 0x00007fbc0d051641 [0x00000000fdec8200,0x00000000fdec8400], val: 0
==   card 0x00007fbc0d051642 [0x00000000fdec8400,0x00000000fdec8600], val: 0
==   card 0x00007fbc0d051643 [0x00000000fdec8600,0x00000000fdec8800], val: 0
==   card 0x00007fbc0d051644 [0x00000000fdec8800,0x00000000fdec8a00], val: 0
==   card 0x00007fbc0d051645 [0x00000000fdec8a00,0x00000000fdec8c00], val: 0
==   card 0x00007fbc0d051647 [0x00000000fdec8e00,0x00000000fdec9000], val: 0
==   card 0x00007fbc0d051648 [0x00000000fdec9000,0x00000000fdec9200], val: 0
==   card 0x00007fbc0d051649 [0x00000000fdec9200,0x00000000fdec9400], val: 0
==   card 0x00007fbc0d05164a [0x00000000fdec9400,0x00000000fdec9600], val: 0
==   card 0x00007fbc0d05164b [0x00000000fdec9600,0x00000000fdec9800], val: 0
==   card 0x00007fbc0d05164c [0x00000000fdec9800,0x00000000fdec9a00], val: 0
==   card 0x00007fbc0d05164d [0x00000000fdec9a00,0x00000000fdec9c00], val: 0
==   card 0x00007fbc0d05164e [0x00000000fdec9c00,0x00000000fdec9e00], val: 0
==   card 0x00007fbc0d05164f [0x00000000fdec9e00,0x00000000fdeca000], val: 0
==   card 0x00007fbc0d051651 [0x00000000fdeca200,0x00000000fdeca400], val: 0
==   card 0x00007fbc0d051652 [0x00000000fdeca400,0x00000000fdeca600], val: 0
==   card 0x00007fbc0d051655 [0x00000000fdecaa00,0x00000000fdecac00], val: 0
==   card 0x00007fbc0d051657 [0x00000000fdecae00,0x00000000fdecb000], val: 0
==   card 0x00007fbc0d05165d [0x00000000fdecba00,0x00000000fdecbc00], val: 0
==   card 0x00007fbc0d051664 [0x00000000fdecc800,0x00000000fdecca00], val: 0
==   card 0x00007fbc0d05166b [0x00000000fdecd600,0x00000000fdecd800], val: 0
==   card 0x00007fbc0d051672 [0x00000000fdece400,0x00000000fdece600], val: 0
==   card 0x00007fbc0d051679 [0x00000000fdecf200,0x00000000fdecf400], val: 0
==   card 0x00007fbc0d05167a [0x00000000fdecf400,0x00000000fdecf600], val: 0
==   card 0x00007fbc0d05167b [0x00000000fdecf600,0x00000000fdecf800], val: 0
==   card 0x00007fbc0d05167d [0x00000000fdecfa00,0x00000000fdecfc00], val: 0
==   card 0x00007fbc0d05167f [0x00000000fdecfe00,0x00000000fded0000], val: 0
==   card 0x00007fbc0d051680 [0x00000000fded0000,0x00000000fded0200], val: 0
==   card 0x00007fbc0d051681 [0x00000000fded0200,0x00000000fded0400], val: 0
==   card 0x00007fbc0d051683 [0x00000000fded0600,0x00000000fded0800], val: 0
==   card 0x00007fbc0d051685 [0x00000000fded0a00,0x00000000fded0c00], val: 0
==   card 0x00007fbc0d051686 [0x00000000fded0c00,0x00000000fded0e00], val: 0
==   card 0x00007fbc0d051687 [0x00000000fded0e00,0x00000000fded1000], val: 0
==   card 0x00007fbc0d051688 [0x00000000fded1000,0x00000000fded1200], val: 0
==   card 0x00007fbc0d051689 [0x00000000fded1200,0x00000000fded1400], val: 0
==   card 0x00007fbc0d05168b [0x00000000fded1600,0x00000000fded1800], val: 0
# To suppress the following error report, specify this argument
# after -XX: or in .hotspotrc:  SuppressErrorAt=/cardTableModRefBS.cpp:687
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (/builddir/build/BUILD/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/memory/cardTableModRefBS.cpp:687), pid=4326, tid=140445634340608
#  guarantee(!failures) failed: there should not have been any failures

The GDB tool extracted the next stack trace from the core dump:

#0  0x00007fbc1ee41625 in raise () from /lib64/libc.so.6
#1  0x00007fbc1ee42e05 in abort () from /lib64/libc.so.6
#2  0x00007fbc1e2dc173 in os::abort (dump_core=true)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:1500
#3  0x00007fbc1e4a9352 in VMError::report_and_die (this=0x7fbc0c2511c0)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/utilities/vmError.cpp:1060
#4  0x00007fbc1de441c6 in report_vm_error (
    file=0x7fbc1e5848d0 "/builddir/build/BUILD/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/memory/cardTableModRefBS.cpp",
    line=687, error_msg=0x7fbc1e5855c8 "guarantee(!failures) failed", detail_msg=0x7fbc1e5855a0 "there should not have been any failures")
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/utilities/debug.cpp:226
#5  0x00007fbc1dd1943a in CardTableModRefBS::verify_region (this=0x7fbc18040ef0, mr=..., val=32 ' ', val_equals=true)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/memory/cardTableModRefBS.cpp:687
#6  0x00007fbc1df5f4e7 in G1SATBCardTableModRefBS::verify_g1_young_region (this=0x7fbc18040ef0, mr=...)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/gc_implementation/g1/g1SATBCardTableModRefBS.cpp:123
#7  0x00007fbc1df3191f in G1CollectedHeap::verify_dirty_region (this=0x7fbc18029d10, hr=0x7fbc1845f330)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/gc_implementation/g1/g1CollectedHeap.cpp:5984
#8  0x00007fbc1df3198b in G1CollectedHeap::verify_dirty_young_list (this=0x7fbc18029d10, head=0x7fbc1844ec30)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/gc_implementation/g1/g1CollectedHeap.cpp:5993
#9  0x00007fbc1df319db in G1CollectedHeap::verify_dirty_young_regions (this=0x7fbc18029d10)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/gc_implementation/g1/g1CollectedHeap.cpp:5998
#10 0x00007fbc1df2e483 in G1CollectedHeap::do_collection_pause_at_safepoint (this=0x7fbc18029d10, target_pause_time_ms=200)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/gc_implementation/g1/g1CollectedHeap.cpp:3946
#11 0x00007fbc1e4e1dc4 in VM_G1IncCollectionPause::doit (this=0x7fbb701d20f0)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/gc_implementation/g1/vm_operations_g1.cpp:152
#12 0x00007fbc1e4e0505 in VM_Operation::evaluate (this=0x7fbb701d20f0)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/runtime/vm_operations.cpp:62
#13 0x00007fbc1e4dea8c in VMThread::evaluate_operation (this=0x7fbc1855e800, op=0x7fbb701d20f0)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/runtime/vmThread.cpp:377
#14 0x00007fbc1e4df0b1 in VMThread::loop (this=0x7fbc1855e800)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/runtime/vmThread.cpp:502
#15 0x00007fbc1e4de74d in VMThread::run (this=0x7fbc1855e800)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/share/vm/runtime/vmThread.cpp:276
#16 0x00007fbc1e2dad34 in java_start (thread=0x7fbc1855e800)
    at /usr/src/debug/java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.x86_64/openjdk/hotspot/src/os/linux/vm/os_linux.cpp:782
#17 0x00007fbc1f7d4aa1 in start_thread () from /lib64/libpthread.so.0
#18 0x00007fbc1eef793d in clone () from /lib64/libc.so.6

Now this is clear to me that the error occurs due to validation failure of card tables during VM_G1IncCollectionPause. I am suspecting that some memory corruption occurred.

The issue not reproducible on Oracle's JRE nor with default GC. We do use JNI calls.

Questions:

  1. Recommendations for additional tools to find how such scenario can happen
  2. Is the G1 implementation different between Oracle & OpenJDK projects?
  3. Why the card table validation is so important?

Update 1:

OpenJDK RPM Info:

Name        : java-1.8.0-openjdk-headless  Relocations: (not relocatable)
Version     : 1.8.0.71                          Vendor: CentOS
Release     : 1.b15.el6_7                   Build Date: Thu 21 Jan 2016 08:25:22 PM IST
Install Date: Wed 06 Apr 2016 09:28:27 AM IDT      Build Host: c6b8.bsys.dev.centos.org
Group       : Development/Languages         Source RPM: java-1.8.0-openjdk-1.8.0.71-1.b15.el6_7.src.rpm
Size        : 106657831                        License: ASL 1.1 and ASL 2.0 and GPL+ and GPLv2 and GPLv2 with exceptions and LGPL+ and LGPLv2 and MPLv1.0 and MPLv1.1 and Public Domain and W3C
Signature   : RSA/SHA1, Thu 21 Jan 2016 09:36:21 PM IST, Key ID 0946fca2c105b9de
Packager    : CentOS BuildSystem <http://bugs.centos.org>
URL         : http://openjdk.java.net/
Summary     : OpenJDK Runtime Environment
Description :
The OpenJDK runtime environment without audio and video support.

Update 2:

I managed to reproduce the crashes under Oracle's JRE version 8 update 77, but the error occurs at different method:

Crash Error 1:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc00c86f514, pid=13705, tid=140459499214592
#
# JRE version: Java(TM) SE Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0xa28514]  ObjectSynchronizer::inflate(Thread*, oopDesc*)+0x384

Stack: [0x00007fbf46865000,0x00007fbf468e6000],  sp=0x00007fbf468e0ab0,  free space=494k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0xa28514]  ObjectSynchronizer::inflate(Thread*, oopDesc*)+0x384
V  [libjvm.so+0xa299c4]  ObjectSynchronizer::FastHashCode(Thread*, oopDesc*)+0x74
V  [libjvm.so+0x70ffa8]  JVM_IHashCode+0xb8
J 605  java.lang.System.identityHashCode(Ljava/lang/Object;)I (0 bytes) @ 0x00007fbffa475eff [0x00007fbffa475e40+0xbf]
J 241498 C2 java.util.IdentityHashMap.resize(I)Z (153 bytes) @ 0x00007fbffb1bbec4 [0x00007fbffb1bbd60+0x164]
J 20915 C2 org.hibernate.engine.internal.EntityEntryContext.addEntityEntry(Ljava/lang/Object;Lorg/hibernate/engine/spi/EntityEntry;)V (212 bytes) @ 0x00007fbffbe4691c [0x00007fbffbe45c00+0xd1c]

Crash Error 2:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f5d3a49a3a5, pid=30526, tid=140038358337280
#
# JRE version: Java(TM) SE Runtime Environment (8.0_77-b03) (build 1.8.0_77-b03)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.77-b03 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x5c13a5]  G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45
#

Stack: [0x00007f5d388a8000,0x00007f5d389a9000],  sp=0x00007f5d389a72f0,  free space=1020k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x5c13a5]  G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45
V  [libjvm.so+0x5c1f5b]  G1ParScanThreadState::trim_queue()+0x4ab
V  [libjvm.so+0x59ca28]  G1ParEvacuateFollowersClosure::do_void()+0x208
V  [libjvm.so+0x5a8373]  G1ParTask::work(unsigned int)+0x463
V  [libjvm.so+0xadac8f]  GangWorker::loop()+0xcf
V  [libjvm.so+0x91d9d8]  java_start(Thread*)+0x108

The issue reported to Oracle's team.

Maxim Kirilov
  • 2,639
  • 24
  • 49

2 Answers2

2

The crash has happened V [libjvm.so+0x5c13a5] G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45 This can be caused by any bug that corrupts heap memory. It could be an issue with GC, with the compiler, with bad native code, If you have strong reproducer report to Oracle with reproducible tests and steps to reproduce. Your issue will be addressed.

Fairoz
  • 1,616
  • 13
  • 16
1

You need to clarify where you got the OpenJDK from; there are lots of sources and some of them aren't necessarily the same as the explicit tags available in the source tree. You might like to try something like Azul Zulu, which is an open-source version that has undergone the TCK test suite.

You can have a look at what the failure is by comparing the source code; here's a GitHub mirror of the HotSpot's verify_region function; there should have been some output as well to the tty that explained more.

If you run the process under gdb and put a breakpoint in the end (just before the assertion) of the CardTableModRefBS::verify_region function, you might be able to debug the failure further.

You could also try running with different card marking, such as -XX:+UseCondCardMark or -XX:-UseCondCardMark to see if switching those makes a difference.

In general there aren't differences between Oracle JDK and OpenJDK as far as the VM and GC are concerned; however, the version of OpenJDK that you have installed may not actually be from a tested or supplied source. So you should really try with a certified version of OpenJDK first before you go too much further.

AlBlue
  • 23,254
  • 14
  • 71
  • 91
  • Hi, thank you for your detailed information. I updated the question with OpenJDK's RPM information. Meanwhile I managed to reproduce the crashes on Oracle's JRE as well. The crash reported to the relevant team. I will update my question as soon as they will answer. – Maxim Kirilov Apr 14 '16 at 08:20
  • Tried to use different card marking according to your suggestion, didn't help. – Maxim Kirilov Apr 21 '16 at 06:04
  • One other possibility is that you have faulty RAM and that's causing the error. Might be worth running memtest or equivalent to rule that out as a possibility. – AlBlue Apr 21 '16 at 07:05
  • This is a VM, reproduced on any installation. – Maxim Kirilov Apr 21 '16 at 07:49