17

My Java application has started to crash regularly with a SIGSEGV, leaving behind a dump of stack data and a lot of other information in a text file.

I have debugged C programs in gdb and I have debugged Java code from my IDE. I'm not sure how to approach C-like crashes in a running Java program.

I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code. However, I have no idea how I could even cause segfaults with Java code. There definitely is enough memory available, and when I last checked in the profiler, heap usage was around 50% with occasional spikes around 80%. Are there any startup parameters I could investigate? What is a good checklist when approaching a bug like this?

Though I'm not so far able to reliably reproduce the event, it does not seem to occur entirely at random either, so testing is not completely impossible.

ETA: Some of the gory details

(I'm looking for a general approach, since the actual problem might be very specific. Still, there's some info I already collected and that may be of some value.)

A while ago, I had similar-looking trouble after upgrading my CI server (see here for more details), but that fix (setting -XX:MaxPermSize) did not help this time.

Further investigation revealed that in the crash log files the thread marked as "current thread" is never one of mine, but either one called "VMThread" or one called "GCTaskThread". If it's the latter, it is additionally marked with the comment "(exited)"; if it's the former, the GCTaskThread is not in the list. This makes me suppose that the problem might be around the end of a GC operation.
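
To compare several crash logs side by side, it can help to pull out just the signal line, the "Problematic frame" entry and the "Current thread" line from each one. The following is a minimal sketch, assuming the usual HotSpot `hs_err_pid*.log` layout in which the frame is printed on the line after the `# Problematic frame:` heading; the class name `HsErrSummary` and the method `keyLines` are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class HsErrSummary {

    /** Returns the signal line, the problematic frame and the current-thread line, in file order. */
    public static List<String> keyLines(List<String> logLines) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i < logLines.size(); i++) {
            String line = logLines.get(i);
            if (line.startsWith("#") && line.contains("SIGSEGV")) {
                // Header line such as "#  SIGSEGV (0xb) at pc=..., pid=..., tid=..."
                found.add(line.replaceFirst("^#\\s*", ""));
            } else if (line.startsWith("# Problematic frame:") && i + 1 < logLines.size()) {
                // Assumption: the frame itself is on the line after the heading.
                found.add("frame: " + logLines.get(i + 1).replaceFirst("^#\\s*", ""));
            } else if (line.startsWith("Current thread")) {
                found.add(line.trim());
            }
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HsErrSummary hs_err_pid1234.log hs_err_pid5678.log ...
        for (String path : args) {
            System.out.println(path);
            for (String s : keyLines(Files.readAllLines(Paths.get(path)))) {
                System.out.println("  " + s);
            }
        }
    }
}
```

Running it over all logs at once makes it easier to see whether the crashing thread is always a VM or GC thread and whether the frames cluster in one library.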

Cœur
Hanno Fietz
  • Can you get a stack trace? Is it SEGV at the same place? Could we have more info to work on? – Ed Heal Aug 30 '11 at 22:45
  • Is there any native code in your application? If the JVM allows any collection of bytecode, no matter how buggy that bytecode may be, to provoke a segfault, then _ipso facto_ you're looking at a JVM (or JRE) bug. – hmakholm left over Monica Aug 30 '11 at 22:58
  • @Ed - I have plenty of stack trace, but it's a huge wall of text. What part would be most useful to post? I'm mainly looking for a general way to approach this type of problem, therefore I'm hesitant to dump a load of very specific info here. – Hanno Fietz Aug 30 '11 at 22:59
  • @Henning - Maybe. I have statically weaved classes (the eclipselink ORM). In fact, I started seeing the problem after I introduced them (before that I had dynamic weaving, which turned out not to work). However, without the weaved classes, I have a whole different problem set which might well have obscured the segfaults, so I can't assume causality here. – Hanno Fietz Aug 30 '11 at 23:03
  • @Henning - I also have profiler classes added to `-Xbootclasspath`, and I don't really understand either how the profiler works and what the bootclasspath is, exactly. Additionally, I'm running in debug mode (with `-Xdebug -Xrunjdwp`), if it matters. – Hanno Fietz Aug 30 '11 at 23:09
  • @Ed - I had a look at the various log files that were produced by the crashes, and the "problematic frame" line at the top gives me all sorts of places in the `libjvm.so`, so, no, the segfault doesn't seem to be in the same place each time. – Hanno Fietz Aug 30 '11 at 23:21
  • @HannoFietz: You can use GDB to trace a Java application as well, why not? I've also used Cygwin GDB to track down a problem on the Win32 platform. What do you mean by "statically weaved classes"? AFAIK, weaving can be either compile-time or load-time. The latter means that there is a Java bytecode injector (e.g. CGLIB), which deserves only minor attention (I believe it can't crash the interpreting machine, i.e. the JVM). What Henning was asking about: do you have JNI/JNA bridges to native code? Any suspicious mapped SO libraries? – dma_k Aug 30 '11 at 23:58
  • @dma_k - No JNI that I know of. By "static weaving", I'm referring to compile-time weaving. Wouldn't I have to recompile the JVM (with debug symbols) to debug it in GDB? – Hanno Fietz Aug 31 '11 at 00:03
  • Are you running a MacOSX with 1.5 java? It could throw segfaults in case of a stack overflow. Try to increase stack size. – Denis Tulskiy Aug 31 '11 at 03:50
  • @HannoFietz: Recompiling JVM might be adventurous. First try with different JVMs (IBM, OpenJDK) – maybe this will bring you some idea. – dma_k Sep 04 '11 at 11:58
  • @HannoFietz: Have you run this same setup on a different machine? Faults like you describe are not uncommon around hardware failure, such as bad memory. Also, do you have details (or did I miss them) about the hardware/OS you are running with? And, are you on a virt or actual hardware? – philwb Sep 10 '11 at 16:55
  • @dma_k - I tried OpenJDK and Sun, but as I understand it, these are based on the same source code, is that right? – Hanno Fietz Sep 14 '11 at 07:46
  • @philwb - I have not. I did get new RAM installed on the machine, but I'm not certain that this coincided. Will add machine details. – Hanno Fietz Sep 14 '11 at 07:48
  • @HannoFietz: I'd still recommend running on a separate machine to see if you crash there, as well, or, at least running memory and other hardware diagnostics on the current box to see if something turns up. – philwb Sep 14 '11 at 13:55
  • Which OS are you running on? Which JVM? Do you have native code in your app? Which GC are you using? – kittylyst Sep 14 '11 at 17:21

5 Answers

23

I'm assuming I'm not looking at a JVM bug here. Other Java programs run just fine, and the JVM from Sun is probably more stable than my code.

I don't think you should make that assumption. Without using JNI, you should not be able to write Java code that causes a SIGSEGV (although we know it happens). My point is, when it happens, it is either a bug in the JVM (not unheard of) or a bug in some JNI code. If you don't have any JNI in your own code, that doesn't mean that you aren't using some library that does, so look for that. When I have seen this kind of problem before, it was in an image manipulation library. If the culprit isn't in your own JNI code, you probably won't be able to 'fix' the bug, but you may still be able to work around it.

First, you should get an alternate JVM on the same platform and try to reproduce it. You can try one of these alternatives.

If you cannot reproduce it, it likely is a JVM bug. From that, you can either mandate a particular JVM or search the bug database, using what you know about how to reproduce it, and maybe get suggested workarounds. (Even if you can reproduce it, many JVM implementations are just tweaks on Oracle's Hotspot implementation, so it might still be a JVM bug.)

If you can reproduce it with an alternative JVM, the fault might be that you have some JNI bug. Look at what libraries you are using and what native calls they might be making. Sometimes there are alternative "pure Java" configurations or jar files for the same library or alternative libraries that do almost the same thing.
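
One rough way to look for libraries that bring their own native code is to scan the jars on the classpath for bundled shared objects. This is only a sketch under a few assumptions: the class name `NativeLibScan` is made up, it only checks file extensions, and it will miss libraries that load native code from system paths via `System.loadLibrary` or that use versioned names such as `libfoo.so.1`.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class NativeLibScan {

    /** Heuristic: treats common shared-library extensions as native code. */
    public static boolean looksNative(String entryName) {
        String n = entryName.toLowerCase();
        return n.endsWith(".so") || n.endsWith(".dll")
                || n.endsWith(".dylib") || n.endsWith(".jnilib");
    }

    /** Lists the entries of one jar that look like bundled native libraries. */
    public static List<String> nativeEntries(File jar) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(jar)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (looksNative(name)) {
                    hits.add(name);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Walks every jar named on the classpath of this process.
        String classPath = System.getProperty("java.class.path");
        for (String element : classPath.split(File.pathSeparator)) {
            File f = new File(element);
            if (f.isFile() && element.endsWith(".jar")) {
                for (String hit : nativeEntries(f)) {
                    System.out.println(element + " -> " + hit);
                }
            }
        }
    }
}
```

Run it with the same classpath as the crashing application. The "Dynamic libraries" section of the crash log is a useful cross-check, since it lists what was actually mapped into the process.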

Good luck!

jhericks
  • 4
    +1 for "you probably won't be able to 'fix' the bug": so the answer to the poster's question of "How do I debug Segfaults occurring in the JVM when it runs my code?" is "*you* don't". – Raedwald Sep 13 '11 at 11:57
9

The following will almost certainly be useless unless you have native code. However, here goes.

  1. Start the Java program in the Java debugger, with a breakpoint well before the possible SIGSEGV.
  2. Use the ps command to obtain the process ID of java.
  3. gdb /usr/lib/jvm/sun-java6/bin/java processid
  4. Make sure that the gdb 'handle' command is set to stop on SIGSEGV.
  5. Continue in the Java debugger from the breakpoint.
  6. Wait for the explosion.
  7. Use gdb to investigate.

If you've really managed to make the JVM take a SIGSEGV without any native code of your own, you are very unlikely to make any sense of what you will see next, and the best you can do is attach a test case to a bug report.

bmargulies
  • 1
    Would that require a special version of the JVM? From C, I'm used to having to recompile with debug symbols when I want to use gdb. – Hanno Fietz Sep 14 '11 at 07:53
  • The JVM in my experience always has enough symbols for backtraces. If you really intend to debug it in detail, well, off to openJDK and a debug build. – bmargulies Sep 14 '11 at 12:29
2

I found a good list at http://www.oracle.com/technetwork/java/javase/crashes-137240.html. As I'm getting the crashes during GC, I'll try switching between garbage collectors.

I tried switching between the serial and the parallel GC (the latter being the default on a 64-bit Linux server); this only changed the error message accordingly.

Reducing the max heap size from 16G to 10G after a fresh analysis in the profiler (which gave me a heap usage flattening out at 8G) did lead to a significantly lower "Virtual Memory" footprint (16G instead of 60G), but I don't even know what that means, and the Internet says it doesn't matter.

Currently, the JVM is running in client mode (using the -client startup option thus overriding the default of -server). So far, there's no crash, but the performance impact seems rather large.
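
When flipping between options such as `-XX:+UseSerialGC`, `-XX:+UseParallelGC`, `-Xmx10g` or `-client`, it is easy to lose track of what the running JVM actually picked up. A small sketch that reports this from inside the process, using only the standard `java.lang.management` API (the class name `VmConfigReport` is made up for illustration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class VmConfigReport {

    /** Builds a short text report of the VM flavour, max heap, active collectors and JVM flags. */
    public static String describe() {
        StringBuilder sb = new StringBuilder();
        // The VM name usually says whether the client or the server VM is running.
        sb.append("VM: ").append(System.getProperty("java.vm.name")).append('\n');
        sb.append("Max heap (MiB): ")
          .append(Runtime.getRuntime().maxMemory() / (1024 * 1024)).append('\n');
        // One bean per collector; the names differ depending on the GC in use.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append("GC: ").append(gc.getName()).append('\n');
        }
        // The options passed to the JVM itself (not the program arguments).
        sb.append("Flags: ")
          .append(ManagementFactory.getRuntimeMXBean().getInputArguments()).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Logging this once at startup means every crash log can later be matched to the exact configuration that produced it.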

Hanno Fietz
0

Check whether a crash in native C code is what brought down the JVM. Use valgrind to find invalid memory accesses, and also double-check the stack size.

Rohit
0

If you have a core file, you could try running jstack on it, which would give you something a little more comprehensible (see http://download.oracle.com/javase/6/docs/technotes/tools/share/jstack.html), although if it's a bug in the GC thread it may not be all that helpful.

Alan Burlison