I am investigating a very peculiar problem on our lab servers. Whenever we run a Java program on a machine with a 64-bit SUSE SLES11 installation that is accessed through Citrix, it just hangs. The machine has the latest updates, but that doesn't help. Changing any one of these circumstances makes it work: a 32-bit OS, SLES10.2, or access via Cygwin/Exceed. Other X applications, such as xclock, work fine.

This might look like a ServerFault question so far, but what I'm actually looking for is suggestions for software I can use to trace what the program is doing. According to strace, it hangs on a FUTEX_WAIT:

futex(0x7f4e3eaab9e0, FUTEX_WAIT, 19686, NULL

The trace stops right after the NULL and stays there indefinitely. I have found a previous bug report that looks a little similar to this problem, but the circumstances are very different.

UPDATE: Apparently, futex_wait problems are a sign of strange race conditions in the kernel/libc locking up processes. I will have to try a newer kernel/libc and see whether either makes any difference.

UPDATE2: Kernel/libc changes made no difference. I did manage to start jvisualvm with a predictable external JMX port, let it hang, and connect to it from another machine, at which point I found this in the thread trace for main:

Name: main
State: RUNNABLE
Total blocked: 0  Total waited: 0

Stack trace: 
sun.awt.X11GraphicsDevice.getDoubleBufferVisuals(Native Method)
sun.awt.X11GraphicsDevice.makeDefaultConfiguration(X11GraphicsDevice.java:208)
sun.awt.X11GraphicsDevice.getDefaultConfiguration(X11GraphicsDevice.java:182)
   - locked java.lang.Object@1c190c99
sun.awt.X11.XToolkit.<clinit>(XToolkit.java:92)
java.lang.Class.forName0(Native Method)
java.lang.Class.forName(Class.java:169)
java.awt.Toolkit$2.run(Toolkit.java:834)
java.security.AccessController.doPrivileged(Native Method)
java.awt.Toolkit.getDefaultToolkit(Toolkit.java:826)
   - locked java.lang.Class@308a1f38
org.openide.util.ImageUtilities.ensureLoaded(ImageUtilities.java:519)
org.openide.util.ImageUtilities.access$200(ImageUtilities.java:80)
org.openide.util.ImageUtilities$ToolTipImage.createNew(ImageUtilities.java:699)
org.openide.util.ImageUtilities.getIcon(ImageUtilities.java:487)
   - locked java.util.HashMap@3c07ae6d
org.openide.util.ImageUtilities.getIcon(ImageUtilities.java:361)
   - locked java.util.HashMap@1c4c94e5
org.openide.util.ImageUtilities.loadImage(ImageUtilities.java:139)
org.netbeans.core.startup.Splash.loadContent(Splash.java:262)
org.netbeans.core.startup.Splash$SplashComponent.<init>(Splash.java:344)
org.netbeans.core.startup.Splash.<init>(Splash.java:170)
org.netbeans.core.startup.Splash.getInstance(Splash.java:102)
org.netbeans.core.startup.Main.start(Main.java:301)
org.netbeans.core.startup.TopThreadGroup.run(TopThreadGroup.java:110)
java.lang.Thread.run(Thread.java:619)
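
The trace shows main inside Toolkit.getDefaultToolkit(), so the hang can probably be reproduced without NetBeans or jvisualvm. Below is a minimal sketch of such a reproducer. The class name ToolkitHangProbe, the helper finishesWithin and the 10-second timeout are my own assumptions, not part of any of the programs above.

```java
import java.awt.Toolkit;
import java.util.concurrent.Callable;

class ToolkitHangProbe {
    /**
     * Runs the task on a daemon thread and waits up to timeoutMillis.
     * Returns true if the task returned (or failed) in time. On timeout it
     * prints the worker's current stack and returns false.
     */
    static boolean finishesWithin(final Callable<?> task, long timeoutMillis) {
        Thread worker = new Thread(() -> {
            try {
                task.call();
            } catch (Throwable t) {
                t.printStackTrace(); // a failure is not a hang, so just report it
            }
        }, "probe");
        worker.setDaemon(true); // a hung worker must not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!worker.isAlive()) {
            return true;
        }
        System.err.println("Still running after " + timeoutMillis
                + " ms, state " + worker.getState());
        for (StackTraceElement frame : worker.getStackTrace()) {
            System.err.println("    at " + frame);
        }
        return false;
    }

    public static void main(String[] args) {
        // The same call that appears in the hung stack trace.
        boolean ok = finishesWithin(Toolkit::getDefaultToolkit, 10_000);
        System.out.println(ok ? "Toolkit initialized"
                              : "Toolkit initialization appears hung");
    }
}
```

Running this inside the Citrix session and again over Cygwin/Exceed should show whether the AWT toolkit initialization alone is enough to trigger the problem.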

Tried the deadlock detection button in jvisualvm but it discovered no deadlocks.
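
That result is consistent with the trace: main is RUNNABLE inside a native method, not waiting on a Java lock. As far as I know, the JVM's built-in detection (ThreadMXBean.findDeadlockedThreads) only reports cycles of threads waiting on monitors or java.util.concurrent locks; I am assuming jvisualvm's button does something equivalent. A rough sketch of the same check in code, with a hypothetical class name:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

class DeadlockCheck {
    /** Number of threads that are part of a Java-level lock cycle (0 if none). */
    static int deadlockedThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] ids = threads.findDeadlockedThreads(); // null when no cycle is found
        return ids == null ? 0 : ids.length;
    }

    /** One line per live thread: name, state, and whether it is in native code. */
    static String threadStates() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.getThreadInfo(threads.getAllThreadIds())) {
            if (info == null) {
                continue; // the thread ended between the two calls
            }
            out.append(info.getThreadName())
               .append(": ")
               .append(info.getThreadState())
               .append(info.isInNative() ? " (in native code)" : "")
               .append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("Deadlocked threads: " + deadlockedThreadCount());
        System.out.print(threadStates());
    }
}
```

A thread stuck in a native call such as getDoubleBufferVisuals would show up here as RUNNABLE and in native code, and would not be counted as deadlocked.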

Currently talking to Citrix Europe about this problem and delivering traces to them. Will update this question if it gets solved.

UPDATE 3: This problem has been traced to Citrix and has been submitted with service request number 60235154. At the moment, the problem seems to be either somewhere in Java or in the Citrix implementation of X11.

Stefan Thyberg
  • We are currently having the same problem on a Citrix client. Seeing as the original post is more than 6 years old, is there any news on this? – gsl Feb 16 '17 at 16:51

5 Answers

ltrace traces shared-library function calls. That can give you a higher-level view of things, but it can also spew far more output than strace, since many library functions (e.g. strcmp) don't result in system calls.

But futex is used for locking, so if you get stuck at futex, you have probably deadlocked. Or you're just looking at one thread that is waiting for other threads. With -f, ltrace and strace follow clone/fork to trace all threads and all child processes.

In gdb, thread apply all <command> is sometimes useful for multithreaded processes, e.g. thread apply all bt to get a backtrace of every thread.

Peter Cordes

Do you have source code for the Java program? If so, you can remotely debug it using Eclipse or another IDE. If you don't have source code, your options are more limited, but you can try connecting to the process via JConsole to gain some insight into what's happening. Java profiling tools are another option, but harder to set up.
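
If the GUI tools themselves hang, the same thread information can be fetched over JMX from a small command-line client. This is only a sketch: it assumes the target JVM was started with remote JMX enabled (for example -Dcom.sun.management.jmxremote.port=9999 together with -Dcom.sun.management.jmxremote.authenticate=false and -Dcom.sun.management.jmxremote.ssl=false, which is only sensible on a trusted network). The class name, the default host and the port 9999 are placeholders.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

class RemoteThreadDump {
    /** Standard RMI-based JMX URL for a JVM exposing its agent on host:port. */
    static String serviceUrl(String host, int port) {
        return "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi";
    }

    /** Full stack traces of all threads seen through the given connection. */
    static String dump(MBeanServerConnection connection) {
        ThreadMXBean threads;
        try {
            threads = ManagementFactory.newPlatformMXBeanProxy(
                    connection, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        StringBuilder out = new StringBuilder();
        // true, true: also collect locked monitors and locked synchronizers
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost"; // placeholder default
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 9999; // placeholder default
        JMXServiceURL url = new JMXServiceURL(serviceUrl(host, port));
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            System.out.print(dump(connector.getMBeanServerConnection()));
        }
    }
}
```

Run it from a machine where Java works, passing the host and port of the hung process as arguments.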

Rob H
  • One of the problems here is that JConsole is one of the programs that are failing, otherwise I'd do just that. I'll try starting Eclipse, but I suspect that's going to hang as well. – Stefan Thyberg Oct 10 '09 at 16:17
  • You can run JConsole remotely. Make sure it's the Java 6 version. – Rob H Oct 11 '09 at 16:50
  • We use jconsole and jvisualvm to reproduce this problem. It doesn't seem possible to pass VM arguments to jconsole, but it is possible with jvisualvm, so I did that today. After it hung, I connected to it remotely with another instance of jvisualvm on another machine, which gave some more clues about what might be wrong. Updated my question with the result. – Stefan Thyberg Jan 12 '10 at 12:13

Maybe jvisualvm, which comes with the Java distribution from Sun, has what you need. You can record the state of the virtual machine while your program is running and also tell it to save stack dumps to a file that you can open and look at later. Look for jvisualvm in the bin directory of your JDK. Here's where you can see more documentation: http://java.sun.com/javase/6/docs/technotes/tools/share/jvisualvm.html

Good luck!

mring

Use gdb to attach to the process. gdb isn't exactly intuitive, but there are a lot of howtos and similar on the net.

http://dirac.org/linux/gdb/06-Debugging_A_Running_Process.php

Gunther Piez

See this solution I have found.

In this case the hangs were caused by slow generation of random bytes from /dev/random.

The Java application waits for a very long time to get random bytes.

This is not really a solution, but rather a workaround, since /dev/random will then behave the same as /dev/urandom.
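
To check whether slow random seeding is what you are hitting, something like the following sketch can time it. The class name and the 16-byte seed size are arbitrary choices of mine. On some Linux JVM configurations SecureRandom.generateSeed reads from a blocking source such as /dev/random, but that depends on the JVM version and its security settings, so treat the result as a hint only.

```java
import java.security.SecureRandom;
import java.security.Security;

class EntropyCheck {
    /** Shows which seed source the JVM is configured to use, if any is set. */
    static String describeSource() {
        String override = System.getProperty("java.security.egd");     // command-line override
        String configured = Security.getProperty("securerandom.source"); // from java.security
        return "java.security.egd=" + override + ", securerandom.source=" + configured;
    }

    /** Milliseconds needed to obtain numBytes of seed material. */
    static long seedMillis(int numBytes) {
        long start = System.nanoTime();
        new SecureRandom().generateSeed(numBytes); // may block if entropy is scarce
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        System.out.println(describeSource());
        System.out.println("Seeding took " + seedMillis(16) + " ms");
    }
}
```

If seeding takes seconds rather than milliseconds, a commonly used JVM-level alternative to changing the device itself is starting Java with -Djava.security.egd=file:/dev/./urandom. That option is not part of the linked solution; I mention it only as another way to get the same effect for a single process.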

TTT