7

One day our Java web application went up to 100% CPU usage. A restart resolved the incident but not the underlying problem, because it came back a few hours later. We suspected an infinite loop introduced by a new version, but we had not made any change to the code or to the server.

We managed to find the problem by taking several thread dumps with kill -QUIT and comparing the details of every thread. We found that one thread's call stack appeared in all the thread dumps. After analysis, it turned out that a while loop condition never became false for some data that was regularly updated in the database.
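The manual comparison described above can be approximated in-process. The following is only a sketch under stated assumptions: it needs Java 8 or later, the class name `StuckThreadFinder` and the method `findSuspects` are hypothetical, and it has to run inside the JVM being inspected. It samples all stacks a few times with `Thread.getAllStackTraces()` and keeps, for each thread that was RUNNABLE in every sample, the frames common to all samples. Line numbers are ignored because a looping method is usually caught at a different line each time. Threads blocked in native I/O also report RUNNABLE, so the result is a shortlist to review, not a verdict.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StuckThreadFinder {

    /** Reduces a stack to "Class.method" strings; line numbers are dropped on purpose. */
    static List<String> frames(StackTraceElement[] stack) {
        List<String> out = new ArrayList<>();
        for (StackTraceElement e : stack) {
            out.add(e.getClassName() + "." + e.getMethodName());
        }
        return out;
    }

    /**
     * Takes `samples` stack snapshots, `intervalMillis` apart, and returns the threads
     * that were RUNNABLE in every snapshot, mapped to the frames seen in all of them.
     */
    static Map<String, List<String>> findSuspects(int samples, long intervalMillis) {
        Map<Thread, List<String>> common = new LinkedHashMap<>();
        for (int i = 0; i < samples; i++) {
            Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
            if (i == 0) {
                // First snapshot: every RUNNABLE thread except the sampler is a candidate.
                for (Map.Entry<Thread, StackTraceElement[]> e : dump.entrySet()) {
                    Thread t = e.getKey();
                    if (t != Thread.currentThread() && t.getState() == Thread.State.RUNNABLE) {
                        common.put(t, frames(e.getValue()));
                    }
                }
            } else {
                // Later snapshots: drop threads that ended or stopped running,
                // and keep only the frames that are still on the stack.
                Iterator<Map.Entry<Thread, List<String>>> it = common.entrySet().iterator();
                while (it.hasNext()) {
                    Map.Entry<Thread, List<String>> e = it.next();
                    StackTraceElement[] now = dump.get(e.getKey());
                    if (now == null || e.getKey().getState() != Thread.State.RUNNABLE) {
                        it.remove();
                    } else {
                        e.getValue().retainAll(frames(now));
                    }
                }
            }
            if (i < samples - 1) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException ex) {
                    Thread.currentThread().interrupt();
                    break; // stop sampling early, report what we have
                }
            }
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<Thread, List<String>> e : common.entrySet()) {
            if (!e.getValue().isEmpty()) {
                Thread t = e.getKey();
                result.put(t.getName() + " (id " + t.getId() + ")", e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, List<String>> e : findSuspects(5, 1000L).entrySet()) {
            System.out.println(e.getKey());
            for (String frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```

In a web application this could be called from an admin-only endpoint. The sampling interval and the number of samples are arbitrary choices here; longer intervals reduce false positives from threads that are merely busy for a moment.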

Analyzing several thread dumps of a web application is really tedious.

So do you know any better way or tool to find such an issue in a production environment?

Franck
  • possible duplicate of [Rare infinite loop in code, don't want to wait for it to happen again](http://stackoverflow.com/questions/5753268/rare-infinite-loop-in-code-dont-want-to-wait-for-it-to-happen-again) – Joachim Sauer May 04 '11 at 12:46
  • 3
    It's rare to find an exact duplicate of such a specific question ;-) – Joachim Sauer May 04 '11 at 12:46
  • I don't think this is a dupe, especially since his self-answer found monitoring software that helps him. – John Saunders May 08 '11 at 16:18

3 Answers

7

After some searching, I found an answer in Monitoring and Managing Java SE 6 Platform Applications:

You can diagnose a looping thread with JTop, a tool provided with the JDK, which shows the CPU time each thread is using.

With the thread name, you can find the stack trace of this thread in the “Threads” tab or by making a thread dump with kill -QUIT.

You can now focus on the code that causes the infinite loop.
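The same per-thread CPU figures are available programmatically through the standard `java.lang.management.ThreadMXBean`. The sketch below is a minimal illustration and not JTop's actual code: the class name `CpuTimeByThread` and the helper `topThreads` are hypothetical, and it assumes Java 8 or later and a JVM that supports thread CPU time measurement. It lists live threads sorted by consumed CPU time together with the top of each stack.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CpuTimeByThread {

    /** One report row: a thread snapshot and the CPU time it has consumed so far. */
    static final class Row {
        final ThreadInfo info;
        final long cpuNanos;

        Row(ThreadInfo info, long cpuNanos) {
            this.info = info;
            this.cpuNanos = cpuNanos;
        }
    }

    /** Returns live threads sorted by CPU time, highest first (empty if unsupported). */
    static List<Row> topThreads(int maxStackDepth) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Row> rows = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported()) {
            return rows;
        }
        if (!mx.isThreadCpuTimeEnabled()) {
            mx.setThreadCpuTimeEnabled(true);
        }
        for (long id : mx.getAllThreadIds()) {
            long cpu = mx.getThreadCpuTime(id);                     // -1 if the thread has died
            ThreadInfo info = mx.getThreadInfo(id, maxStackDepth);  // null if the thread has died
            if (cpu >= 0 && info != null) {
                rows.add(new Row(info, cpu));
            }
        }
        rows.sort(Comparator.comparingLong((Row r) -> r.cpuNanos).reversed());
        return rows;
    }

    public static void main(String[] args) {
        for (Row r : topThreads(5)) {
            System.out.printf("%-40s %8d ms%n",
                    r.info.getThreadName(), r.cpuNanos / 1_000_000);
            for (StackTraceElement e : r.info.getStackTrace()) {
                System.out.println("    at " + e);
            }
        }
    }
}
```

A looping thread shows up as the one whose CPU time keeps growing between two runs of the report, which is the signal JTop displays in its table.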

PS.: It seems OK to answer my own question according to https://blog.stackoverflow.com/2008/07/stack-overflow-private-beta-begins/ : […] “yes, it is OK and even encouraged to answer your own questions, if you find a good answer before anyone else.” […]

PS.: In case the sun.com domain no longer exists: you can run JTop as a stand-alone GUI:

$ <JDK>/bin/java -jar <JDK>/demo/management/JTop/JTop.jar

Alternately, you can run it as a JConsole plug-in:

$ <JDK>/bin/jconsole -pluginpath <JDK>/demo/management/JTop/JTop.jar 
Franck
3

Fix the problem before it occurs! Use a static analysis tool like FindBugs or PMD as part of your build system. It won't find everything, but it is a good first step.

David Grant
1

Consider using a coverage tool like Cobertura. It would have shown you that these code paths were not tested.

Testing something like this can become really cumbersome, so try to avoid it by introducing quality measurements.

Anyway, tools like VisualVM give you a nice overview of all threads, so it becomes relatively easy to identify threads that have been working for an unexpectedly long time.

ffray
  • 2
    Code coverage is not enough. Even if we have 100% code coverage, the bug depends on the value of a string argument. See http://www.ibm.com/developerworks/java/library/j-cq01316 – Franck May 04 '11 at 13:03
  • 1
    Not to be blunt, but I could make the argument that your tests weren't sufficient then. This isn't so much a criticism of your testing practices as it is a suggestion to use circumstances like these to improve them. You just learned that a past assumption was false. You now need tests that expose the conditions of that assumption. – Mike Yockey May 04 '11 at 13:19
  • @yock: are you answering Franck? If so, I don't see the connection. Yes, good tests could possibly have found that infinite loop. But 100% test coverage doesn't necessarily mean good tests, and thus pure code coverage is not a valid way to measure test quality. Is there something I'm missing? – Joachim Sauer May 04 '11 at 13:29
  • I was answering Franck, yes. Code coverage by itself isn't terribly useful for anything other than convincing management that you're producing *something*. – Mike Yockey May 04 '11 at 13:31
  • Coverage is a measurement indicator only. But it's good enough to tell you which parts of your code have been tested and which have not. If you had tested this, there would be no need for any coverage tools, nor for hunting infinite loops in production :-) – ffray May 04 '11 at 13:33
  • I edited the question to focus on solving the problem in a production environment. Prevention methods are great but not always enough. – Franck May 04 '11 at 13:35