Our application exposes a JSON-RPC API, using Jetty and Google Gson to parse/generate the JSON code.
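
For context, here is a minimal sketch of the kind of setup described: a Jetty handler that parses each JSON-RPC request with Gson and writes a JSON response. The class name, the dispatch() method, and the port are hypothetical illustrations, not code from our application.

    import java.io.IOException;

    import javax.servlet.ServletException;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    import org.eclipse.jetty.server.Request;
    import org.eclipse.jetty.server.Server;
    import org.eclipse.jetty.server.handler.AbstractHandler;

    import com.google.gson.Gson;
    import com.google.gson.JsonObject;

    // Hypothetical sketch of a Jetty + Gson JSON-RPC endpoint, for illustration only.
    public class JsonRpcHandler extends AbstractHandler {

        private final Gson gson = new Gson();

        @Override
        public void handle(String target, Request baseRequest,
                           HttpServletRequest request, HttpServletResponse response)
                throws IOException, ServletException {
            // Parse the JSON-RPC request body with Gson.
            JsonObject rpcRequest = gson.fromJson(request.getReader(), JsonObject.class);

            // CPU-only processing, as described above; dispatch() is a placeholder.
            JsonObject result = dispatch(rpcRequest);

            // Build and write the JSON-RPC response.
            JsonObject rpcResponse = new JsonObject();
            rpcResponse.addProperty("jsonrpc", "2.0");
            rpcResponse.add("id", rpcRequest.get("id"));
            rpcResponse.add("result", result);

            response.setContentType("application/json");
            response.getWriter().write(gson.toJson(rpcResponse));
            baseRequest.setHandled(true);
        }

        private JsonObject dispatch(JsonObject rpcRequest) {
            // Placeholder for the CPU-only request processing.
            JsonObject result = new JsonObject();
            result.addProperty("ok", true);
            return result;
        }

        public static void main(String[] args) throws Exception {
            Server server = new Server(8080);   // port is arbitrary here
            server.setHandler(new JsonRpcHandler());
            server.start();
            server.join();
        }
    }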

A remote system opens several thousand persistent HTTP connections to our application and starts sending API requests at a rate of about 50 per second. Our application generates each response using CPU only (i.e. no disk or database access). Our app is running on an EC2 virtual machine.

When our app first starts, its typical response time is 1-2 ms; however, over the course of several hours this increases steadily until it eventually reaches 80 ms, which is too slow for our application. Here is a graph showing response time in nanoseconds; note the steady increase.

[Graph of API response time in nanoseconds]

I have used the YourKit profiler to capture CPU snapshots shortly after startup and again later when the app is significantly slower. The problem is that no single method seems to account for the slowdown; everything just gets slower over time.

The number of threads and memory usage don't seem to increase either, so I'm now at a loss as to the likely cause of the slowdown.
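
(One way to double-check that is to sample thread count and heap usage with the standard java.lang.management beans; the snippet below is a hypothetical illustration, not code from our application.)

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.ThreadMXBean;

    // Hypothetical snippet: samples live thread count and used heap once a
    // minute via JMX, to confirm neither grows alongside the response time.
    public class RuntimeStatsLogger {

        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            while (true) {
                long usedHeapMb = memory.getHeapMemoryUsage().getUsed() / (1024 * 1024);
                System.out.printf("threads=%d usedHeapMB=%d%n",
                        threads.getThreadCount(), usedHeapMb);
                Thread.sleep(60000);   // sample once a minute
            }
        }
    }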

Does anyone have any ideas as to what the cause might be, or a suggestion as to a more effective way to isolate the problem?

asked by sanity
  • Some questions: Does the increase in processing time continue to rise until the server is restarted, or does it roller-coaster on the graph? How long-lived are the persistent connections? – Joakim Erdfelt Feb 25 '13 at 19:17
  • The increase continues until the server is restarted, I've seen it go as high as 80ms, which is when I restarted it. The persistent connections last 30 seconds, at which point they are terminated and a new connection is opened. – sanity Feb 25 '13 at 19:18
  • Have you collected and graphed your GC logs to see if there is a similar trend in memory usage and GC events? (See the GC-monitoring sketch after these comments.) – Joakim Erdfelt Feb 25 '13 at 19:29
  • I'm seeing the occasional minor collection ("PS Scavenge" according to yourkit), they seem to occur every minute or so initially, but after 90 minutes they're occurring every 2-3 seconds. I'm not sure if that is sufficient to account for the slowdown though. – sanity Feb 25 '13 at 19:38
  • I am leaning towards GC/memory problems. Take a memory snapshot after 90 mins and see if there is something strange. Like Joakim said, GC logs will help a lot more than a profiler to understand what's going on. A young GC every 2-3 seconds typically suggests too small a heap. Finally, I would not be surprised if the slowdown is actually caused by the profiling tools you're using; have you tried without? – sbordet Feb 25 '13 at 20:30
  • I've got a 7GB heap, so it seems unlikely that it would be too small. I don't think the profiler is causing the problem as I first observed the problem before attaching the profiler. I'm currently trying it with the G1 garbage collector to see if that helps. – sanity Feb 25 '13 at 20:39
  • G1 didn't help, trying again without the profiler to see whether it was the culprit. – sanity Feb 25 '13 at 21:18
  • Please post your GC settings. Are you setting G1 latency targets? Have you tried running with a smaller heap? – Nitsan Wakart Mar 12 '13 at 22:38
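
Following up on the GC suggestions in the comments above: GC logging can be enabled with the standard HotSpot flags (for example -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log on a JDK 6/7 JVM) and the resulting log graphed against the response-time chart. As a cross-check, the sketch below samples per-collector counts and cumulative GC time through the GarbageCollectorMXBean; the class name and the one-minute interval are hypothetical, not part of the application.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Hypothetical sketch: logs per-collector GC counts and cumulative GC time
    // once a minute, to see whether GC activity rises with the response time.
    public class GcStatsLogger {

        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("%s: collections=%d totalTimeMs=%d%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(60000);   // sample once a minute
            }
        }
    }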

0 Answers