0

I am using ColdFusion 11 and there are more than 20 web sites setup using IIS.

The issue I am facing, in Windows tasks manager memory for the ColdFusion process is keep increasing and when it crosses 90% then all the sites stop working and messages show “Internal server error”. After restarting the server, it started working again.

Server has a total 30GB memory and 20GB is allocated to ColdFusion JVM. FusionReactor is also installed on my server.

Here are some of my findings.

  1. Database server is working properly. All the queries are completed in a few seconds.

  2. I monitored the Memory Heap graph in FusionReactor. It continually increases and it is also going out of maximum heap size. https://www.screencast.com/t/lHGg1my57K

  3. In coldfusion-out.log, following repeated errors logged before server crash.

Jun 11, 2020 20:39:24 PM Error [ajp-bio-8014-exec-1186] - Event Name: onSessionStart
Jun 11, 2020 20:39:24 PM Error [ajp-bio-8014-exec-1186] - Message: Event handler exception.
Jun 11, 2020 20:39:24 PM Error [ajp-bio-8014-exec-1186] - Detail: An exception occurred while invoking an event handler method from Application.cfc. The method name is: onSessionStart.
Jun 11, 2020 20:39:24 PM Error [ajp-bio-8014-exec-1186] - Type: Database
Jun 11, 2020 20:39:25 PM Error [ajp-bio-8014-exec-1156] - Event Name: onRequestStart
Jun 11, 2020 20:39:25 PM Error [ajp-bio-8014-exec-1156] - Message: Event handler exception.

Other messages that repeating in log are.

Jun 11, 2020 20:27:56 PM Error [ajp-bio-8014-exec-462] - Event Name: 
Jun 11, 2020 20:27:56 PM Error [ajp-bio-8014-exec-462] - Message: The request has exceeded the allowable time limit Tag: CFLOOP
Jun 11, 2020 20:28:27 PM Error [ajp-bio-8014-exec-475] - Event Name: 
Jun 11, 2020 20:28:27 PM Error [ajp-bio-8014-exec-475] - Message: The request has exceeded the allowable time limit Tag: CFQUERY

Does someone have an idea how to investigate and find the root cause?

  • 3
    It says `The request has exceeded the allowable time limit Tag: CFLOOP`. Did you look for a loop that does not finish? – Bernhard Döbler Jun 12 '20 at 14:18
  • Fusion Reactor should show you long running queries and long running requests. You can use that to find whatever's running that long loop. Odds are you have some query returning a huge record set, then have code looping over those records calling more queries. FR records all queries that have completed, but check Requests > Activity and scroll to the bottom to see any long-running requests that haven't completed. – Adrian J. Moreno Jun 12 '20 at 14:46
  • @BernhardDöbler I am unable to find the place where CFLOOP is causing issue. How I can reach there? I really appreciate your help – Techleadz Solution Jun 12 '20 at 14:47
  • `The request has exceeded the allowable time limit Tag: CFLOOP` this just means that the timeout occured while processing cfloop code. There could be some code before the loop which took a long time. – rrk Jun 12 '20 at 14:56
  • 1
    It sounds like you may have one or more memory leaks in one or more of those sites - but if it was a memory leak I would expect you to be seeing `GC overhead limit exceeded` OR `Java heap space` errors at some point - have you checked to confirm that you aren't seeing a large surge in traffic, eg from an attack? Those request timeouts in the logs just before the server fell over, are just as likely (or in my experience more likely) to be innocent victims of the lack of resources at that time, than they are likely to be the cause of the problems. – Sev Roberts Jun 12 '20 at 14:59
  • There are a few ways to investigate. One is to generate a heap dump from the JVM and analyze that heap dump in a Java memory profiler app. I've used "YourKit" for this before, other profilers are available. Try googling `HeapDumpOnOutOfMemoryError` for info on the JVM arg to automate a head dump as it crashes. But a 20GB heap dump is likely to be very slow to analyze unless you have a box with at least that much RAM available to open it! And even if you have a powerful machine, it's a steep learning curve to understand the structures represented in memory if you've never done it before. – Sev Roberts Jun 12 '20 at 15:21
  • @SevRoberts I have not found errors like GC overhead limit exceeded OR Java heap space in log. Memory heap is gradually but continuously increasing. Is this sign of traffic attack? – Techleadz Solution Jun 12 '20 at 15:24
  • It might be a sign of an inefficient process. I would want to know why the queries are taking so long. – James A Mohler Jun 12 '20 at 15:26
  • @JamesAMohler Queries are not taking time. Some slow requests are showing in FusionReactor, but when I hit that page or run its query, it process in seconds. – Techleadz Solution Jun 12 '20 at 15:31
  • Another option is to get a CF dev who is already experienced in performance troubleshooting and optimisation, to examine your code. There are some common causes of memory leaks that are easier for the trained eye to spot. Eg incorrectly scoped variables leaking into application or session scopes and therefore not being garbage collected; a sloppy condition causing infinite recursion; storing large objects in the session whilst having a long session timeout and many sessions. But such things will be almost impossible to diagnose over Stackoverflow comments without seeing the source. – Sev Roberts Jun 12 '20 at 15:34
  • Can you confirm whether the `The request has exceeded the allowable time limit` messages are only seen close to the time when the server appears to be running out of memory, or are those messages seen regularly even when there is plenty of memory available after a restart? – Sev Roberts Jun 12 '20 at 15:39
  • @SevRoberts This message is only appearing at the time when server is running out of memory. – Techleadz Solution Jun 12 '20 at 15:44
  • 1
    I'm with Sev here. Don't focus on the request timeouts as they are most likely related to the "stop the world" garbage collection once the heap is at maximum. I would suggest you to take a dump of the heap and then analyze it with [Eclipse Memory Analyzer](https://www.eclipse.org/mat/). This free tool comes with a preset for memory leak inspection, which is extremly helpful. As for the heap dump: Don't take the dump if your server is already on its knees. Since the memory increases gradually, you will already be able to find suspects with a much smaller sample. So take one at like 8 GB and go. – Alex Jun 12 '20 at 18:17
  • How quickly and consistently does it fail? Most of my issues in the past were due to very long running loops over many queries. Something to keep an eye out for is that CF builds up info about each query and its parameters and doesn't release it until end of request, so tens of thousands of queries in one requests starts to be an issue. – Dan Roberts Jun 12 '20 at 18:22
  • CF can garbage collect at the end of the request and at the end of each function call. So the issue Dan describes happens when devs do many iterations of a loop that does X. If that code is refactored into a function that does X, and a loop that calls function X many times, the memory can be released at the end of each function call, instead of hanging onto all mem until the end of the request. But that issue usually gives big spikes during one request which are eventually released when it completes or falls over. The OPs graph looks more like the slow steady increase of a memory leak. – Sev Roberts Jun 12 '20 at 23:04

1 Answers1

0

I fixed this issue by changing the server settings. I disabled the CF server monitoring and adjusted the Request Limits in CF Administrator. Thanks @Sev_Roberts to share your knowledge.