We are currently testing an upgrade from CF11 to CF2018 for my company's intranet. To give you an idea of how long this site has been running, our first version of CF was 3.1! It is still using application.cfm, and there is code from 1998, when I started writing this thing. Yes, 21 years -- I'm astonished, too. It is a hodgepodge of all kinds of older frameworks, too, including Fusebox.

Anyway, we're running a Win 2012 VM connected to a SQL 2016 farm. Everything looked OK initially, but in the week I've been testing, the server has slowed down once (a page took more than 5 seconds to run, something that usually takes 100ms, no DB involvement), and another time, the server came to a grinding halt. The only way I could restart the CF App service was by connecting to the server from another server via Services, because doing it via Remote Desktop was so slow.

Now keep in mind -- it's just me testing. This is a site that doesn't have a ton of users, but still, having 5 concurrent connections is normal and there are upwards of 200-400 users hitting this thing every day.

I have FusionReactor running on this thing now, so the next time a lockup happens, I will be able to take a closer look, but what do you think is the best way I can test this? Our site is mostly transactional, with users filling out forms to put internal orders through. We also connect to XML web services and REST services, and we provide REST services, too. Obviously there's no way to completely replicate a production server's requests onto a test server, but I need to do more thorough testing. Any advice would be hugely appreciated.

Sung
  • Real quick check is to look at the JVM settings. The default memory settings on ACF are, IMHO, too low. – James A Mohler Aug 06 '19 at 01:49
  • Are you testing the CF2018 site on a different server to your CF11 site? Do you have both sites running at the same time? If so, can you compare the settings summaries from ColdFusion Administrator? – Pete Aug 06 '19 at 05:59
  • Also along those lines, is there anything else running on the CF server during those times? Like backups, indexing or some other software. – Miguel-F Aug 06 '19 at 12:19
  • @JamesAMohler I changed the JVM to 2048 min and 2048 max, which is the way I have it on prod. Haven't seen another lockup, but I still had that weird 5-second request for a page that should've taken 100ms. – Sung Aug 06 '19 at 13:26
  • @Pete I upgraded all of our lower environments to CF2018, so CF11 is still there; I'm following the upgrade guidelines. Prod is still running CF11, so I can see all my settings. – Sung Aug 06 '19 at 13:28
  • @Miguel-F Nothing else is running during that time. When the server locked up, it was pretty obvious what was causing it. As soon as CF app service was stopped, the server went back to normal. – Sung Aug 06 '19 at 13:29
  • I appreciate all of your responses! Any ideas on how best to replicate real production conditions on a test system? That's what I need to do here. – Sung Aug 06 '19 at 13:31
  • You can [simulate concurrent requests and load](https://github.com/tsenart/vegeta). Monitor your server and wait for suspicious activity and performance peaks. FusionReactor is pretty good for that. – Alex Aug 06 '19 at 19:57
  • Thanks, @Alex -- I'll give that load tester a shot! – Sung Aug 07 '19 at 20:17
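
For a rough idea of what simulating concurrent requests involves, here is a minimal Python sketch. It is not the vegeta tool linked above, just an illustration of the same idea. The names `fetch`, `run_load`, and `slow_requests`, the default counts, and the threshold are all assumptions made for this example, and it should only ever be pointed at a test server.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url, timeout=30):
    """Request one URL and return (status_code, elapsed_seconds)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
        status = resp.status
    return status, time.perf_counter() - start


def run_load(url, total=50, concurrency=5, fetch_fn=fetch):
    """Issue `total` requests with at most `concurrency` in flight.

    Returns a list of (status, elapsed) tuples. A failed request is
    recorded as (None, elapsed) instead of aborting the whole run.
    """
    def one(_):
        start = time.perf_counter()
        try:
            return fetch_fn(url)
        except Exception:
            return None, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(one, range(total)))


def slow_requests(results, threshold=1.0):
    """Return the results that failed or took longer than `threshold` seconds."""
    return [r for r in results if r[0] is None or r[1] > threshold]
```

Calling `slow_requests(run_load(url, total=200, concurrency=5))` against a test page would surface outliers like the 5-second request described in the question, which can then be matched against what FusionReactor recorded at the same time.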

1 Answer

I realize your focus for now is trying to recreate the problem on test. That may not be as easy as hoped. Instead, you should be able to understand and resolve it in production. FusionReactor can help, but the answer may well be in the cf logs.

You don't mention assessing the logs at the time of the hangup. See especially the coldfusion-error log, for OutOfMemory conditions.
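
As a rough sketch of that log check, the Python below scans lines for the JVM's `OutOfMemoryError` text and reports what follows it (for example `Java heap space` or `Metaspace`). The function name `find_oom_lines` is made up for this example, and where the log file lives depends on the install, so the path is left to the caller.

```python
import re

# Matches the usual JVM error text, e.g. "java.lang.OutOfMemoryError: Metaspace".
OOM_PATTERN = re.compile(r"OutOfMemoryError(?::\s*(?P<detail>.+))?")


def find_oom_lines(lines):
    """Yield (line_number, detail) for every line mentioning OutOfMemoryError."""
    for number, line in enumerate(lines, start=1):
        match = OOM_PATTERN.search(line)
        if match:
            # detail is empty when the error text has no ": reason" suffix
            yield number, (match.group("detail") or "").strip()
```

Passing an open file handle for the coldfusion-error log works, since a file iterates line by line. The detail text is what tells a heap problem apart from a metaspace one.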

You mention raising the heap, but the problem may be with the metaspace instead. If so, consider simply removing the MaxMetaspaceSize setting in the JVM args. That may well be the sole cause of such new and unexpected outages.
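
To make the metaspace suggestion concrete: the HotSpot flag in question is `-XX:MaxMetaspaceSize=...`. The sketch below assumes it appears in the `java.args` value of CF's jvm.config and simply strips it from that string. The helper name `drop_max_metaspace` and the sample values in the comments are assumptions for illustration, not settings taken from this server.

```python
import re

# HotSpot flag that caps metaspace; 192m here is only an example value.
METASPACE_FLAG = re.compile(r"\s*-XX:MaxMetaspaceSize=\S+")


def drop_max_metaspace(java_args):
    """Return the java.args value with any -XX:MaxMetaspaceSize=... removed."""
    return METASPACE_FLAG.sub("", java_args).strip()
```

In practice the same change can be made by hand (after backing up jvm.config), and CF needs a restart before the new args take effect.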

Or if it's not, and there's nothing in the logs at the time, THEN do consider FR. Does IT show anything happening at the time?

If not, then consider a need to tune the cf/web server connector. I assume you're using IIS. How many sites do you have? And how many connectors (folders in the cf config/wsconfig folder)? What are the settings in their workers.properties file? Are they optimized for the number of sites using that connector?
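
One way to answer those connector questions quickly is to list each folder under config/wsconfig and dump its workers.properties settings. The Python sketch below assumes each connector folder holds a plain `key=value` workers.properties file; the function names are invented for this example, and the wsconfig path has to be supplied for the install in question.

```python
from pathlib import Path


def read_properties(path):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for raw in Path(path).read_text(errors="replace").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def connector_settings(wsconfig_dir):
    """Map each connector folder name to its workers.properties settings."""
    result = {}
    for folder in sorted(Path(wsconfig_dir).iterdir()):
        candidate = folder / "workers.properties"
        if folder.is_dir() and candidate.is_file():
            result[folder.name] = read_properties(candidate)
    return result
```

The number of keys in the returned dict is the connector count, and the per-folder settings can then be compared against the number of sites sharing each connector.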

Also, have you updated cf2018? Are there any errors in the update error log? Did you update the web server connector also?

Are you running the cf2018 PMT (Performance Monitoring Toolset)? Have you updated it?

There could be still more to consider, but let's see how it goes with those. I have blog posts on these and many more topics that would elaborate on things, both at my site (carehart.org) and the Adobe cf portal (coldfusion.adobe.com).

But let's hear if any of this gets you going.

Pete
charlie arehart
  • Charlie, thank you so much for this -- I really appreciate it. I have FR installed on my test server now so I'll definitely use it if the issue happens again. When the crash happened, I believe I was in the middle of updating a .cfc, and the session had timed out, so when the page request happened, it pulled from the new .cfc. Why this would have such an effect, I have no idea, but it's at least a place for me to attempt to replicate the issue. – Sung Aug 07 '19 at 20:13
  • As far as the idea of resolving the issue in production...that just won't fly. What if this happens and the site goes down over and over again? So I'll do my best to replicate the issue in my lower environments. This is just a single website, that's it. We have 8GB of memory and I have JVM using 2048MB of it. I did update CF2018 to the latest update (2018.0.04.314546, ColdFusion 2018 Update 4). I'll continue to update this thread as my testing proceeds. – Sung Aug 07 '19 at 20:16
  • Sung, you refer to updating a CFC. That would NOT hang the server. Maybe you mean that it hung up a request. Moving on, you then say "resolving the issue in production"..."just won't fly". Can you elaborate? Do you mean you feel you can't TRY TO RESOLVE IT in prod? Sure you can. Do you mean you don't run FR in production? To be clear, you can: it has no real impact. You MAY be able to recreate the problem in prod, but often you cannot. Please heed what I wrote. Your answer may already be there, in the logs, or the next steps I listed. You do not indicate if you tried any of them. I use them daily. – charlie arehart Aug 08 '19 at 21:12
  • Hi Charlie -- yeah, as the saying goes, correlation does not equal causation. All I know is that I updated the CFC, refreshed the page, then realized my session had expired...why that would cause a hang, I have no idea. It probably didn't, but it's a test system and I'm the only one on it, so it couldn't have been anyone else. The reason why I can't put CF2018 in production is that if there's instability during testing, I just can't hazard it. What if there's something in the code that's causing this, and I don't know what it is? We have many transactions happening throughout the day... – Sung Aug 10 '19 at 21:55
  • ...and if the server goes down, we'd be in big trouble. I do run FR on prod -- I've been running FR for a long, long time. It's a great tool, but if there's something that causes a hang, it's not easy to troubleshoot while requests are queuing like mad (like 200 requests queuing within a few minutes). I've been there, and repeatedly having to restart CF service is no way to live! I really appreciate your help here -- I'll keep testing and see if I can repeat the issue. – Sung Aug 10 '19 at 21:58
  • How did things resolve? To be clear, when I asked why trying to resolve the problem in prod "would not fly", I thought you were saying it WAS happening in prod. I see now that your original request mentioned test. Sorry I missed that. I was simply saying (and still would) that any hangup or grinding halt of CF (or any java server) can be diagnosed (in test or in prod) using the steps I listed. If it is still happening, please try them, and let us know how it goes. – charlie arehart Aug 16 '19 at 19:25
  • Hi Charlie -- we are continuing to test CF2018 after upgrading our CF11 server to Update 19, which gives us more time. I'm still trying to replicate the error in our lower environments... – Sung Aug 17 '19 at 20:03
  • Ok, but if it IS happening in prod, and you have FR there, you should be able to solve it with FR. I know your current stance would deny that. You feel that when the stuff hits the fan, you can't take time then to diagnose the problem. I get it. That's not what you need to do. See a webinar I did on post-crash troubleshooting, at fusion-reactor.com/webinars. Or again, I can help remotely with satisfaction guaranteed: you won't pay for time that's not valuable. More at carehart.org/consulting. – charlie arehart Aug 18 '19 at 21:33
  • Thanks, Charlie. I've been using CF for quite some time (since v3.1) and outside of a couple of issues (the changeover from C to Java was the toughest), it's been rock solid. Right now, because we have OS upgrades coming up, I think the tack I'll take is to run parallel between the old system and the new system to make sure all will be well. I really appreciate all of your great suggestions here! – Sung Aug 20 '19 at 00:59