
There are many threads on whether you should mess with the page file or not. This scenario describes a specific circumstance from my real-world production environment. The conclusion I've reached in order to fix my problem is to disable the page file.

I'm running a series of guest VMs, all of which run Server 2003 Enterprise Edition (inorite?). For physical hosts, I'm running HP DL380 G7s loaded with VMware ESXi 5.0 (managed via vCenter). For storage I have an HP P2000 G3 SAS array loaded with sixteen 300 GB 10k SAS drives in RAID 6; call it LUN01. These virtual servers make up our Wonderware environment: a single SQL Server and Historian, two application servers, and two terminal servers.

The work this stack performs is mission critical and determines whether the facility can serve its function: when the servers go down, the business goes down. Recently, several disk failures in the P2000 array caused me to rethink the architecture from the ground up. Rebuilding disks in the array hurt performance so badly that the Wonderware application became completely unresponsive, since these VMs all run I/O-intensive applications and RAID reconstruction places such a heavy additional load on the array.

I've determined that the bottleneck during disk reconstruction comes from the application servers' disk writes, seemingly because the application is using the system page file instead of RAM. The amount of network I/O therefore becomes directly linked to disk I/O, so the severe performance hit on the disks during reconstruction directly impacts the app servers' throughput. It makes very little sense to me why it's designed this way, but it would explain why a server that stores nothing locally (an app server) sustains a 10 Mbps disk write rate (per the VMware performance statistics for the app server VM).
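For what it's worth, here is a minimal sketch of how one might sanity-check that correlation inside a guest, assuming a Python interpreter with the psutil package were available (which it would not be on a stock Server 2003 box, so treat this as a test-box illustration rather than something from my environment): sample the disk write and network receive counters and see whether they move together.

```python
# Sketch only: sample guest-level counters to see whether disk write
# throughput tracks network receive throughput. Counter names are psutil's,
# not VMware's.
import time
import psutil

INTERVAL = 5  # seconds between samples

def snapshot():
    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return disk.write_bytes, net.bytes_recv

prev_disk, prev_net = snapshot()
while True:
    time.sleep(INTERVAL)
    cur_disk, cur_net = snapshot()
    disk_mbps = (cur_disk - prev_disk) * 8 / INTERVAL / 1e6
    net_mbps = (cur_net - prev_net) * 8 / INTERVAL / 1e6
    print(f"disk write: {disk_mbps:6.2f} Mbps   net recv: {net_mbps:6.2f} Mbps")
    prev_disk, prev_net = cur_disk, cur_net
```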

So, given these circumstances, what I'm thinking is that I want to disable the page file in the guest OS (Server 2003 EE) to prevent the deployed Wonderware app engine from creating such high disk I/O demands, and as a result lessen the impact of future disk reconstructions in the RAID.

  • What do you think?
  • Does this justify disabling the page file?
  • Am I overlooking another solution to minimize the performance impact of RAID reconstruction?
Lucretius
  • If the application "primarily operate[s] using the system page file instead of RAM" (and how did you determine that?), then how do you think the system will work if you disable the pagefile? I'm guessing poorly. – mfinni May 02 '13 at 21:36
  • It should have been phrased "seems to operate primarily using the system page file"; my digging has mostly been with Process Monitor and VMware performance statistics. – Lucretius May 02 '13 at 22:14

2 Answers


I don't know Wonderware, but if you're using the page file then you're out of memory and everything is limping along on virtual memory. Disabling the page file won't necessarily fix that; it could well just make everything run out of memory and crash instead.

1) Buy more RAM for the hosts, or configure more RAM in the guests.

2) Or configure the application to use less memory.

3) Or more usefully, run something like Sysinternals' Process Monitor to see what's actually being written to disk in the guests and confirm your suspicions (see the sketch after this list).

4) If you can run a similarly configured test server on Windows Server 2008 R2, Resource Monitor (launched from Task Manager) shows disk access in much more detail than anything built into 2003 (process, file, response time) without the huge log file of Process Monitor.
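As a rough illustration of point 3, here is a minimal per-process write-sampling sketch, assuming a Python interpreter with psutil were available on a comparable test guest (Process Monitor remains the proper tool on Server 2003; nothing below is specific to Wonderware):

```python
# Sketch only: sample per-process write bytes twice and report which processes
# wrote the most in between. Process.io_counters() exposes the same counters
# Task Manager / Process Explorer show on Windows.
import time
import psutil

INTERVAL = 10  # seconds between the two samples

def write_bytes_by_pid():
    counters = {}
    for proc in psutil.process_iter(["name"]):
        try:
            counters[proc.pid] = (proc.info["name"], proc.io_counters().write_bytes)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return counters

before = write_bytes_by_pid()
time.sleep(INTERVAL)
after = write_bytes_by_pid()

deltas = []
for pid, (name, written) in after.items():
    if pid in before:
        deltas.append((written - before[pid][1], name, pid))

for delta, name, pid in sorted(deltas, reverse=True)[:10]:
    print(f"{name:30s} (pid {pid:6d})  wrote {delta / 1e6:8.2f} MB in {INTERVAL}s")
```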

> It makes very little sense why it's designed this way, but it perfectly explains why a server that stores nothing locally (an app server) would sustain 10 Mbps disk write rate (VMware performance statistics for the app server VM).

Application logfiles? Temporary files such as report or rendering templates and their output? Transaction logs for everything passing through the application? State synchronisation between the two application servers? Rogue antivirus scanner? Corrupt filesystem filter driver? Malware?

TessellatingHeckler

I was able to figure this out after a lot of phone time with Wonderware. Basically, inside each App Engine deployed to the Galaxy there is a configurable parameter called the "Checkpoint Period."

The Checkpoint Period is the interval at which ArchestrA writes the current state of the application (values, variables, etc.) to disk. It does this so that in the event of a server reboot or system crash, the application can resume from its most recent state without data loss. If your application is designed to store values in Galaxy objects themselves, you have to weigh how much data loss you can tolerate. If your application is designed merely to process data and offloads the job of storing information to a SQL server, or leaves the values in a tag database, then you don't risk losing any data by increasing this value.

ArchestrA currently has about 9000 tags. What this means is that in any given second, 9000 values could have changed, resulting in 9000 values to write to disk every second. Most of these values overwrite values stored the previous second. Systems designed to monitor analogue inputs will always have a massive number of changes every second; as an admin you have to decide how much of that is noise and how much needs to be captured for trending/tracking, etc.

Increasing the default value of 0 ms (which the system interprets as "no default specified, use 1 second") to 5000 ms dropped my disk activity from over 300 IOPS to less than 25 IOPS. We actually set each App Engine to a distinct prime number near 5000 ms so that the engines' checkpoint writes would rarely hit the disks at the same moment (a rough sketch of that reasoning follows below). This is particularly important when virtualizing control systems: performance and scalability become an issue when you have many servers running on the same array.
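To illustrate why distinct prime-valued periods help, here is a small sketch (an illustration added here, not anything Wonderware provides) that counts how often the checkpoint timers of several engines fire within the same one-second window, comparing a uniform 5000 ms period against staggered primes near 5000 ms:

```python
# Illustration only: compare how often multiple App Engine checkpoint timers
# coincide when every engine uses 5000 ms versus distinct prime periods.
# Periods are in milliseconds; the simulation covers one hour.
from collections import Counter

def checkpoint_times(period_ms, horizon_ms):
    """Timer fire times (ms) for one engine over the horizon."""
    return range(period_ms, horizon_ms, period_ms)

def coinciding_windows(periods, horizon_ms, window_ms=1000):
    """Count windows in which two or more engines checkpoint together."""
    hits = Counter(t // window_ms
                   for period in periods
                   for t in checkpoint_times(period, horizon_ms))
    return sum(1 for n in hits.values() if n > 1)

HORIZON = 60 * 60 * 1000  # one hour, in milliseconds
uniform = [5000] * 5                        # five engines, identical periods
staggered = [4993, 4999, 5003, 5009, 5011]  # five engines, distinct primes

print("windows with overlapping checkpoints, uniform:  ",
      coinciding_windows(uniform, HORIZON))
print("windows with overlapping checkpoints, staggered:",
      coinciding_windows(staggered, HORIZON))
```

With identical periods every checkpoint lands on the same tick, so every burst is five engines wide; with distinct primes the bursts rarely line up, which spreads the I/O the array sees.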

Lucretius