
I'm using Microsoft's Performance Monitor to dump logs of RAM, CPU, network, and disk usage from multiple servers. I'd like to get a single metric that captures the state of a given variable to a good extent. For instance, disk usage is pretty stable, so if I take a single reading that says I have 50% remaining disk space, that reading will give me an accurate measure for the day. (The servers aren't doing heavy IO writing.)

However, the tricky part here is monitoring CPU and network usage. The logs currently dump the % CPU usage every ten seconds. If I take a straight average of the numbers, it may not represent reality, as % CPU will be much lower during the night than during the day. (We host websites that sell appliance items.) I'd like to get an average over a span during peak hours (about 5 hours in the day) and present a daily peak-hour metric. Of course, there are most likely some readings that will come in as overly spiked (if multiple users pinged the server at once) or of no use (a momentary idle state). Is there a standard distribution/test industries use in these situations?
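For concreteness, here is a rough sketch of the peak-hour average I have in mind, assuming the Performance Monitor log is exported as a CSV; the column names, timestamp format, and 10:00–15:00 window below are just placeholders, not our actual setup:

```python
# Rough sketch: average the 10-second CPU samples that fall inside a fixed
# peak window. Column names, timestamp format, and the window are placeholders.
import csv
from datetime import datetime
from statistics import mean

PEAK_START_HOUR, PEAK_END_HOUR = 10, 15  # assumed 5-hour peak window

def peak_hour_average(csv_path, value_column="% Processor Time"):
    samples = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            ts = datetime.strptime(row["Timestamp"], "%m/%d/%Y %H:%M:%S")
            if PEAK_START_HOUR <= ts.hour < PEAK_END_HOUR:
                samples.append(float(row[value_column]))
    return mean(samples) if samples else None

print(peak_hour_average("cpu_log.csv"))
```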

Sal

1 Answer


I don't think there's a simple answer for you on this. Taking the 90th or 95th percentile of sampled data is a typical technique used to remove "spikes". I don't know that just removing "spikes" from your data, though, is going to really be useful. The raw performance data doesn't actually tell you how your application is responding.
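To illustrate what trimming by percentile does, here is a rough nearest-rank sketch; the readings are made up purely to show a single spike being dropped:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the value below which ~pct% of samples fall."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical 10-second CPU readings over a peak window: mostly ~15%, one spike.
readings = [15.0] * 19 + [98.0]
print(percentile(readings, 95))   # 15.0 -- the single spike is trimmed
print(max(readings))              # 98.0 -- the raw maximum keeps the spike
```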

Personally, I'm more concerned with the actual response time of the application being within the stated SLA, rather than the raw performance metrics of the server computer. I prefer to measure the actual application performance, whenever possible, and then correlate application response issues to raw data, rather than trying to use raw data as my only metric. Raw data is great for root cause analysis but application performance is typically influenced in a non-linear manner by raw performance metrics. Nothing tells you that application performance is lagging better than measuring application performance.

Embracing designed-in profiling (like Stack Exchange did with MiniProfiler) is important if you want to correlate raw performance metrics with application performance adequately. A wget script that periodically times an API call against the application might be a good first step, but having visibility into profiling data coming from inside the application is going to help your developers and sysadmins work together to match raw performance data to actual application performance.
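As a placeholder for that first step, something along these lines would do; the URL and the 60-second interval are just examples, and this sketch uses Python's standard library rather than wget:

```python
# Sketch: periodically time a request against the application and log the
# response time. The URL and interval are placeholders, not a real endpoint.
import time
import urllib.request

URL = "https://example.com/api/health"   # hypothetical endpoint
INTERVAL_SECONDS = 60

while True:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            resp.read()
            status = resp.status
    except Exception as exc:
        status = f"error: {exc}"
    elapsed_ms = (time.monotonic() - start) * 1000
    print(f"{time.strftime('%Y-%m-%d %H:%M:%S')}  status={status}  {elapsed_ms:.0f} ms")
    time.sleep(INTERVAL_SECONDS)
```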

Evan Anderson
  • You're absolutely right that analysis of the raw data alone won't sufficiently gauge the performance of a program. In addition to the analysis of raw data, we do performance tests across our websites at different intervals, but that was a bit more straightforward to implement, so I didn't mention it. I'm in favor of doing some straightforward 95th percentile analysis, but how do industries generally break down time zones? Is this taken into consideration? Is it as simple as designating different time blocks? – Sal Nov 13 '12 at 22:21
  • I'm a little unclear with what you're asking. CPU utilization, for example, is CPU utilization irrespective of the time that it occurs. It's unclear to me why you'd have different analysis techniques for data sampled at different times. I'd think that would, in fact, defeat the purpose of creating a simple metric because the metric's meaning would change depending on the time that the metric is representing. I dunno-- I guess I'm not grokking what you're asking for. Utilization is what it is, regardless of when it happens. – Evan Anderson Nov 13 '12 at 22:25