
TL;DR: I am interested in finding out whether there is a reason to choose an additive approach to troubleshooting over a subtractive one, or vice versa, when a problem has many variables.

Problem overview:

I am working with a group of people troubleshooting an intermittent but high-impact issue in a staging system. The issue is preventing us from going live with a new configuration.

We have a Citrix XenApp application server running on a virtualized infrastructure serving applications to clients running at remote sites over a WAN. There are several encryption/security/firewall devices at the head end of the network between the WAN and the physical server(s) hosting the virtualized servers.

So basically we have a problem with many variables. So far we have taken a subtractive approach: removing one thing from the system at a time, and ruling that thing in as a suspect if the problem goes away. We are not having much luck with this. I was thinking of suggesting an additive approach, where we start out with the bare minimum of system components that the app will work under and then add things back in various combinations.
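To make the additive idea concrete, here is a minimal sketch of what I have in mind. It assumes we can script a check that tells us whether the fault shows up for a given set of active components. The function names, the component names, and the stand-in check `fake_reproduces` are all hypothetical and are not part of our real setup.

```python
from itertools import combinations

def smallest_failing_sets(baseline, optional, reproduces, max_size=2):
    """Additive search: start from the baseline and add optional components
    in combinations of increasing size until the fault is reproduced.

    Returns every failing combination of the smallest size that fails,
    or an empty list if nothing up to max_size reproduces the fault.
    """
    for size in range(1, max_size + 1):
        hits = [
            set(combo)
            for combo in combinations(optional, size)
            if reproduces(set(baseline) | set(combo))
        ]
        if hits:
            return hits
    return []

# Hypothetical stand-in for a real test run: pretend the fault only
# appears when the firewall and the encryption device are both in the path.
def fake_reproduces(active):
    return {"firewall", "encryption"} <= active

found = smallest_failing_sets(
    baseline=["xenapp", "hypervisor"],
    optional=["wan", "firewall", "encryption"],
    reproduces=fake_reproduces,
)
print(found)
```

Because our issue is intermittent, a real `reproduces` check would probably have to run each combination several times (or for a long soak period) before declaring it clean, and the number of combinations grows quickly as `max_size` increases.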

Based on your experience, are there reasons to prefer additive over subtractive troubleshooting or vice-versa?

Shane Wealti

1 Answer


I actually prefer an iterative debugging approach: you analyze the symptoms, log files, network captures, and a comparison of best practices and/or designed intent vs the built environment, to try to find the actual problem. You'll find, in many cases, that a reported performance problem doesn't have just one cause. There may be more than one bottleneck, or the problem may be occurring because of the interaction of multiple components.
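As a rough illustration of the "designed intent vs the built environment" comparison, here is a minimal sketch that lists every setting where the two disagree. It assumes both sides can be exported as simple key/value pairs; the function name and all setting names and values below are hypothetical.

```python
def config_drift(intended, actual):
    """Map each differing setting to an (intended, actual) pair.

    Settings present on only one side are reported with "<missing>"
    on the other side.
    """
    missing = "<missing>"
    drift = {}
    for key in sorted(set(intended) | set(actual)):
        want = intended.get(key, missing)
        have = actual.get(key, missing)
        if want != have:
            drift[key] = (want, have)
    return drift

# Hypothetical values for illustration only.
designed = {"mtu": 1500, "session_timeout_s": 1200, "encryption": "on"}
built = {"mtu": 1400, "session_timeout_s": 1200, "qos_profile": "default"}

for setting, (want, have) in config_drift(designed, built).items():
    print(f"{setting}: intended={want!r} actual={have!r}")
```

Each reported difference is a lead to check against the symptoms and logs, not proof of a cause.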

I find this works better than blindly removing things, or building a brand new environment just for troubleshooting. The latter approach is excellent for planning and testing in advance, though: we would call that a "dev", "test", or "staging" environment, depending on how it fits into the overall architecture. It does require that you actually plan and test IN ADVANCE, however, and not every organization has the resources to do that properly.

mfinni
  • I appreciate the answer. At this point I have a feeling that it is a problem caused by the interaction of several components, and adding/removing one component at a time may or may not help us find the issue. – Shane Wealti Dec 01 '11 at 15:36