Going down one step from the load balancer, we get to the individual servers.
Whatever the application, it needs to economize on the number of threads it uses so that the OS scheduler spends as few CPU cycles as possible determining the highest-priority thread to run. One mechanism for doing this is I/O Completion Ports, which are found in Windows and some flavors of Unix. Search for IOCP here on SO.
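To make the idea concrete, here is a minimal sketch (not production code, and assuming a Windows build environment) of the IOCP pattern: a small pool of worker threads all block on a single completion port, so a thread is only woken when an I/O operation has actually finished. The worker count and the processing stub are placeholders.

```c
#include <windows.h>

#define WORKER_COUNT 4   /* roughly one per core; an assumption for the sketch */

static DWORD WINAPI worker(LPVOID param)
{
    HANDLE port = (HANDLE)param;
    DWORD bytes;
    ULONG_PTR key;
    LPOVERLAPPED ov;

    for (;;) {
        /* Blocks without burning CPU until some associated I/O completes. */
        if (!GetQueuedCompletionStatus(port, &bytes, &key, &ov, INFINITE))
            continue;
        /* ... process the completed I/O identified by key/ov here ... */
    }
    return 0;
}

int main(void)
{
    /* One completion port shared by all workers. */
    HANDLE port = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 0);

    for (int i = 0; i < WORKER_COUNT; ++i)
        CreateThread(NULL, 0, worker, port, 0, NULL);

    /* Sockets or files opened later are associated with the same port via
       CreateIoCompletionPort(handle, port, key, 0) before issuing
       overlapped reads/writes. */
    Sleep(INFINITE);
    return 0;
}
```

The point is that the number of runnable threads stays close to the number of cores regardless of how many connections are in flight, which is exactly the scheduler economy described above.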
Economizing on accesses to shared resources - communications, databases, buses, RAM and the L3 cache, to name a few - and trying to fit a thread and its data inside non-shared resources - the L2 and L1 caches - results in an application that scales better than one where these accesses are ignored. There are many examples of multi-threaded applications running slower than single-threaded ones.
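One classic illustration of this (a sketch under the assumption of a 64-byte cache line and POSIX threads; names are made up) is false sharing: each thread updates only its own counter, but if the counters sit on the same cache line, that line bounces between cores and the "parallel" version crawls. Padding each counter onto its own line keeps the hot data inside each core's L1/L2.

```c
#include <pthread.h>
#include <stdint.h>

#define CACHE_LINE 64        /* typical x86 line size; an assumption */
#define NTHREADS   4

struct padded_counter {
    uint64_t value;
    char pad[CACHE_LINE - sizeof(uint64_t)];  /* one counter per cache line */
};

static struct padded_counter counters[NTHREADS];

static void *worker(void *arg)
{
    struct padded_counter *c = arg;
    for (long i = 0; i < 100000000L; ++i)
        c->value++;          /* stays in this core's cache; no line bouncing */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; ++i)
        pthread_create(&t[i], NULL, worker, &counters[i]);
    for (int i = 0; i < NTHREADS; ++i)
        pthread_join(t[i], NULL);
    return 0;
}
```

Remove the padding and the threads contend on the same line even though they never touch each other's data - one of the ways a multi-threaded program ends up slower than a single-threaded one.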
Determining what a SOAP- or XML-formatted request is supposed to do is very CPU-intensive - the more text, the bigger the job. If the application uses binary requests, it will have more resources left over for performing the request and spend less understanding it. Another aspect of verbose requests and responses is that they gobble up communication bandwidth. A one-megabyte response requires roughly ten megabits of bandwidth - one tenth of a 100 Mbps connection's capacity for a full second. That limits you to at best 10 such responses per second. Want one thousand? You need responses no longer than about 10 kB.
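A back-of-the-envelope sketch of the size difference: the same few fields encoded as XML versus as a packed binary record. The field names and layout are invented for illustration, not taken from any particular protocol.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#pragma pack(push, 1)
struct order_binary {            /* 14 bytes on the wire */
    uint32_t order_id;
    uint32_t customer_id;
    uint16_t quantity;
    uint32_t price_cents;
};
#pragma pack(pop)

int main(void)
{
    const char *order_xml =
        "<order><orderId>1234567</orderId>"
        "<customerId>7654321</customerId>"
        "<quantity>3</quantity>"
        "<priceCents>199900</priceCents></order>";

    struct order_binary b = { 1234567u, 7654321u, 3u, 199900u };

    printf("XML payload:    %zu bytes\n", strlen(order_xml));   /* well over 100 bytes */
    printf("binary payload: %zu bytes\n", sizeof b);            /* 14 bytes */
    return 0;
}
```

Roughly an order of magnitude per message, before you even count the cost of parsing the text - multiply that by your request rate and it decides how many responses fit through that 100 Mbps pipe each second.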
No matter how fast your application is, it will be held up whenever it has to go to another server to execute part of a request. This holds true even over fiber interconnects: a SAN is slower than physically attached storage.