We are running an app in JBoss EAP on RHEL 6.7 and are experiencing significantly different performance across the cluster. Of the 8 VMs, most will respond to requests in ~200ms but one or two will have response times of 2 or 4 seconds.
Investigating the issue, we observed from vmstat that the slower servers report hundreds of thousands of system interrupts every 5 seconds, compared to a few thousand on the fast servers. Looking at /proc/interrupts, we saw that the interrupts were TLB shootdowns: 100k to 200k of them every few seconds.
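For anyone who wants to reproduce the measurement, here is a minimal sketch of how the per-CPU shootdown rate can be sampled. It assumes the x86 Linux layout of /proc/interrupts, where shootdowns appear on a row labelled `TLB:` with one counter per CPU. The function names (`parse_tlb_counts`, `tlb_deltas`, `sample`) are made up for illustration and are not part of any tool we actually ran.

```python
import time


def parse_tlb_counts(text):
    """Return the per-CPU TLB shootdown counters from /proc/interrupts content.

    Assumes a row whose first field is "TLB:" followed by one integer per CPU
    and then a free-text description. Returns [] if no such row exists.
    """
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "TLB:":
            counts = []
            for token in fields[1:]:
                if not token.isdigit():
                    break  # reached the trailing description text
                counts.append(int(token))
            return counts
    return []


def tlb_deltas(before, after):
    """Per-CPU difference between two counter snapshots."""
    return [b - a for a, b in zip(before, after)]


def sample(interval=5.0, path="/proc/interrupts"):
    """Read the counters twice, `interval` seconds apart, and return the deltas."""
    with open(path) as f:
        before = parse_tlb_counts(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_tlb_counts(f.read())
    return tlb_deltas(before, after)
```

Calling `sample()` on a slow node and on a fast node gives one number per CPU for the interval, which makes it easier to see whether the shootdowns are spread across all CPUs or concentrated on a few.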
I've done some reading to understand what these are (I like this description best), but I don't know where to look next. Why are the TLB shootdown interrupts being issued?