Regarding the speed of incrementing a (custom) performance counter: I understand performance counters to be lock-free and built on processor primitives. I suspect this means an increment can execute in the space of a few dozen CPU cycles, which would make it so fast that it is virtually impossible to benchmark. Is that correct?
Regarding the memory consumption of creating a custom performance counter: a colleague told me that each counter requires about 128 KB or more (taken from global shared memory, or from separate shared memory). I find that number hard to believe; it makes very little sense to me. I could perhaps believe 2 KB to 8 KB. Does anyone have more accurate information on this?
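To make the first question concrete, this is roughly the kind of measurement I have in mind: time a large number of lock-free increments and divide by the count, so that the per-increment cost is amortized rather than timed individually. It is only a sketch in C++. It assumes a plain `std::atomic` integer is a fair stand-in for a counter slot, which may well not hold for a real performance counter living in shared memory, and the names `IncrementTiming` and `time_atomic_increments` are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Result of one timing run (hypothetical type, for illustration only).
struct IncrementTiming {
    std::int64_t final_value;  // counter value after the loop
    double ns_per_increment;   // average wall-clock cost of one increment
};

// Times `iterations` atomic increments on a local counter.
// Assumption: std::atomic stands in for the counter slot; this does NOT
// touch or model the operating system's performance counter machinery.
IncrementTiming time_atomic_increments(std::int64_t iterations) {
    assert(iterations > 0);  // avoid dividing by zero below

    std::atomic<std::int64_t> counter{0};

    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iterations; ++i) {
        // Lock-free read-modify-write; relaxed ordering is enough for a count.
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    const auto stop = std::chrono::steady_clock::now();

    // Total elapsed time in nanoseconds, then the amortized per-increment cost.
    const double total_ns =
        std::chrono::duration<double, std::nano>(stop - start).count();
    return {counter.load(), total_ns / static_cast<double>(iterations)};
}
```

Calling `time_atomic_increments(10000000)` and reading `ns_per_increment` would give an uncontended, single-threaded estimate only. Contention from other threads or processes, and whatever extra work the real counter API does around the increment, are not captured here.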