I have a program that uses OpenMP to obtain a large speedup on a dual-CPU server with 32 cores in total. The input parameters I am using do not allow the CPUs to be fully loaded.
Today a couple of cores were 100% loaded by another program. When I launched my program it was terribly slow, even though the CPU load was, as usual, pretty high (~2500%). When I removed the parallel directives, I noticed some performance improvement.
Could this be due to limited memory bandwidth? How can I investigate the issue further and, if possible, improve my code?
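One way to tell the two suspects apart (memory bandwidth versus threads competing for cores that are already busy) might be a small memory-bound test run at several thread counts. The following is only a minimal sketch in C, not the actual program: `time_triad`, its parameters, and the triad kernel are hypothetical choices made for illustration.

```c
#include <stdlib.h>
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock time in seconds. Uses omp_get_wtime() when compiled with
   OpenMP, otherwise falls back to the C11 timespec_get(). */
static double wall_seconds(void) {
#ifdef _OPENMP
    return omp_get_wtime();
#else
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
#endif
}

/* Hypothetical helper: times a memory-bound triad a[i] = b[i] + 3*c[i]
   over n doubles using `threads` OpenMP threads.
   Stores one element of the result in *checksum so the work cannot be
   optimized away. Returns elapsed seconds, or -1.0 on bad input or
   allocation failure. */
double time_triad(long n, int threads, double *checksum) {
    if (n <= 0 || threads <= 0 || checksum == NULL) return -1.0;

    double *a = malloc((size_t)n * sizeof *a);
    double *b = malloc((size_t)n * sizeof *b);
    double *c = malloc((size_t)n * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return -1.0; }

    /* Initialize in parallel with the same static schedule as the timed
       loop, so each thread first touches the pages it will later use. */
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    double t0 = wall_seconds();
    #pragma omp parallel for num_threads(threads) schedule(static)
    for (long i = 0; i < n; ++i) a[i] = b[i] + 3.0 * c[i];
    double t1 = wall_seconds();

    *checksum = a[n / 2];
    free(a); free(b); free(c);
    return t1 - t0;
}
```

Built with OpenMP enabled (for example `-fopenmp` with GCC), this could be called with arrays much larger than the caches for 1, 2, 4, ... up to 32 threads, once on an idle machine and once while the other program is running. If the time stops improving after a few threads even when idle, memory bandwidth is a plausible limit; if it degrades sharply only when some cores are busy, threads waiting on each other at the end of the statically scheduled loop may be the more likely cause.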