26

I'm running a camera acquisition program that performs processing on acquired images, and I'm using simple OpenMP directives for this processing. So basically I wait for an image from the camera, and then process it.

After migrating to VC2010, I see a very strange performance hog: under VC2010 my app takes nearly 100% CPU, while it takes only 10% under VC2008.

If I benchmark only the processing code I get no difference between VC2010 and VC2008, the difference occurs when using the acquisition functions.

I have reduced the code needed to reproduce the problem to a simple loop that does the following:

  for (int i=0; i<1000; ++i)
  {
    GetImage(buffer);//wait for image
    Copy2Array(buffer, my_array);

    long long sum = 0;//do some simple OpenMP parallel loop
    #pragma omp parallel for reduction(+:sum)
    for (int j=0; j<size; ++j)
      sum += my_array[j];
  }

This loop eats 5% of CPU with 2008, and 70% with 2010.

I've done some profiling, which shows that in 2010 most of the time is spent in OpenMP's vcomp100.dll!_vcomp::PartialBarrierN::Block.

I have also done some concurrency profiling:

In 2008, the processing work is distributed over 3 worker threads, which are only very lightly active because the processing time is much shorter than the image waiting time.

The same threads appear in 2010, but they are all 100% occupied by the PartialBarrierN::Block function. As I have four cores, these three threads consume 75% of the CPU, which is roughly what I see in the CPU usage.

So it looks like there is a conflict between OpenMP and the Matrox acquisition library (proprietary). But is it a bug of VS2010 or Matrox? Is there anything I can do? Using VC++2010 is mandatory for me, so I cannot just stick with 2008.

Big thanks

STATUS UPDATE

Using the new Concurrency Runtime, as suggested by DeadMG, leads to 40% CPU. Profiling shows that the time is spent in processing, so it doesn't exhibit the bug I'm seeing with OpenMP, but performance in my case is much poorer than with OpenMP.

STATUS UPDATE 2

I have installed an evaluation version of latest Intel C++. It shows exactly the same performance problems!!

I cross-posted to MSDN forum

STATUS UPDATE 3

Tested on Windows 7 64-bit and XP 32-bit, with exactly the same results (on the same machine).

CharlesB
  • 3
    It consumes 100% of the CPU, but how long does it take? Does it run faster? – James McNellis Jan 19 '11 at 17:01
  • No, it doesn't run faster. In both cases I can process the image before a new one arrives, so if processing runs faster I won't see it in my program. My problem is more the CPU hog than processing time. – CharlesB Jan 19 '11 at 17:06
  • 1
    Using OpenMP and *not* seeing it use all the processor resources is the real problem. Getting 100% load is the expected outcome. More here: http://blogs.msdn.com/b/oldnewthing/archive/2010/12/03/10097861.aspx – Hans Passant Jan 19 '11 at 18:47
  • Sorry Hans I don't get it? I have 100% load while in the previous situation less than 5% load. Thus it is excessive and abnormal. In my real program I need the CPU to do other processing, so it slows the application. – CharlesB Jan 19 '11 at 19:15
  • 3
    @Hans Passant: that logic applies only to fixed-load programs. This is a fixed-rate program: you get one camera image every 40 ms (typically). There is no way you can finish early. Every 40 ms the CPU use should spike. With MP to 100%, yes. But the spike should be shorter, and averaged over 40 ms the result should be about 5%. – MSalters Jan 20 '11 at 12:21

4 Answers

19

In 2010 OpenMP, each worker thread does a spin-wait of about 200 ms after task completion. In my case, with an I/O wait followed by a repetitive OpenMP task, this massively loads the CPU.

The solution is to change this behaviour; Intel C++ has an extension routine for this, kmp_set_blocktime(). However, Visual 2010 offers no such possibility.

This Autodesk note discusses the problem for Intel C++. That compiler first introduced the behavior, but allows changing it (see above). Visual 2010 switched to it, but without a workaround like Intel's.

So to sum it up, switching to Intel C++ and using kmp_set_blocktime(0) solved it.
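As a minimal sketch of that call, assuming the code is built with an Intel compiler whose omp.h declares the kmp_* extensions: the wrapper name set_worker_blocktime and the HAVE_KMP_BLOCKTIME macro are hypothetical, and on other compilers the function simply does nothing and reports that.

```cpp
#include <cassert>

#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
#include <omp.h>                 // Intel's omp.h declares kmp_set_blocktime
#define HAVE_KMP_BLOCKTIME 1     // hypothetical helper macro
#else
#define HAVE_KMP_BLOCKTIME 0
#endif

// Hypothetical wrapper: ask the Intel OpenMP runtime to let idle worker
// threads sleep after spinning for `milliseconds`.
// Returns false when the extension is not available in this build.
bool set_worker_blocktime(int milliseconds) {
    assert(milliseconds >= 0);
#if HAVE_KMP_BLOCKTIME
    kmp_set_blocktime(milliseconds);
    return true;
#else
    (void)milliseconds;          // nothing to configure without the extension
    return false;
#endif
}
```

It would be called once, before the acquisition loop starts, for example set_worker_blocktime(0).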

Thanks to John Lilley from DataLever Corporation on the other MSDN thread

Issue has been submitted to MS Connect, and received the "won't fix" feedback.

CharlesB
  • 1
    Set spin block time to some small value, e.g. 20, not straight zero. Intel OpenMP freezes in some cases, if value is exactly 0. I've encountered this issue in many occasions over the years. – Pavel Holoborodko Nov 10 '16 at 05:31
6

With OpenMP 3.0 the spinwait can be deactivated via OMP_WAIT_POLICY:

_putenv_s( "OMP_WAIT_POLICY", "PASSIVE" );

The effect is basically the same as with kmp_set_blocktime(0), but as we set the environment variable OMP_WAIT_POLICY during runtime, it'll only affect the current process and child processes.

Of course OMP_WAIT_POLICY can also be set by a launcher application, e.g. Blender handles it that way.

A hotfix for VC2010 is available here; later versions such as VC2013 support it directly.
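A rough sketch of how this could be wired into a program, under some assumptions: the helper names request_passive_wait and sum_array are made up for illustration, the array is assumed to hold int values, and the setenv branch is only there so the sketch also builds outside Windows. Whether the runtime honors a value set after process start depends on when that OpenMP implementation reads its environment, so the call should come before the first parallel region.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: request passive (sleeping) waits by setting
// OMP_WAIT_POLICY for this process and any child processes.
// Returns true if the variable reads back as "PASSIVE".
bool request_passive_wait() {
#ifdef _WIN32
    if (_putenv_s("OMP_WAIT_POLICY", "PASSIVE") != 0) return false;
#else
    if (setenv("OMP_WAIT_POLICY", "PASSIVE", 1) != 0) return false;
#endif
    const char* value = std::getenv("OMP_WAIT_POLICY");
    return value != nullptr && std::strcmp(value, "PASSIVE") == 0;
}

// The same reduction as in the question, wrapped in a function.
// Without an OpenMP compiler flag the pragma is ignored and the loop runs serially.
long long sum_array(const int* data, int size) {
    assert(data != nullptr || size == 0);
    long long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int j = 0; j < size; ++j)
        sum += data[j];
    return sum;
}
```

Calling request_passive_wait() at the very start of main, before any image is processed, keeps the setting local to the application instead of changing it system-wide.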

user3671833
4

You could try the new Concurrency Runtime that ships with VS2010, starting with your test sample.

That is,

for (int i=0; i<1000; ++i)
  {
    GetImage(buffer);//wait for image
    Copy2Array(buffer, my_array);

    long long sum = 0;//do some simple OpenMP parallel loop
    #pragma omp parallel for reduction(+:sum)
    for (int j=0; j<size; ++j)
      sum += my_array[j];
  }

would become

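for (int i=0; i<1000; ++i)
  {
    GetImage(buffer);//wait for image
    Copy2Array(buffer, my_array);

    // requires #include <ppl.h>
    const int chunk = 1000;//elements per task, to amortize scheduling overhead
    Concurrency::combinable<long long> partial;//one accumulator per thread
    Concurrency::parallel_for(0, (size + chunk - 1) / chunk, [&](int j) {
      //clamp the last chunk so a size that is not a multiple of chunk is still covered
      const int end = (j + 1) * chunk < size ? (j + 1) * chunk : size;
      long long local = 0;
      for (int k = j * chunk; k < end; ++k)
        local += my_array[k];
      partial.local() += local;
    });
    //merge the per-thread partial sums into the final result
    long long sum = partial.combine([](long long a, long long b) { return a + b; });
  }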
Puppy
1

I tested another acquisition board, and the problem is identical, so the culprit is VC++2010. Microsoft made OpenMP implementation changes that screw up programs like mine, as a thread on the MSDN forums shows.

CharlesB