Here is my very simple question. With ICC, I know it is possible to use #pragma simd to force vectorization of loops that the compiler chooses not to vectorize. Is there something analogous in GCC? Or is there any plan to add this feature in a future release?

A closely related question: what about forcing vectorization with Graphite?

phuclv

1 Answer

As long as gcc is allowed to use SSE/SSE2/etc. instructions, the compiler will in general produce vector instructions when it decides that doing so is worthwhile. Like most things in compilers, this requires some luck, planning and care from the programmer, so that the compiler does not conclude "maybe this isn't safe" or "this is too complicated, I can't figure out what's going on". But quite often it succeeds if you are using a reasonably modern version of gcc (4.x versions should all do this).

You can allow the compiler to use SSE or SSE2 instructions by adding -msse or -msse2 (and similarly for later SSE extensions). -msse2 is the default on x86-64.

I'm not aware of any way that you can FORCE this, however. The compiler will either vectorize the loop because it is satisfied that it's a good solution, or it won't.
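
For comparison with ICC's pragma: GCC releases from 4.9 onward accept OpenMP's `#pragma omp simd` when compiled with -fopenmp-simd (or -fopenmp). It is a strong request to the vectorizer rather than a guarantee. The sketch below is only an illustration; the function name `saxpy` and its parameters are hypothetical and not taken from the question.

```c
#include <stddef.h>

/* Hypothetical kernel: y[i] += a * x[i] for i in [0, n).
   `restrict` promises the compiler that x and y do not overlap,
   which removes one common reason for refusing to vectorize. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    /* Takes effect only with -fopenmp-simd or -fopenmp.
       Without those flags the pragma is ignored and the loop
       still computes the same result as ordinary scalar code. */
    #pragma omp simd
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}
```

Assuming the function lives in a file called saxpy.c (a made-up name), something like `gcc -O3 -msse2 -fopenmp-simd -fopt-info-vec -c saxpy.c` should report whether the loop was vectorized; older 4.x releases use -ftree-vectorizer-verbose=N for the report instead.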

Sorry, can't answer about Graphite.

Mats Petersson
  • Yes, I know what you mean. I just want to force some loops to be vectorized because if I do that with ICC, I get some performance improvement. So, I'm curious to see the reaction of GCC. But I need to discover whether it is possible and how to force vectorization. Thanks anyway. – user2047635 Feb 06 '13 at 16:36
  • @user2047635 If you're at the point where you think you can do better than the compiler, you might as well just manually vectorize it yourself with intrinsics. – Mysticial Feb 06 '13 at 17:38
  • Or better yet, write it in assembler all the way - that way, you have 100% control over which instructions come in which order, what registers are used where, etc. – Mats Petersson Feb 06 '13 at 17:42
  • You're both right. But things are not so simple. I am investigating a class of programs sharing a specific feature, i.e. a loop nest with very small trip counts. So using intrinsics means building a compiler/code translator/generator (call it whatever you prefer) to generate them, and this would be more complicated than what I would have to build to make the transformations I am currently applying to the loops (so far manually, for experimental purposes). – user2047635 Feb 06 '13 at 17:54
  • Have you actually looked at what the gcc compiler produces? – Mats Petersson Feb 06 '13 at 17:58
  • Yep, it basically fully unrolls the loops and uses scalar instructions (e.g. vmulsd; notice the v in front of the instruction, meaning I compiled with -mavx). – user2047635 Feb 06 '13 at 18:07
  • Ah, even funnier: if I look at the vectorizer report, it appears that the compiler completely skipped (the vectorization of) the innermost loop. It prints that it refused to vectorize some loops, but it doesn't print anything for the innermost one. I thought this was due to skipping loops with very short trip counts, but that wouldn't explain why it reports something for the outer loops, which have exactly the same trip count. – user2047635 Feb 06 '13 at 18:09
  • The AVX code may not be quite as mature as the SSE variant, so you may want to try it with -msse2 instead. – Mats Petersson Feb 06 '13 at 18:13
  • There are probably cost-model tuning options that could make GCC emit vectorized code that might be slower than scalar, or at least GCC thinks so. If it can find a way to vectorize at all; it can't for loops whose trip-count can't be calculated before the first iteration, e.g. search loops that implement strlen or memcmp. – Peter Cordes Feb 04 '23 at 08:51
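
The distinction drawn in the last comment can be sketched with two small loops. This is only an illustration under stated assumptions: the function names `sum_ints` and `my_strlen` are hypothetical, and the flags mentioned in the comments (-fvect-cost-model=unlimited, -fopt-info-vec-missed) are the GCC options I would try first, not something the thread confirms as the fix.

```c
#include <stddef.h>

/* Countable loop: the trip count n is known before the first
   iteration, so the vectorizer can at least consider it.
   Compiling with -O3 -fvect-cost-model=unlimited asks GCC to
   vectorize even when its cost model predicts no speedup. */
int sum_ints(const int *v, size_t n)
{
    int s = 0;
    for (size_t i = 0; i < n; ++i)
        s += v[i];
    return s;
}

/* Search loop: the exit condition depends on data loaded inside
   the loop, so the trip count cannot be computed up front. This
   is the strlen-style case the comment describes as not
   vectorizable; -fopt-info-vec-missed should explain the refusal. */
size_t my_strlen(const char *s)
{
    size_t n = 0;
    while (s[n] != '\0')
        ++n;
    return n;
}
```

Relaxing the cost model only changes GCC's profitability decision. It does not help when the loop cannot be analysed in the first place, and the resulting code may well be slower than the scalar version.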