
So I have some heavyweight algorithms which I would prefer to run on the VPU, but since there's so much going on, the VPUs tend to get saturated. Is there any way to do something like "Use the VPU; if the VPU is overloaded, use the FPU instead" so I get maximal throughput?

Thanks

user1181950
  • http://stackoverflow.com/questions/16463567/sse-fpu-parallel – phuclv Sep 12 '14 at 04:01
  • Thanks, sorry I missed that. Actually, I have a question about the comments made there. I have a function for clamping which, in an isolated test, is 4 times faster with SSE than with the FPU. But if I replace it (only that clamp function) in the entire program, the overall program is slower with the SSE clamp than with the FPU clamp. What are some possible reasons for that? Since FPU and SSE use the same units, shouldn't the fact that it's faster in isolation mean it is still faster as part of a bigger program? – user1181950 Sep 12 '14 at 04:29
  • I don't think this is possible, because they share the same execution units, meaning that you cannot explicitly tell the CPU to run both of them at once. Also, x87 and the other instruction-set extensions are not separate hardware constructs; I would not be surprised if parts of the circuitry overlap. – Mikhail Aug 02 '15 at 07:16

1 Answer


Re: your comment. One possibility is a problem with mixing SSE and AVX without vzeroupper (maybe you compiled the rest of your code with -march=native or something, and the double-precision math is using AVX). Another is that your SSE version is bigger and causes I-cache misses.

Or maybe your microbenchmark was bogus, and some of your SSE routine was optimized away.
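As a rough illustration of the "optimized away" problem, here is a minimal sketch of a timing harness that keeps the result observable. The names `clamp_scalar`, `clamp_sum`, and `time_clamp_sum` are hypothetical stand-ins, not the asker's code, and the clamp range 0..1 is an assumption. Writing each result to a `volatile` sink makes it harder for the compiler to delete the work, but it is not a guarantee, so checking the generated assembly is still worthwhile.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical scalar clamp; a stand-in for the routine being timed. */
static float clamp_scalar(float x, float lo, float hi)
{
    assert(lo <= hi);
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Clamp every element and return the sum, so the clamped values are used. */
float clamp_sum(const float *in, size_t n, float lo, float hi)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += clamp_scalar(in[i], lo, hi);
    return sum;
}

/* Time `reps` passes over the data (clamp range 0..1 is an assumption).
 * Each result is stored through a volatile pointer so the stores cannot
 * be dropped as dead code. Returns elapsed CPU time in seconds. */
double time_clamp_sum(const float *in, size_t n, int reps, volatile float *sink)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        *sink = clamp_sum(in, n, 0.0f, 1.0f);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

If the timed loop takes close to zero time regardless of `n`, that is a hint the work was removed or hoisted rather than made fast.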

To answer this, a lot more detail about your code is needed, such as whether you're sure your FPU code was really x87 and not just scalar SSE.
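One quick hint about which unit the "FPU" build targets is the standard C99 macro `FLT_EVAL_METHOD` from `<float.h>`. The sketch below assumes a typical GCC-style x86 toolchain, where x87 code generation usually reports 2 (intermediates kept in long double precision) and scalar SSE reports 0. The helper names are hypothetical, and this is only a hint: the reliable check is to look at the disassembly for x87 instructions such as `fld`/`fstp` versus scalar SSE ones such as `movss`/`minss`/`maxss`.

```c
#include <float.h>

/* Returns 1 if the compiler evaluates float expressions in long double
 * precision, which on x86 usually (not always) means x87 code generation. */
int looks_like_x87(void)
{
#ifdef FLT_EVAL_METHOD
    return FLT_EVAL_METHOD == 2;
#else
    return 0; /* pre-C99 headers: cannot tell from this macro */
#endif
}

/* Human-readable description of the evaluation method. */
const char *fp_eval_description(void)
{
#ifdef FLT_EVAL_METHOD
    switch (FLT_EVAL_METHOD) {
    case 0:  return "evaluated in the operand type (typical of scalar SSE)";
    case 1:  return "float evaluated as double";
    case 2:  return "evaluated as long double (typical of x87)";
    default: return "indeterminable or implementation-defined";
    }
#else
    return "unknown";
#endif
}
```

Compiling the same file once with the flags used for the "FPU" build and once with the flags used for the SSE build, then comparing the reported value, gives a first indication of whether the two builds really differ.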

Peter Cordes