2

I am building a Morton number (Z-order code) for spatial indexing: I have eight unsigned 16-bit numbers whose bits are interleaved into a single `__int128`. Efficiency is crucial, so the naive solution (looping over every bit) or building eight separate 128-bit numbers and combining them is too expensive.

I am using GCC; the target machine is 64-bit but has no BMI2 support.

How can I speed up the computation?

Paul R
Evil
  • [This](http://programming.sirrida.de/bit_perm.html#shuffle) may be of some interest. – Matteo Italia Jun 15 '17 at 05:55
  • @MatteoItalia Thank you. Yes, I am aware of that; unfortunately, without BMI2 I have no PDEP or PEXT instructions, and I am looking to compute several codes at once. – Evil Jun 15 '17 at 06:00

1 Answer

3

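If your machine is x86 and supports SSE2, there is a clever answer using the movmsk instructions (`pmovmskb`, exposed as `_mm_movemask_epi8`). Interleaving the bits of eight 16-bit values is the same as transposing an 8×16 bit matrix, so search for "SSE2 bit matrix transpose" for the full code.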
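The movmsk idea can be sketched roughly as follows. This is a minimal illustration, not the code from the linked repository, and it rests on a few assumptions: the names `u128`, `morton8x16_sse2` and `morton8x16_naive` are hypothetical, bit `i` of input `v[j]` is assumed to land at bit `8*i + j` of the result (so `v[0]` occupies the lowest bit of each 8-bit group), and the target is little-endian x86-64. The naive loop is included only to pin down that bit layout and to check the SSE2 version against.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;  /* GCC extension */

/* Reference only: assumed layout is bit i of v[j] -> result bit 8*i + j. */
static u128 morton8x16_naive(const uint16_t v[8])
{
    u128 r = 0;
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 8; ++j)
            r |= (u128)((v[j] >> i) & 1u) << (8 * i + j);
    return r;
}

/* Sketch: transpose the 8x16 bit matrix with pmovmskb. */
static u128 morton8x16_sse2(const uint16_t v[8])
{
    __m128i x  = _mm_loadu_si128((const __m128i *)v);
    /* Split each 16-bit lane into its low and high byte. */
    __m128i lo = _mm_and_si128(x, _mm_set1_epi16(0x00FF));
    __m128i hi = _mm_srli_epi16(x, 8);
    /* Bytes 0..7 = low bytes of v[0..7], bytes 8..15 = high bytes.
       All lanes are in 0..255, so the saturating pack is lossless. */
    __m128i m  = _mm_packus_epi16(lo, hi);

    uint8_t out[16];
    for (int k = 0; k < 8; ++k) {
        /* movemask collects the top bit of every byte:
           low 8 mask bits  = bit (7 - k)  of v[0..7],
           high 8 mask bits = bit (15 - k) of v[0..7]. */
        unsigned mask = (unsigned)_mm_movemask_epi8(m);
        out[7 - k]  = (uint8_t)(mask & 0xFF);
        out[15 - k] = (uint8_t)(mask >> 8);
        /* Shift every byte left by one to expose the next bit. */
        m = _mm_add_epi8(m, m);
    }

    /* Assumes little-endian: out[i] becomes bits 8*i .. 8*i+7. */
    u128 r;
    memcpy(&r, out, sizeof r);
    return r;
}
```

Each call issues eight `pmovmskb` instructions and produces two output bytes per iteration. Whether this beats a magic-number bit-spreading approach on a given CPU would need to be measured.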

Mischa
  • Yes, it does, x86_64. That is one clever idea. I'll wait for a while and then accept if nothing faster shows up. I assume [this SSE2 code](https://github.com/mischasan/sse2) is yours? Thank you. – Evil Jun 15 '17 at 06:07
  • Yes. Sorry, I am posting from a phone, hence the redirect to the article. The code should perform well even for your narrow "matrix". – Mischa Jun 15 '17 at 06:11