I am building a Morton number for spatial indexing: I have eight unsigned 16-bit numbers whose bits are interleaved into a single __int128 value. Efficiency is crucial, so the naive solution (looping over every bit) is too expensive, and so is building eight separate 128-bit numbers and combining them.
I am using GCC. The target machine is 64-bit but does not support BMI2.
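
For reference, here is a minimal sketch of the naive bit-by-bit loop mentioned above, which is the baseline to beat. The function name `morton8x16_naive` and the bit order (bit b of input i goes to output bit 8*b + i) are illustrative assumptions, not a fixed layout.

```c
#include <stdint.h>

/* Naive baseline: interleave the bits of eight 16-bit inputs into one
   128-bit Morton number, one bit at a time (128 iterations in total).
   Assumed layout: bit b of v[i] is placed at output bit 8*b + i. */
unsigned __int128 morton8x16_naive(const uint16_t v[8])
{
    unsigned __int128 code = 0;
    for (int b = 0; b < 16; ++b) {          /* bit position within each input */
        for (int i = 0; i < 8; ++i) {       /* which of the eight inputs */
            unsigned __int128 bit = (v[i] >> b) & 1u;
            code |= bit << (8 * b + i);     /* highest shift is 8*15 + 7 = 127 */
        }
    }
    return code;
}
```

This uses only plain shifts and masks on GCC's unsigned __int128, so it does not need BMI2, but it performs 128 shift-and-OR steps per Morton number.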
How can I speed up the computation?