At outset, this may be a part-discussion part-solving kind of questions. No intent to offend anyone there.
I have written in 64 bit assembly the algorithm to generate MT Prime based random number generator for 64 bits. This generator function is required to be called 8 billion times to populate an array of size 2048x2048x2048, and generate a random no between 1..small_value (usually, 32)
Now I had two next steps possibilities :
(a) Keep generating numbers, compare with the limits [1..32] and discard those that don't fall within. The run time for this logic is 181,817 ms, measured by calling clock() function.
(b) take the 64 bit random number output in RAX,and scale it using FPU to be between [0..1], and then scale it up in the desired range [1..32] The code sequence for this is as below :
mov word ptr initialize_random_number_scaling,dx
fnclex ; clears status flag
call generate_fp_random_number ; returns a random number in ST(0) between [0..1]
fimul word ptr initialize_random_number_scaling ; Mults ST(0) & stores back in ST(0)
mov word ptr initialize_random_number_base,ax ; Saves base to a memory
fiadd word ptr initialize_random_number_base ; adds the base to the scaled fp number
frndint ; rounds off the ST(0)
fist word ptr initialize_random_number_result ; and stores this number to result.
ffree st(0) ; releases ST(0)
fincstp ; Logically pops the FPU
mov ax, word ptr initialize_random_number_result ; and saves it to AX
And the instructions in generate_fp_random_number are as below :
shl rax,1 ; RAX gets the original 64 bit random number using MT prime algorithm
shr ax,1 ; Clear top bit
mov qword ptr random_number_generator_act_number,rax ; Save the number in memory as we cannot move to ST(0) a number from register
fild qword ptr random_number_generator_max_number ; Load 0x7FFFFFFFFFFFFFFFH
fild qword ptr random_number_generator_act_number ; Load our number
fdiv st(0),st(1) ; We return the value through ST(0) itself, divide our random number with max possible number
fabs
ffree st(1) ; release the st(1)
fld1 ; push to top of stack a 1.0
fcomip st(0), st(1) ; compares our number in ST(1) with ST(0) and sets CF.
jc generate_fp_random_get_next_no ; if ST(0) (=1.0) < ST(1) (our no), we need a new no
fldz ; push to top of stack a 0.0
fcomip st(0),st(1) ; if ST(0) (=0.0) >ST(1) (our no) clears CF
jnc generate_fp_random_get_next_no ; so if the number is above zero the CF will be set
fclex
The problem is, just by adding these instructions, the run time jumps to a whopping 5,633,963 ms! I have written the above using xmm registers as an alternative, and the difference is absolutely marginal. (5,633,703 ms).
Would anyone kindly guide me on what degree of load do these additional instructions impute to the total run time? Is the FPU really this slow ? Or am I missing a trick? As always, all ideas are welcome and am grateful for your time and efforts.
Env : Windows 7 64 bit on Intel 2700K CPU overclocked to 4.4 GHz 16 GB RAM debugged in VS 2012 Express environment