Want to understand load imputed by floating point instructions

Question

At the outset: this is part discussion, part problem-solving question. No offence is intended to anyone.

I have written, in 64-bit assembly, an MT Prime based 64-bit random number generator. The generator function has to be called about 8 billion times to populate an array of size 2048x2048x2048, each time producing a random number in 1..small_value (usually 32).

I had two possible next steps:

(a) Keep generating numbers, compare each with the limits [1..32], and discard those that fall outside. The run time for this logic is 181,817 ms, measured by calling the clock() function.

(b) Take the 64-bit random number output in RAX, scale it with the FPU to lie in [0..1], and then scale it up to the desired range [1..32]. The code sequence for this is below:

 mov word ptr initialize_random_number_scaling,dx
 fnclex             ; clears status flag
 call generate_fp_random_number ; returns a random number in ST(0) between [0..1]
 fimul word ptr initialize_random_number_scaling ; Mults ST(0) & stores back in ST(0)
 mov word ptr initialize_random_number_base,ax ; Saves base to a memory
 fiadd word ptr initialize_random_number_base  ; adds the base to the scaled fp number
 frndint                            ; rounds off the ST(0)
 fist word ptr initialize_random_number_result ; and stores this number to result.
 ffree st(0)               ; releases ST(0)
 fincstp                       ; Logically pops the FPU
 mov ax, word ptr initialize_random_number_result       ; and saves it to AX

The instructions in generate_fp_random_number are:

 shl rax,1  ; RAX gets the original 64 bit random number using MT prime algorithm
 shr ax,1   ; Clear top bit
 mov qword ptr random_number_generator_act_number,rax ; Save the number in memory as we cannot move to ST(0) a number from register
 fild   qword ptr random_number_generator_max_number    ; Load 0x7FFFFFFFFFFFFFFFH
 fild   qword ptr random_number_generator_act_number    ; Load our number
 fdiv   st(0),st(1) ; We return the value through ST(0) itself, divide our random number with max possible number
 fabs
 ffree st(1)    ; release the st(1)
 fld1           ; push to top of stack a 1.0
 fcomip st(0), st(1)    ; compares our number in ST(1) with ST(0) and sets CF.
 jc generate_fp_random_get_next_no ; if ST(0) (=1.0) < ST(1) (our no), we need a new no
 fldz               ; push to top of stack a 0.0
 fcomip st(0),st(1) ; if ST(0) (=0.0) >ST(1) (our no) clears CF
 jnc generate_fp_random_get_next_no ; so if the number is above zero the CF will be set
 fclex

The problem is that just adding these instructions makes the run time jump to a whopping 5,633,963 ms! I also wrote the above using xmm registers as an alternative, and the difference is marginal (5,633,703 ms).

Could anyone tell me how much load these additional instructions impute to the total run time? Is the FPU really this slow? Or am I missing a trick? As always, all ideas are welcome, and I am grateful for your time and effort.

Env: Windows 7 64-bit on an Intel 2700K CPU overclocked to 4.4 GHz, 16 GB RAM, debugged in the VS 2012 Express environment.

score 0 · Answer 1 · answered May 25 '13 at 07:51

"mov word ptr initialize_random_number_base,ax ; Saves base to a memory"

If you want maximum speed, you must work out how to keep the instructions and the data you write in different sections of memory.

Rewriting data in the same area of cache as the code creates a "self modifying code" situation.

Your compiler may do this for you, or it may not. You need to know, because unoptimised assembly code runs 10 to 50 times slower.

"All modern processors cache code and data memory for efficiency. Performance of assembly-language code can be seriously impaired if data is written to the same block of memory as that in which the code is executing, because it may cause the CPU repeatedly to reload the instruction cache (this is to ensure self-modifying-code works correctly). To avoid this, you should ensure that code and (writable) data do not occupy the same 2 Kbyte block of memory. "

http://www.bbcbasic.co.uk/bbcwin/manual/bbcwina.html#cache

This is not self-modifying code (it is also written entirely in assembly): the variable initialize_random_number_base is declared as a define word in the .data segment, which is significantly larger than 2K. However, the caution is indeed useful. Thanks for that. I therefore added some padding to the data segment, to separate it from the code segment by more than 2K. – quasar66, May 25 '13 at 12:04

score 0 · Answer 2 · answered May 25 '13 at 08:20

There's a ton of stuff in your code that I can see no reason for. If there was a reason, feel free to correct me, but otherwise here are my alternatives:

For generate_fp_random_number

shl rax, 1
shr rax, 1
mov qword ptr act_number, rax
fild qword ptr max_number
fild qword ptr act_number
fdivrp   ; divide actual by max and pop
; and that's it. It's already within bounds.
; It can't be outside [0, 1] by construction.
; It can't be < 0 because we just divided two positive numbers,
; and it can't be > 1 because we divided by the max it could be
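
For readers who find the x87 stack hard to follow, the same normalization can be sketched in C. This is only an illustration of what the sequence above computes, not a replacement for it. The function name to_unit_interval is made up for this sketch, and long double is assumed to stand in for the 80-bit x87 format (on MSVC it is a plain double, which does not change the bounds argument).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the original code): mirrors the
   shl/shr + fild + fild + fdivrp sequence. */
long double to_unit_interval(uint64_t raw)
{
    /* Same constant the assembly loads as max_number. */
    const uint64_t max_number = 0x7FFFFFFFFFFFFFFFULL;

    /* shl rax, 1 / shr rax, 1: clear the top bit so that the value is
       non-negative when fild reads it as a signed 64-bit integer. */
    uint64_t act_number = (raw << 1) >> 1;

    /* fdivrp: actual divided by max. */
    long double u = (long double)act_number / (long double)max_number;

    /* The bounds hold by construction, so no retry loop is needed. */
    assert(u >= 0.0L && u <= 1.0L);
    return u;
}
```

The assert is there only to document the claim in the comments above: once the top bit is cleared, the quotient cannot leave [0, 1].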

For the other thing:

mov word ptr scaling, dx
mov word ptr base, ax
call generate_fp_random_number
fimul word ptr scaling
fiadd word ptr base
fistp word ptr result  ; just save that thing
mov ax, word ptr result
; the default rounding mode is round to nearest,
; so the slow frndint is unnecessary

Also note the complete lack of ffree's etc. By making the right instruction pop, it all just worked out. It usually does.

Thanks for the advice. Yes, I was perhaps being too conservative about the limits etc. I also had not taken a serious look at fdivrp. I have made these changes and started the process, and shall update the timing once it completes. – quasar66, May 25 '13 at 12:00
