I will include only parts of the generated code. The asm:
mov r15, 10000000 ; 10 million
now ; get time
lbl:
dec r15
include "temp_code.asm"
cmp r15, 0
jne lbl
now
section '.data' data readable writeable
include "temp_data.asm"
temp_code.asm contains:
mov rbx, 0
mov rax, [numbers0 + 0 * 8]
mov rcx, [numbers0 + 1 * 8]
imul rax, rcx
add rbx, rax
mov rax, [numbers0 + 2 * 8]
mov rcx, [numbers0 + 3 * 8]
imul rax, rcx
add rbx, rax
...
mov rax, [numbers99 + 18 * 8]
mov rcx, [numbers99 + 19 * 8]
imul rax, rcx
add rbx, rax
mov rax, rbx
4200 lines in total, which correspond to 100 lines of Python.
temp_data.asm contains:
numbers0 dq 103,253,479,962,468,91,543,382,761,923,292,696,255,35,726,141,282,260,727,110
...
numbers99 dq 445,543,544,833,136,474,12,337,652,34,68,916,184,839,263,373,590,342,214,984
These are random numbers from 0 to 1000.
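For reference, the two generated files could be produced by a small generator along these lines (a hypothetical sketch; the author's actual generator is not shown, and the seed is my own choice):

```python
import random

random.seed(0)  # hypothetical seed; the original numbers were random in [0, 1000]
data_lines = []
code_lines = ["mov rbx, 0"]
for n in range(100):
    nums = [random.randrange(0, 1001) for _ in range(20)]
    data_lines.append(f"numbers{n} dq " + ",".join(map(str, nums)))
    for p in range(10):  # 10 products of adjacent pairs per 20-element array
        code_lines += [
            f"mov rax, [numbers{n} + {2 * p} * 8]",
            f"mov rcx, [numbers{n} + {2 * p + 1} * 8]",
            "imul rax, rcx",
            "add rbx, rax",
        ]
code_lines.append("mov rax, rbx")

with open("temp_data.asm", "w") as f:
    f.write("\n".join(data_lines) + "\n")
with open("temp_code.asm", "w") as f:
    f.write("\n".join(code_lines) + "\n")
```

This emits 100 `dq` data lines and roughly 4000 instruction lines, in line with the reported totals.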
The corresponding Python code (run under PyPy):
def f():
    temp0 = now()
    # m is an array containing 2000 random numbers
    for i in range(100000):  # 100 thousand iterations
        m[0]*m[1]+m[2]*m[3]+m[4]*m[5]+m[6]*m[7]+m[8]*m[9]+m[10]*m[11]+m[12]*m[13]+m[14]*m[15]+m[16]*m[17]+m[18]*m[19]
        ...
        m[1980]*m[1981]+m[1982]*m[1983]+m[1984]*m[1985]+m[1986]*m[1987]+m[1988]*m[1989]+m[1990]*m[1991]+m[1992]*m[1993]+m[1994]*m[1995]+m[1996]*m[1997]+m[1998]*m[1999]
    temp1 = now()
    print(temp1 - temp0)
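now() is presumably a timing helper; a self-contained, scaled-down sketch of this benchmark (my stand-ins: time.perf_counter for now(), a shorter loop, and a single short expression line instead of 100 long ones) would be:

```python
import random
import time

# Stand-in data: 2000 random numbers in [0, 1000], as in the original benchmark
m = [random.randrange(0, 1001) for _ in range(2000)]

def now():
    return time.perf_counter()  # stand-in for the author's now() helper

def f():
    temp0 = now()
    for i in range(1000):  # reduced from 100000 for a quick check
        m[0]*m[1] + m[2]*m[3] + m[4]*m[5] + m[6]*m[7] + m[8]*m[9]
    temp1 = now()
    print(temp1 - temp0)
    return temp1 - temp0

f()
```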
The asm code runs in 6.755 seconds, PyPy in 30.109 seconds, so PyPy is 446 times slower per iteration (yes, 446: it iterates only 100,000 times versus 10,000,000 in asm). I cannot believe my eyes. What is happening here?
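The 446 figure is a per-iteration ratio, which can be checked directly from the reported numbers:

```python
# Reported totals and iteration counts from the benchmark above
asm_total, asm_iters = 6.755, 10_000_000
pypy_total, pypy_iters = 30.109, 100_000

asm_per_iter = asm_total / asm_iters      # seconds per asm loop iteration
pypy_per_iter = pypy_total / pypy_iters   # seconds per Python loop iteration

print(round(pypy_per_iter / asm_per_iter))  # 446
```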
Edit: even a naive interpreter that I wrote for my own dynamic language runs just 260 times slower than Python, in which it is written. But CPython (which ran this benchmark in 8.978 seconds, 3.35 times faster than PyPy) is half-compiler: it compiles to bytecode.
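That bytecode compilation can be inspected with the standard dis module; a small illustration (g and the short m here are my own stand-ins for one term of the benchmark expression):

```python
import dis

m = [103, 253, 479, 962]  # small stand-in for the 2000-element array

def g():
    return m[0] * m[1] + m[2] * m[3]

dis.dis(g)  # prints the bytecode CPython compiled g to
```

The output shows LOAD/BINARY instructions for each index, multiply, and add, which is what the interpreter loop then dispatches one at a time.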