
I will include only parts of the generated code. Asm:

mov r15, 10000000 ; 10 million
now ; get time
lbl:
dec r15

include "temp_code.asm"

cmp r15, 0
jne lbl
now

section '.data' data readable writeable
include "temp_data.asm"

temp_code.asm contains:

mov rbx, 0
mov rax, [numbers0 + 0 * 8]
mov rcx, [numbers0 + 1 * 8]
imul rax, rcx
add rbx, rax
mov rax, [numbers0 + 2 * 8]
mov rcx, [numbers0 + 3 * 8]
imul rax, rcx
add rbx, rax

...

mov rax, [numbers99 + 18 * 8]
mov rcx, [numbers99 + 19 * 8]
imul rax, rcx
add rbx, rax
mov rax, rbx

4,200 lines in total, which correspond to 100 lines of Python.
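For context, a generator for temp_code.asm could look like this (a hypothetical Python sketch, not my exact script):

# Hypothetical sketch of a temp_code.asm generator: emits the
# load/load/imul/add pattern for each adjacent pair of every numbersN array.
with open("temp_code.asm", "w") as out:
    out.write("mov rbx, 0\n")
    for n in range(100):            # numbers0 .. numbers99
        for p in range(10):         # 10 products per 20-element array
            out.write(f"mov rax, [numbers{n} + {2 * p} * 8]\n")
            out.write(f"mov rcx, [numbers{n} + {2 * p + 1} * 8]\n")
            out.write("imul rax, rcx\n")
            out.write("add rbx, rax\n")
    out.write("mov rax, rbx\n")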

temp_data.asm contains:

numbers0 dq 103,253,479,962,468,91,543,382,761,923,292,696,255,35,726,141,282,260,727,110

...

numbers99 dq 445,543,544,833,136,474,12,337,652,34,68,916,184,839,263,373,590,342,214,984

These are random numbers from 0 to 1000.
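The matching data generator could be (again a hypothetical sketch):

# Hypothetical sketch of a temp_data.asm generator:
# 100 arrays of 20 random integers in [0, 1000].
import random

with open("temp_data.asm", "w") as out:
    for n in range(100):
        values = ",".join(str(random.randint(0, 1000)) for _ in range(20))
        out.write(f"numbers{n} dq {values}\n")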

Corresponding PyPy code:

def f():
    temp0 = now()
    # m is array containing 2000 random numbers
    for i in range(100000):  # 100 thousand
        m[0]*m[1]+m[2]*m[3]+m[4]*m[5]+m[6]*m[7]+m[8]*m[9]+m[10]*m[11]+m[12]*m[13]+m[14]*m[15]+m[16]*m[17]+m[18]*m[19]

        ...

        m[1980]*m[1981]+m[1982]*m[1983]+m[1984]*m[1985]+m[1986]*m[1987]+m[1988]*m[1989]+m[1990]*m[1991]+m[1992]*m[1993]+m[1994]*m[1995]+m[1996]*m[1997]+m[1998]*m[1999]
    temp1 = now()
    print(temp1 - temp0)
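(`now()` above is shorthand; assume a high-resolution timer such as:)

# Assumption: now() wraps a high-resolution clock.
import time

def now():
    return time.perf_counter()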

The asm code runs in 6.755 seconds, PyPy in 30.109 seconds, so PyPy is 446 times slower per iteration (yes, 446: it iterates just 100,000 times versus 10,000,000 in asm, i.e. 30.109 s / 100,000 ≈ 301 µs per iteration against 6.755 s / 10,000,000 ≈ 0.68 µs). I cannot believe my eyes. What is happening here?
Edit: even the naive interpreter I wrote for my dynamic language runs only 260 times slower than Python, the language it is written in. And CPython (which ran this benchmark in 8.978 seconds, 3.35 times faster than PyPy) is half a compiler: it compiles to bytecode.

DSblizzard
  • 30.109 / 6.755 doesn't seem to be equal to 446 – Vladimir Kolenov Nov 05 '19 at 07:32
  • Asm: 10 million iterations, PyPy: 100 thousand – DSblizzard Nov 05 '19 at 07:33
  • PyPy is JIT, not AOT. If you want AOT compilation, use Cython. Running something once doesn't let a JIT do much. – user2357112 Nov 05 '19 at 07:39
  • Why the handcrafted list of multiplications in the Python code? Why not use a loop? – Some programmer dude Nov 05 '19 at 07:40
  • @Some programmer dude: to make it closer to the asm version. And in asm I did it to negate the slowing effect of the surrounding loop code (jne lbl etc.) – DSblizzard Nov 05 '19 at 07:44
  • Your asm is only half optimized, too. You could use memory-source operands for `imul`, which will micro-fuse on Haswell and later, saving 1 front-end uop. You still have a back-end bottleneck of 1 `imul` per clock throughput, 2 loads per clock throughput, and the 1-cycle `add` latency dep chain, but at least it removes the front-end bottleneck so you're more likely to saturate the back-end. Also, use `dec r15` / `jnz` at the bottom of the loop, otherwise you're defeating the purpose of counting down. The possible gains are probably only a couple % on Haswell/Skylake. (See the asm sketch after these comments.) – Peter Cordes Nov 05 '19 at 08:10
  • And of course you could vectorize this with SIMD if you take advantage of the fact that the numbers are actually small, so 32x32 -> 64-bit multiplies work with SSE2 `pmuludq` (otherwise you need AVX-512 for single-instruction 64x64 => 64-bit). Or just plain *use* 32-bit integers and accumulators with SSE4.1 `pmulld`, although that has half the throughput of `pmuludq` on Haswell and later. Still, the possible gains are a factor of at least 2 with SSE2, or 4 with AVX2, and actually more on Skylake, where SIMD-integer multiply has better throughput (per uop) than scalar, running on the FP multiplier units. – Peter Cordes Nov 05 '19 at 08:12
  • Anyway, like @user2357112 says, give PyPy some warm-up runs to JIT a fully-optimized version of your code, rather than an instrumented first attempt that probably doesn't take advantage of the list objects being small integers, and/or other Pythonic overhead (see the warm-up sketch below). – Peter Cordes Nov 05 '19 at 08:16
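A sketch of the micro-optimizations suggested in the comments above (assumed FASM syntax; illustrative only, not benchmarked):

; Memory-source imul avoids a separate load per product, and dec/jnz
; at the bottom of the loop folds the exit test into dec's flags.
lbl:
mov rax, [numbers0 + 0 * 8]
imul rax, [numbers0 + 1 * 8] ; 64-bit imul with a memory operand
mov rbx, rax                 ; first product starts the sum
mov rax, [numbers0 + 2 * 8]
imul rax, [numbers0 + 3 * 8]
add rbx, rax
; ... remaining generated pairs ...
dec r15
jnz lbl                      ; no separate cmp needed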
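And a hypothetical warm-up harness for the PyPy side (names are illustrative), so the timed run executes JIT-compiled code rather than paying the tracing cost:

# Hypothetical sketch: call the hot function a few times before timing
# so PyPy's tracing JIT has already compiled the loop.
import time

def bench(fn, warmup=5):
    for _ in range(warmup):   # untimed warm-up runs
        fn()
    t0 = time.perf_counter()
    fn()                      # timed run on warmed-up code
    return time.perf_counter() - t0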
