This Julia function seems to be quite inefficient (an order of magnitude slower than the equivalent Pythran / C++ code, even after the Julia warmup)...
function my_multi_broadcast(a)
    10 * (2*a.^2 + 4*a.^3) + 2 ./ a
end
arr = ones(1000, 1000)
my_multi_broadcast(arr)
I guess I am just not writing it correctly... How can one speed up such "multi broadcasts" in Julia? I guess/hope I don't need to expand the loops...
Edit after the first answer
Thank you! With my setup, the Pythran solutions (in place and out of place) are still 1.5 to 2 times faster (without OpenMP). Is there a way to activate SIMD instructions in Julia? Or another way to speed up such CPU computations?
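For reference, here is a minimal sketch of what fused out-of-place and in-place Julia versions could look like. It assumes the first answer suggested dot fusion (the answer itself is not reproduced here), and the function names are hypothetical. In the original function the undotted `*` and `+` are ordinary array operations, so each one allocates a temporary array; the `@.` macro dots every operator and call so that the whole expression becomes a single loop.

```julia
# Hypothetical names, not taken from the original answer.

# Out of place: one fused loop, one allocation for the result.
my_multi_broadcast_fused(a) = @. 10 * (2 * a^2 + 4 * a^3) + 2 / a

# In place: `@. a = ...` lowers to `a .= ...`, so `a` is overwritten
# element by element without allocating a new array.
function my_multi_broadcast_fused!(a)
    @. a = 10 * (2 * a^2 + 4 * a^3) + 2 / a
    return a
end

arr = ones(1000, 1000)
res = my_multi_broadcast_fused(arr)
my_multi_broadcast_fused!(arr)
```

Timings depend on the setup, so these would need to be benchmarked (after warmup) against the Pythran versions below.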
The Python code:
from transonic import jit

@jit
def broadcast(a):
    return 10 * (2*a**2 + 4*a**3) + 2 / a

@jit
def broadcast_inplace(a):
    a[:] = 10 * (2*a**2 + 4*a**3) + 2 / a
Edit after the @simd suggestion
It seems that @simd does not work out of the box, i.e. just by adding it at the beginning of the line.
ERROR: LoadError: LoadError: Base.SimdLoop.SimdError("for loop expected")
Stacktrace:
[1] compile(::Expr, ::Bool) at ./simdloop.jl:54
[2] @simd(::LineNumberNode, ::Module, ::Any) at ./simdloop.jl:126
[3] include at ./boot.jl:317 [inlined]
[4] include_relative(::Module, ::String) at ./loading.jl:1044
[5] include(::Module, ::String) at ./sysimg.jl:29
[6] exec_options(::Base.JLOptions) at ./client.jl:231
[7] _start() at ./client.jl:425
I guess that one would have to expand the for loops, but then the code (i) becomes much less readable and (ii) is no longer independent of the dimension.
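As a rough sketch of what such an expanded loop could look like: `@simd` has to be placed directly in front of a `for` loop, which is what the error above complains about. Iterating with `eachindex` keeps the function independent of the number of dimensions, although readability does suffer compared with the one-line broadcast. The function name and the separate output argument are assumptions for illustration, and I have not measured whether this actually closes the gap with Pythran.

```julia
# Hypothetical helper, not from the original post.
function my_multi_broadcast_loop!(out, a)
    # eachindex(out, a) yields indices valid for both arrays,
    # whatever their dimensionality (it errors if the shapes differ).
    @inbounds @simd for i in eachindex(out, a)
        x = a[i]
        out[i] = 10 * (2 * x^2 + 4 * x^3) + 2 / x
    end
    return out
end

arr = ones(1000, 1000)
res = my_multi_broadcast_loop!(similar(arr), arr)

# The same function works unchanged on a 3D array.
arr3 = fill(2.0, 4, 5, 6)
res3 = my_multi_broadcast_loop!(similar(arr3), arr3)
```

Passing the same array as both `out` and `a` gives an in-place variant, since each element is read before it is overwritten.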
It seems that this is a case where simple Python/NumPy code accelerated with Pythran runs faster than the equivalent Julia code (unless there is a way to speed this up in Julia, or a future Julia version solves it). Interesting...