Without getting into too many details, the answer is: do whatever is best for the readability or the logic of your code (or of the function in question).
Saying that it makes no difference would not be completely honest: there are probably cases where a non-negligible difference in running time can be measured, but most probably that will not be the case for you.
If you expect the function to be inlined, there will be no difference at all in the end: after inlining, the optimizer will transform the code to the same binary (I have added an example illustrating this at the end of the post). Inlining is what one should try to achieve in such cases: not only does it save call overhead, it also makes optimizations possible that would not be possible otherwise (here is a simple example, where inlining brought the running time from O(n) down to O(1)).
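That linked example aside, here is a minimal sketch of the same effect (my illustration, not taken from the linked question):

/* Viewed on its own, sum_first_n is an O(n) loop. */
static int sum_first_n(int n) {
    int s = 0;
    for (int i = 0; i < n; ++i)
        s += i;
    return s;
}

int caller(void) {
    /* Once sum_first_n is inlined here, n is a compile-time constant
       and an optimizing compiler will typically fold the whole loop
       to the constant 499500, i.e. an O(1) result. */
    return sum_first_n(1000);
}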
If the code is not inlined, then the result depends on the ABI used, but most probably the second version will lead to a slightly more performant binary; still, the advantage is negligible in most cases.
Here I'm taking a look at 64-bit Linux (which uses the System V AMD64 ABI). Cython will translate your example to effectively the following C code:
struct Vec3 {
    double x, y, z;
};

struct Vec3 vadd_v1(struct Vec3 *a, struct Vec3 *b) {
    struct Vec3 out;
    out.x = a->x + b->x;
    out.y = a->y + b->y;
    out.z = a->z + b->z;
    return out;
}

void vadd_v2(struct Vec3 *a, struct Vec3 *b, struct Vec3 *out) {
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
}
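As a side note, the assembly listings below can be reproduced with an invocation along these lines (an assumption on my part; the post does not state the exact compiler or flags, and vec3.c is just a hypothetical file name for the code above):

gcc -O2 -S vec3.c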
When compiled with optimizations on, this leads to the following assembly (reordered slightly here so the two versions can be compared more easily):
vadd_v1:                                vadd_v2:
;out.x = a->x + b->x;                   ;out.x = a->x + b->x;
movsd (%rsi), %xmm2                     movsd (%rdi), %xmm0
addsd (%rdx), %xmm2                     addsd (%rsi), %xmm0
movsd %xmm2, (%rdi)                     movsd %xmm0, (%rdx)
;out.y = a->y + b->y;                   ;out.y = a->y + b->y;
movsd 8(%rsi), %xmm1                    movsd 8(%rdi), %xmm0
addsd 8(%rdx), %xmm1                    addsd 8(%rsi), %xmm0
movsd %xmm1, 8(%rdi)                    movsd %xmm0, 8(%rdx)
;out.z = a->z + b->z;                   ;out.z = a->z + b->z;
movsd 16(%rsi), %xmm0                   movsd 16(%rdi), %xmm0
addsd 16(%rdx), %xmm0                   addsd 16(%rsi), %xmm0
movsd %xmm0, 16(%rdi)                   movsd %xmm0, 16(%rdx)
;return                                 ;return
movq %rdi, %rax
ret                                     ret
An object of type Vec3 is classified as MEMORY, because it consists of three double values (the whole classification algorithm can be looked up in the ABI). Thus, in the first version, the caller is responsible for allocating memory for the return value and for passing its address in the "hidden pointer" %rdi.
As one can see, the first version has an additional movq %rdi, %rax, because the pointer to the returned object must be returned in %rax, as specified by the ABI:
If the type has class MEMORY, then the caller provides space for the return value and passes the address of this storage in %rdi as if it were the first argument to the function. In effect, this address becomes a “hidden” first argument. This storage must not overlap any data visible to the callee through other names than this argument. On return %rax will contain the address that has been passed in by the caller in %rdi.
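In other words, the compiler effectively lowers the first version to something like the following (a conceptual sketch of the calling convention, not actual compiler output; struct Vec3 is the type defined above):

/* Conceptual lowering of vadd_v1 under the System V AMD64 ABI:
   the caller's storage arrives as a hidden first argument in %rdi,
   and its address is handed back in %rax (the extra movq above).
   The real arguments a and b then land in %rsi and %rdx, matching
   the listing. */
struct Vec3 *vadd_v1_lowered(struct Vec3 *hidden_out,
                             struct Vec3 *a, struct Vec3 *b) {
    hidden_out->x = a->x + b->x;
    hidden_out->y = a->y + b->y;
    hidden_out->z = a->z + b->z;
    return hidden_out;
}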
Obviously, the second version is more efficient, but will this one instruction really matter?
However, there are also cases where the first version would be more efficient.
If we used a struct of two doubles rather than a struct of three, the first version would need fewer instructions: the result is no longer classified as MEMORY and is returned in registers (once again reordered for comparison):
vadd_v1:                                vadd_v2:
;out.x = a->x + b->x;                   ;out.x = a->x + b->x;
movsd (%rdi), %xmm0                     movsd (%rdi), %xmm0
addsd (%rsi), %xmm0                     addsd (%rsi), %xmm0
                                        movsd %xmm0, (%rdx)
;out.y = a->y + b->y;                   ;out.y = a->y + b->y;
movsd 8(%rdi), %xmm1                    movsd 8(%rdi), %xmm0
addsd 8(%rsi), %xmm1                    addsd 8(%rsi), %xmm0
                                        movsd %xmm0, 8(%rdx)
;return                                 ;return
ret                                     ret
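For reference, the corresponding C code would be along these lines (my reconstruction; the post only shows the assembly for this case):

struct Vec2 {
    double x, y;
};

/* Vec2 (two doubles, 16 bytes) is classified as SSE, not MEMORY:
   it is returned in %xmm0/%xmm1, so no hidden pointer is needed. */
struct Vec2 vadd2_v1(struct Vec2 *a, struct Vec2 *b) {
    struct Vec2 out;
    out.x = a->x + b->x;
    out.y = a->y + b->y;
    return out;
}

void vadd2_v2(struct Vec2 *a, struct Vec2 *b, struct Vec2 *out) {
    out->x = a->x + b->x;
    out->y = a->y + b->y;
}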
However, there might be additional costs, depending on how the functions in question are called. If the caller itself returns the result by value, one should stick with the value-returning version:
struct Vec3 use_v1(struct Vec3 *in) {
    return vadd_v1(in, in);
}
leads to assembly without any copying of the returned data:
use_v1:
pushq %r12
movq %rsi, %rdx
movq %rdi, %r12
call vadd_v1
movq %r12, %rax
popq %r12
ret
While
void use_v2(struct Vec3 *in, struct Vec3 *out) {
    *out = vadd_v1(in, in);
}
would lead to
use_v2:
pushq %rbx
movq %rdi, %rdx
movq %rsi, %rbx
movq %rdi, %rsi
subq $32, %rsp
movq %rsp, %rdi
call vadd_v1
movdqu (%rsp), %xmm0 ;copying
movq 16(%rsp), %rax ;copying
movups %xmm0, (%rbx) ;copying
movq %rax, 16(%rbx) ;copying
addq $32, %rsp
popq %rbx
ret
where the result of vadd_v1 is created on the stack and is then copied to the pointer out. It must be done this way, because out cannot be passed as the "hidden pointer" to vadd_v1: the compiler doesn't know whether out is used somewhere inside vadd_v1 or not (for example, as a global variable). There is a SO question which looks at the above function in more detail.
The advantage of the pointer version: unless there is a compiler bug, you can be pretty sure that no copying happens.
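For illustration (my sketch, not from the original post): when the caller itself works through a pointer, vadd_v2 writes straight to the destination and no temporary is involved:

void use_v2_direct(struct Vec3 *in, struct Vec3 *out) {
    /* No temporary on the stack, no copy afterwards: vadd_v2
       stores its results directly through out. */
    vadd_v2(in, in, out);
}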
Here is an example showing that, when inlined, both versions lead to the same binary:
double sum_v1(struct Vec3 *a) {
    struct Vec3 d = vadd_v1(a, a);
    return d.x;
}

double sum_v2(struct Vec3 *a) {
    struct Vec3 d;
    vadd_v2(a, a, &d);
    return d.x;
}
Both compile to the same assembly:
sum_v1/sum_v2:
movsd (%rdi), %xmm0
addsd %xmm0, %xmm0
ret
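A caveat worth adding (my note, not from the measurements above): this only works when the optimizer can actually see the callee's body at the call site, e.g. because the functions are defined static (or static inline) in the same translation unit, or because link-time optimization is enabled. In Cython this is usually given, since all cdef functions of a module end up in the same generated C file. A one-line sketch of such a setup:

static struct Vec3 vadd_v1(struct Vec3 *a, struct Vec3 *b) { /* body as above */ }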