4

I am creating a python extension that deals with linear algebra and want it to be as fast as possible.

I have a simple Vector struct, in which I want to perform operations on:

cdef struct Vec3:
    double x, y, z

I am currently split between two kinds of signature for the functions in my module, the first one takes only the inputs as arguments and returns a new vector and the other one, returns the modified data to an argument called out

cdef inline Vec3 vadd(Vec3* a, Vec3* b) nogil:
    cdef Vec3 out
    out.x = a.x + b.x
    out.y = a.y + b.y
    out.z = a.z + b.z
    return out

cdef inline void vadd(Vec3* a, Vec3* b, Vec3* out) nogil:
    out.x = a.x + b.x
    out.y = a.y + b.y
    out.z = a.z + b.z

I've seen both ways in many examples and have no clue of which one is better in terms of speed.

Are they the same or there are advantages of using one over another in some situations?

ead
  • 32,758
  • 6
  • 90
  • 153
Jeacom
  • 102
  • 6
  • Take a look at vpython - https://www.glowscript.org/docs/VPythonDocs/VisualIntro.html – Sharun Aug 02 '19 at 20:38
  • Just as a note, I'm not going to wrap those methods, I would like to know about their C performance. – Jeacom Aug 02 '19 at 20:48
  • 1
    Related: https://stackoverflow.com/questions/28988042/c-when-to-return-by-value-or-pass-reference – ead Aug 02 '19 at 20:58
  • If you are looking for something game specific with basic math primitives written in cython, you can take a look at my library, pyorama, [here](https://github.com/AnishN/pyorama). The relevant code is under the math3d folder. I personally went with the second approach as it allows you to be more flexible with memory layout. I will write up an answer for my reasons as to why I preferred that function declaration. – CodeSurgeon Aug 02 '19 at 23:38
  • 1
    Bottom line, returning a struct causes a copy of all struct data, returning a pointer is 8-bytes (or whatever the pointer size for your OS is) – David C. Rankin Aug 02 '19 at 23:54
  • 1
    @DavidC.Rankin not true in C (although IDK about cython). Popular ABIs have the caller pass the address of the place the return value will be stored and the function can write directly into it (e.g. [Example of return value 3](https://learn.microsoft.com/en-us/cpp/build/x64-calling-convention?view=vs-2019). If you try on popular compilers you can find the address of a local struct in the called function is the same as the address of the object in the caller that it's returned to – M.M Aug 03 '19 at 00:40
  • @M.M - there may be some compilers that provide for a caller return of an address, but gcc doesn't. A function with a locally declared and filled struct returned to the caller will have different address. [Example - struct returned to caller](https://pastebin.com/iZU6GyJp) – David C. Rankin Aug 03 '19 at 00:52
  • Functions to do math are often understood more clearly as `return_valuefoo(a,b)`. – chux - Reinstate Monica Aug 03 '19 at 01:11
  • @DavidC.Rankin It's not "provide for a caller return of an address". The caller provides the address where the return value will go; then the callee writes directly into that. In your example with x64 ABI , registers are used to pass a retval of 16 bytes or less ; try making your struct bigger (e.g. add `int j,k;`). I get the same address for both statements with clang 8.0.0. You can see in [the assembly](https://gcc.godbolt.org/z/3Zk1JX) that gcc just writes to the address given . If you uncomment `main` then it inlines the function and uses two different memory locations. – M.M Aug 03 '19 at 01:42
  • Oh, yes, I understand if passed as a parameter, the point I was making was if using a struct declared local to function and relying on the assignment on return, if the struct size is larger than a pointer size, you are better off allocating and returning a pointer to avoid the copy. – David C. Rankin Aug 03 '19 at 04:36
  • @M.M it is not enough to look just at the called function. When, the copying would happen in the caller, and whether it is needed or not depends on the code of the caller, see for example: https://godbolt.org/z/GfRJLK – ead Aug 04 '19 at 21:12
  • @DavidC.Rankin I am talking about assignment on return, there often is no copy involved (I have provided evidence as such). – M.M Aug 04 '19 at 22:00
  • @ead in your example `out` might alias `in` or some static state , so to be on the safe side the compiler makes a copy. I think that if `out` were `restrict` then the copy is not required but for whatever reason the compiler doesn't take advantage of that freedom – M.M Aug 04 '19 at 22:07
  • @DavidC.Rankin someone's [made a new question](https://stackoverflow.com/questions/57377314/what-prevents-the-usage-of-a-function-argument-as-hidden-pointer) on this topic – M.M Aug 07 '19 at 23:25
  • @M.M Thank You. Both Peter and John provide educational answers on why the use of `restrict` does not cure the problems with the access of `out` and how the optimization sought is not workable. – David C. Rankin Aug 08 '19 at 00:40

2 Answers2

2

Without getting into too much details, the answer is: do what ever is best for the readiblity of the code or the logic of your code (or function in question).

Saying, that it makes no difference, would be not completely honest - there are probably some cases where a non-negligible difference in running times can be measured - but most probably it will not be the case for you.

If you expect, that the function will be inlined - there will be no difference at all in the end: after inlining the optimizer will transform the code to the same binary (I have added an example illustrating this at the end of the post). The inlining is what one should try to achieve in such cases: not only does it save call-overhead but also makes optimizations possible which would not be possible otherway (here is a simple example, where inlining got running time from O(n) to O(1)).

If the code will not be inlined, then the result depends on the used ABI - but most probably, the second version will lead to a slightly more performant binary - yet the advantage is quite neglegible in most of the cases.


Here I'm taking a look at 64bit-Linux (which uses System V AMD64 - ABI). The Cython will translate your example to effectivily following C-code:

struct Vec3{
    double x, y, z;
};

struct Vec3 vadd_v1(struct Vec3* a, struct Vec3* b){
    struct Vec3 out;
    out.x = a->x + b->x;
    out.y = a->y + b->y;
    out.z = a->z + b->z;
    return out;
}

void vadd_v2(struct Vec3* a, struct Vec3* b, struct Vec3* out){
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
}

When compiled with optimization on it will lead to the following assemblers (here a little bit resorted to be able to compare better):

vadd_v1:                     vadd_v2:
;out.x = a->x + b->x;        ;out.x = a->x + b->x;
  movsd   (%rsi), %xmm2        movsd   (%rdi), %xmm0
  addsd   (%rdx), %xmm2        addsd   (%rsi), %xmm0
  movsd   %xmm2, (%rdi)        movsd   %xmm0, (%rdx)

;out.y = a->y + b->y;        ;out.y = a->y + b->y;
  movsd   8(%rsi), %xmm1       movsd   8(%rdi), %xmm0
  addsd   8(%rdx), %xmm1       addsd   8(%rsi), %xmm0
  movsd   %xmm1, 8(%rdi)       movsd   %xmm0, 8(%rdx)

;out.z = a->z + b->z;        ;out.z = a->z + b->z;
  movsd   16(%rsi), %xmm0      movsd   16(%rdi), %xmm0
  addsd   16(%rdx), %xmm0      addsd   16(%rsi), %xmm0
  movsd   %xmm0, 16(%rdi)      movsd   %xmm0, 16(%rdx)

;return                      ;return
  movq    %rdi, %rax
  ret                          ret

An object of type Vec3 is of type MEMORY because it has 3 double-values (the whole algorithm can be looked up in the ABI). Thus, in the first version the caller is responsible to allocate memory for the return-value and pass its address in the "hidden pointer" %rdi

As one can see, the first version has an additional movq %rdi, %rax because the pointer the the returned object must be returned in %rax, as specified by the ABI:

  1. If the type has class MEMORY, then the caller provides space for the return value and passes the address of this storage in %rdi as if it were the first argument to the function. In effect, this address becomes a “hidden” first argument. This storage must not overlap any data visible to the callee through other names than this argument.

    On return %rax will contain the address that has been passed in by the caller in %rdi.

Obviously, the second version is more efficient, but will this one instruction really matter?


However, there are also some examples, where the first version would be more efficient.

If we would use a struct of two doubles rather than a struct of three - the first version would need less instructions: the result is no longer of type MEMORY and will be passed in registers (once again resorted for comparison):

vadd_v1:                       vadd_v2:
 ;out.y = a->y + b->y;           ;out.y = a->y + b->y;
   movsd   (%rdi), %xmm0            movsd   (%rdi), %xmm0
   addsd   (%rsi), %xmm0            addsd   (%rsi), %xmm0
                                    movsd   %xmm0, (%rdx)

 ;out.y = a->y + b->y;           ;out.y = a->y + b->y;
   movsd   8(%rdi), %xmm1           movsd   8(%rdi), %xmm0
   addsd   8(%rsi), %xmm1           addsd   8(%rsi), %xmm0
                                    movsd   %xmm0, 8(%rdx)

 ;return                         ;return
   ret                              ret

However, there might be additional costs, depending on how the functions in question are called. When one returns the value instead of passing a pointer - one should stick with it:

struct Vec3 use_v1(struct Vec3 *in){
        return vadd_v1(in, in);
}

leads to an assembler without copying of returned data:

use_v1:
        pushq   %r12
        movq    %rsi, %rdx
        movq    %rdi, %r12
        call    vadd_v1
        movq    %r12, %rax
        popq    %r12
        ret

While

void use_v2(struct Vec3 *in, struct Vec3 *out){
    *out = vadd_v1(in, in);
}

would lead to

use_v2:
        pushq   %rbx
        movq    %rdi, %rdx
        movq    %rsi, %rbx
        movq    %rdi, %rsi
        subq    $32, %rsp
        movq    %rsp, %rdi
        call    vadd_v1
        movdqu  (%rsp), %xmm0       ;copying
        movq    16(%rsp), %rax      ;copying
        movups  %xmm0, (%rbx)       ;copying
        movq    %rax, 16(%rbx)      ;copying
        addq    $32, %rsp
        popq    %rbx
        ret

where the result of vadd_v1 is created on the stack and is then copied to the pointer out. It must must be done this way, because out cannot be passed as "hidden pointer" to vadd_v1, as the complier doesn't know whether out is used somewhere in vadd_v1 or not (for example as a global variable). There is a SO-question, which look at the above function in more detail.

The advantage of using the pointer-version, unless there is a compiler-bug: you can be pretty sure that no copying is happening.


Here is an example, that when inlined, both versions lead to the same binary:

double sum_v1(struct Vec3* a){
    struct Vec3 d = vadd_v1(a,a);
    return d.x;
}

double sum_v2(struct Vec3* a){
    struct Vec3 d;
    vadd_v2(a, a, &d);
    return d.x;
}

leads when compiled to the same assembler:

sum_v1/sum_v2:
        movsd   (%rdi), %xmm0
        addsd   %xmm0, %xmm0
        ret
ead
  • 32,758
  • 6
  • 90
  • 153
  • 1
    Wow, thanks a lot, super clear answer, and also this compiler explorer, pretty useful. – Jeacom Aug 03 '19 at 00:12
  • Note that in both cases, the compiler was able to do the same as if the vector had been returned by value, so there's a fair point to make that this is the more performant approach. This will matter on projects that spread vector functions across multiple files or libraries and don't have LTO, for instance. – zneak Aug 07 '19 at 23:56
1

@ead already wrote an excellent answer that touched upon the intricacies of the assembly that gets for both of the example functions you showed (never knew about Godbolt, will have to remember that website!). There are some additional aspects that can be considered in choosing between returning a struct (option A) and operating on a passed in struct pointer (option B).

Option A has a potential benefit from a usability standpoint in that you can chain operations together. Imagine if you wanted to add a Vec3 a, Vec3 b, and Vec3 c together and store the result in a Vec3 d. Here is what each option would look like:

#Assume these have some values
cdef Vec3 a, b, c, d

#Option A
d = vadd(&vadd(&a, &b), &c)

#Option B
vadd(&a, &b, &d)
vadd(&d, &c, &d) 

However, much of the visual benefit of this chaining is lost as the function is not a method belonging to an object class, so you do not get syntax like d = a.add(b).add(c).

Another consideration that favors option B is that if you are doing a calculation thousands of times in a loop and have some temporary Vec3 structs in your calculations, you can just create these structs outside of the loop one time and reuse them in the loop, as opposed to recreating temporary structs each time within your vadd function call and copying out the results.

A third issue to consider is how to wrap these cython operations in python. If you use an analogous approach to option A with cdef class PyVec3 objects instead, the challenge now is that you now need to return a python object. Therefore, each time you call this wrapping function, you are forced to interact with the GIL as you create a PyVec3 object on return. Doing such operations in a loop becomes prohibitively expensive. With option B for cdef class objects, simply operating on such objects does not invoke the GIL. Of course, you could structure the python wrapper differently from the underlying cython/C code, but the nice symmetry in the wrapping is lost. For these reasons, I went with option B in the math3d portion of my pyorama library.

CodeSurgeon
  • 2,435
  • 2
  • 15
  • 36
  • 1
    Disagree with the allocation freedom argument. Caller can still choose, e.g. `T x = func();` or `T *px = malloc(sizeof *px); *px = func();`. As noted by the other answer , the ABI allows passing of `&x` or `px` respectively to the function for the function to write the return value directly into. – M.M Aug 07 '19 at 23:26
  • @M.M Fair enough, you are right and I will remove that point. The dereferencing syntax in cython is a bit more awkward, using the `[0]` instead of `*`. Still though, writing the code in `Option A` would lead to an extra copy, based on what is shown for the `use_v2` example in the other answer, right? – CodeSurgeon Aug 07 '19 at 23:49