understanding asm blocks written for gcc

Question

What does the following assembly mean in simple C? (It is meant to be compiled with gcc.)

asm volatile
    (
    "mov.d %0,%4\n\t"
    "L1: bge %2,%3,L2\n\t"
    "gsLQC1 $f2,$f0,0(%1)\n\t"
    "gsLQC1 $f6,$f4,0(%5)\n\t"
    "madd.d %0,%0,$f6,$f2\n\t"
    "madd.d %0,%0,$f4,$f0\n\t"
    "add %1,%1,16\n\t"
    "add %2,%2,2\n\t"
    "add %5,%5,16\n\t"
    "j L1\n\t"
    "L2: nop\n\t" 
    :"=f"(sham)
    :"r"(foo),"r"(bar),"r"(ro),"f"(sham),"r"(bo)
    :"$f0","$f2","$f4","$f6"
    );

After several hours of searching and reading I've come up with the following assembly code in AT&T syntax:

mov.d %xmm0,%xmm1
L1: bge %ebx,%ecx,L2
gsLQC1 $f2,$f0,0(%eax)
gsLQC1 $f6,$f4,0(%esi)
madd.d %xmm0,%xmm0,$f6,$f2
madd.d %xmm0,%xmm0,$f4,$f0
add %eax,%eax,16
add %ebx,%ebx,2
add %esi,%esi,16
jmp L1
L2: nop

I'm in the process of finding a way to run this on Windows and will update when I do figure out a way to do so (after fixing all of the mistakes that I'm sure I've made).

I have very little experience with x86 assembly. That said, I vaguely recognize that this is a loop, but I haven't been able to find out what the instruction gsLQC1 means or what the purpose of the loop would be.

If you have any questions for me, I'll be happy to answer them. If you have any insights, I would love to hear them. Thank you for your time.

EDIT:

The function that contains this block performs a Singular Value Decomposition (SVD), which is a matrix operation.

I'm updating the block below with some comments of my own. The original writer of the assembly did not write these, but I am 80% confident that they are correct, given my research into GCC's asm block notation.

    asm volatile
       (
       "mov.d %0,%4\n\t"
       "L1: bge %2,%3,L2\n\t"
       "gsLQC1 $f2,$f0,0(%1)\n\t"
       "gsLQC1 $f6,$f4,0(%5)\n\t"
       "madd.d %0,%0,$f6,$f2\n\t"
       "madd.d %0,%0,$f4,$f0\n\t"
       "add %1,%1,16\n\t"
       "add %2,%2,2\n\t"
       "add %5,%5,16\n\t"
       "j L1\n\t"
       "L2: nop\n\t" 
       :"=f"(sham) /*Corresponds to %0 in the above code*/
       :"r"(foo) /*Corresponds to %1*/,"r"(bar) /*%2*/,"r"(ro) /*%3*/,"f"(sham) /*%4*/,"r"(bo) /*%5*/
       :"$f0","$f2","$f4","$f6"
       );

I assumed that this was x86, but I was most likely wrong. I believe the above is MIPS64 assembly written for a processor in the Loongson family.

Thank you for the interest in the question. I appreciate your time. Again, if there are any other questions, I would be happy to try my best to answer them.

P.S. The original code can be found here, and the assembly that I am asking about starts on line 189.

That wasn't x86 assembly (mips I guess?), mixing it all up is not going to do any good. — harold, Aug 20 '17 at 01:49
This looks like it may be coming from a BLAS like library. I suggest you try to find one for x86 as the conversion may be difficult. — Kevin, Aug 20 '17 at 02:26
x86 `add %eax,%eax,16` would be `add $16,%eax`, or even `add $16,%%eax` (I don't know gcc inline asm rules from head, so I'm not sure when two %% ahead of reg name are used, but anyway this is enough to see that the asm is not for x86 variant). Was there some `#ifdef` ahead defining several variant of same function? Or maybe it's even split into different files. — Ped7g, Aug 20 '17 at 03:02
What processor is the original supposed to be for? I'm finding some references for gsLQC1, but they only have 2 operands. At a guess, this is loading a 256 byte value into a register pair ($f0, $f2). Also, are there no clues in the surrounding code? Comments? Function names? — David Wohlferd, Aug 20 '17 at 03:02
It doesn't appear that this asm is well-written. `add %1,%1,16` appears to be adding 16 to %1, but %1 is an input parameter, and (by spec) must not be changed. `madd` looks like a matrix add, so I'd be thinking `_mm256_add_ps`. `bge` is going to be "branch if greater than or equal to", and I'd think about duplicating the check at the bottom of the loop rather than unconditionally jumping from the bottom to the top, then (possibly) jumping right back to the bottom. — David Wohlferd, Aug 20 '17 at 03:22
@Ped7g My initial search has only found the use of assembly blocks in one file. This may just be specific to one processor family, but I'm not sure, because there are no `#define`s that I can see. — Jack, Aug 21 '17 at 16:00
@DavidWohlferd I've updated the question in response to your comments, thank you for making them! If I missed something, please let me know. I also appreciate your insights into what specific instructions mean. — Jack, Aug 21 '17 at 16:05
"specific to one processor family" - Assembler by definition tends to be specific to one processor family. MIPS, x86, ARM all use different assembly. "Corresponds to %0" - Yes, this is by definition. See the docs for this form of asm [here](https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html). It is often more readable to use symbolic names (described on that page) instead of numeric offsets like this. "SVD" - I hope you understand how to calculate this yourself. While I can point you toward some of the related x86 instructions, I'm not going to be translating this for you. — David Wohlferd, Aug 22 '17 at 05:05
@DavidWohlferd Thanks for your comments. I don't know the exact revision of the Loongson family this was written for. I would love help finding equivalent x86 instructions. I don't expect you to just fix my problem. I've only come here after hours of searching. I apologize for my unskilled question, and am trying to revise it as requested. And yes, I do know how to calculate a singular value decomposition for a given matrix. — Jack, Aug 22 '17 at 16:02
"don't know the exact revision" It only matters if you need to figure out exactly what `madd` etc. do. Since you "do know how to calculate a SVD," it's probably not important. I'd say skip trying to "translate" this and just re-write the asm part from scratch. "equivalent x86 instructions" Unlike you, I *don't* understand how to calculate SVD (even the code right in front of me). So I can't say for certain what instructions you'll need. But apparently loading floats from memory and performing matrix operations on them is a key component, so I'd start by looking at the x86 AVX instructions. — David Wohlferd, Aug 22 '17 at 22:38
Since I avoid using inline asm whenever I can ([reasons](https://gcc.gnu.org/wiki/DontUseInlineAsm)), I'd be looking at the AVX intrinsics, starting with [_mm256_add_pd](https://software.intel.com/en-us/node/524042), which does matrix adds for 8byte floats. It looks like a function, but maps to a single x86 instruction. Then it's just about loading the values (_mm256_load_pd or _mm256_loadu_pd) and looping. In general, I expect you'll end up with something that looks roughly like my answer below. "my unskilled question" Sorry if I came across as brusque. Hard to be polite in just 600 chara — David Wohlferd, Aug 22 '17 at 22:39

score 1 · Accepted Answer · answered Aug 20 '17 at 04:00

This isn't really an answer, but it doesn't fit in a comment either. Given that you omit several critical pieces of information (what processor the source instructions are for, data types of the parameters, a general sense of what the code is doing, etc), it's hard to come up with a good answer.

In a general sense, I'd be thinking:

#include <immintrin.h>  /* AVX intrinsics: __m256, _mm256_load_ps, _mm256_add_ps */

float messy(const float *foo, int bar, int ro, const float *bo)
{
    float sham = 0;

    while (bar < ro)
    {
       __m256 a = _mm256_load_ps(foo);
       __m256 b = _mm256_load_ps(bo);   /* second array is bo; bar is the loop counter */

       __m256 c = _mm256_add_ps(a, a);
       __m256 d = _mm256_add_ps(b, b);

       foo += 2;
       bar += 2;
       bo += 2;
    }

    return sham;
}

That's not going to be quite right, since (among other things) sham isn't getting set. But it's a place to start. Without details of what madd.d does (which is hard to say without knowing what hardware we're talking about), that's as close as I can get you.
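If the block is MIPS64 for a Loongson part, and `madd.d fd,fr,fs,ft` is the usual MIPS multiply-add (`fd = fr + fs*ft`), then the loop reads like a dot product accumulated into `sham`. Here is a scalar C sketch of that reading. The function name `dot_scalar`, the `double` and `long` parameter types, and the assumption that each `gsLQC1` loads two adjacent doubles (16 bytes) into its two named registers are all mine; none of them is confirmed by the original source.

```c
/* Hypothetical scalar reading of the asm loop (names and types are assumptions).
 * foo, bo : pointers to arrays of double (assumed)
 * bar, ro : loop counter and its upper bound (assumed integer indices)
 * sham    : running sum, passed in and returned (operands %4 and %0)
 */
double dot_scalar(const double *foo, long bar, long ro,
                  double sham, const double *bo)
{
    while (bar < ro) {              /* L1: bge %2,%3,L2   (exit when bar >= ro) */
        /* Two gsLQC1 loads: assumed to fetch foo[0..1] and bo[0..1].          */
        /* Which half lands in which register does not change the pairing,     */
        /* because both loads use the same register order.                     */
        sham += bo[1] * foo[1];     /* madd.d %0,%0,$f6,$f2 */
        sham += bo[0] * foo[0];     /* madd.d %0,%0,$f4,$f0 */
        foo += 2;                   /* add %1,%1,16  (16 bytes = 2 doubles)    */
        bar += 2;                   /* add %2,%2,2                             */
        bo  += 2;                   /* add %5,%5,16                            */
    }
    return sham;
}
```

Called as `dot_scalar(a, 0, n, 0.0, b)` with an even `n`, this would return the sum of `a[i] * b[i]` over the first `n` elements. Like the asm, it reads one element past the end if `ro - bar` is odd.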

Just to emphasize what I said in my comment, the original code does not appear to be well written (modifying read-only parameters, double jumps, NO COMMENTS, etc).
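On the x86 side, assuming the loop really is a double-precision dot product (my reading of the asm, not something the original source confirms), a minimal SSE2 sketch could look like the following. I picked SSE2 rather than AVX because a 128-bit register holds two doubles, which matches the 16 bytes the pointers advance per iteration, and because SSE2 is available on every x86-64 target without extra compiler flags. The name `dot_sse2` and the parameter types are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128d, _mm_loadu_pd, _mm_mul_pd, ... */

/* Hypothetical SSE2 sketch of a double-precision dot product.
 * Assumes foo and bo point to arrays of double and that (ro - bar) is even.
 */
double dot_sse2(const double *foo, long bar, long ro,
                double sham, const double *bo)
{
    __m128d acc = _mm_setzero_pd();          /* two partial sums, one per lane  */

    for (; bar < ro; bar += 2) {
        __m128d a = _mm_loadu_pd(foo);       /* foo[0], foo[1] (unaligned load) */
        __m128d b = _mm_loadu_pd(bo);        /* bo[0],  bo[1]                   */
        acc = _mm_add_pd(acc, _mm_mul_pd(a, b));  /* lane-wise multiply, add    */
        foo += 2;
        bo  += 2;
    }

    double lanes[2];
    _mm_storeu_pd(lanes, acc);               /* spill both lanes to memory      */
    return sham + lanes[0] + lanes[1];       /* fold in the incoming sum        */
}
```

This sums the even and odd elements in separate lanes and combines them at the end, so the rounding can differ slightly from a strictly sequential accumulation. Unlike the original asm, it also leaves its pointer parameters as ordinary local copies, so nothing read-only gets modified behind the compiler's back.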

It's probably not right. From `Intel.com` : `_mm256_load_ps : Moves packed single-precision floating point values from aligned memory location to a destination vector. The corresponding Intel® AVX instruction is VMOVAPS.` VMOVAPS is not visible in the original assembly snippet. — Karim Manaouil, Aug 20 '17 at 06:04
@afr0ck - "VMOVAPS is not visible in the original assembly snippet." That's because the original snippet isn't x86 code. OP is translating it from some other hw platform (he doesn't say what). — David Wohlferd, Aug 20 '17 at 06:05
