
This isn't a trivial question.
NOTE: I don't need opinions or advice to use pure asm. I actually need to get done what I'm describing: inline asm without the sign/zero-extend opcode when the result is assigned to a short int.

I'm dealing with a library that abuses 16-bit shorts in many functions, and I'm optimizing it. I need to add a few optimized functions with inline asm. The problem is that in many places the result of the function is assigned to a short int, so the compiler generates a uxth or sxth ARM opcode.

My goal is to avoid that problem and make sure this useless opcode isn't generated. First of all, I need to define my optimized function to return short int. This way, whether the result is assigned to an int or to a short int, there is no extra opcode to convert it.

The problem is that I have no clue how to skip the int->short conversion that the compiler generates inside my own function.
A dumb cast like *(short*)(void*)&value doesn't work. The compiler either starts messing with the stack, making the problem even worse, or it still uses that same sxth to sign-extend the result.

I compile with multiple compilers, and I was able to resolve it for ARM's armcc compiler, but I can't get it done with GCC (I compile with 4.4.3 or 4.6.3). With armcc I use a short type inside the inline asm statement. With GCC, even if I use short, the compiler for some reason still believes that sign extension is required.

Here's a simple code snippet that I can't get to work with GCC. Any advice on how to get it to work? For this simple example I'll use the clz instruction:

Sample file test.c:

static __inline short CLZ(int n)
{
    short ret;
#ifdef __GNUC__
    __asm__("clz %0, %1" : "=r"(ret) : "r"(n));
#else
    __asm { clz ret, n; }
#endif
    return ret;
}

//test function
short test_clz(int n)
{
    return CLZ(n);
}



Here's the expected result that I get with armcc -c -O3:

test_clz:
    CLZ      r0,r0
    BX       lr

Here's the unacceptable result that GCC -c -O3 gives me:

test_clz:
    clz r0, r0
    sxth    r0, r0
    bx  lr

Note also that if I rewrite CLZ with the internal variable declared as int ret; instead of short ret;, then armcc generates the same result as GCC.
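To make the complaint concrete, here is a minimal sketch of what the extra sxth computes, written as plain C. The helper name sxth_model is hypothetical and only models the instruction; it is not part of the library or of either compiler. It keeps the low 16 bits of a 32-bit register value and sign-extends from bit 15:

```c
#include <stdint.h>

/* Hypothetical helper: models the effect of SXTH on a 32-bit register value. */
static int32_t sxth_model(int32_t reg)
{
    /* Keep only the low 16 bits of the register. */
    uint32_t low = (uint32_t)reg & 0xFFFFu;

    /* If bit 15 is set, the 16-bit value is negative: subtract 2^16. */
    return (low & 0x8000u) ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

Since clz can only produce values from 0 to 32, the model returns its input unchanged for every possible CLZ result, which is why the instruction is redundant here. The compiler does not look inside the asm string, so it has no way to know that range.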

Quick line to get the asm output with gcc or armcc:
gcc -O3 -c test.c -o test.o && objdump -d test.o > test.s
armcc -O3 --arm --asm -c test.c

Pavel P
  • Why don't you skip the inline assembly and just write your optimized bit as an entire function written in assembly? Your problem seems to come from the mixing of your C function and inline asm. But why write a C function that just contains a bunch of asm inside? – TJD Jun 03 '12 at 04:41
  • Not an option. I rewrote the functions that really needed to be fully written in asm. To do it properly I would probably need to go over the entire code and use ints instead of shorts, but that task alone could take me days given the amount of code that I'd need to update, plus testing. – Pavel P Jun 03 '12 at 04:46

2 Answers


Compilers change. With gcc in particular, the tricks you figure out today won't work tomorrow, or didn't work yesterday. And they won't work consistently across compilers (armcc, clang, etc.).

1) Remove the shorts, replace them with ints, and just get it over with. It is an option, and it is the least painful solution.

2) If you want specific asm, write the specific asm; don't mess around. Also an option.

While it is very possible to write code that consistently compiles better than other code, you can't always get exactly the code sequences you want, not consistently. You are hurting yourself in the long run, even with the write-your-own-asm solution. The solution you are actually looking for is to go through the code and replace the shorts with ints; that is going to produce code that will consistently compile better than having the shorts there. It will take less time overall and won't have to be rewritten every handful of months as the compilers change.

To control this completely, once and for all, compile to asm (or disassemble), remove the offending instructions, and leave the function in asm. That is fast and easy, and it gives you what you want by removing this overhead; it just leaves something that is not very maintainable. Actually, since you have armcc doing what you want, compile to asm with armcc, patch the output up for the stupidity of GNU assembler habits, and use that as the one solution (it is possible to write asm that assembles under both the ARM tools and GNU, at least it was in the ARM ADS days; I didn't have much RVCT time before I lost access to the tools).

There are a number of ways to get the exact example you have provided to give the exact results you are after, but I seriously doubt that is what you are after; otherwise you would have written the two lines of asm and been done. My guess is you are trying to inline something (bigger than CLZ) in a function while still calling it a short, when calling it an int would give you what you want without the inline asm. (I still can't see how inline asm wherever there is a short takes less time to implement and test than changing the variable declaration: much less typing, the same amount of code to read and test.)

So here is your reality:

1) Live with shorts and their side effects.

2) Change them to ints.

Taking days or weeks or months to do something is not a big deal. Most of the time it takes days, weeks, or months to avoid doing something, and then you have to do it anyway, so now you have 2x days, 2x weeks, 2x months... You have to, or should, test it no matter which solution you pick, because you are changing the code, so testing is not a varying factor in this decision. Hacking at the compiler with inline asm is your highest risk and should result in the most testing, if testing does vary in the time equation: a handful of gcc versions required, plus a retest every 6 months.

Normally the asm solution would need retesting when the ABI changes, maybe every 10 years, and the just-fix-the-C solution maybe every 20 years, when we go from 64 bit to 128 bit. But the 32 to 64 bit transition is still going on, and we have not started the ARM 32 to 64 bit transition/mixture (32-bit ARM processors won't be abandoned for all 64 bit; both will remain). The backends are going to be a mess for a while, and I wouldn't play games with them right now. Writing clean, portable C, where you don't rely on the size of int in the code (assume/require 32 bits minimum, but make sure it is 64-bit clean), is your cheapest solution.
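As a rough illustration of the "change them to ints" option, here is a minimal sketch that drops both the short and the inline asm. It assumes GCC (or a compatible compiler) and a 32-bit unsigned int, and the names clz_int and test_clz_int are hypothetical. It relies on the GCC builtin __builtin_clz, whose result is undefined for a zero argument, so zero is handled explicitly to match the ARM clz instruction, which returns 32 in that case:

```c
#include <limits.h>

/* Hypothetical int-returning replacement for the short-returning CLZ. */
static inline int clz_int(unsigned int n)
{
    /* __builtin_clz(0) is undefined, so return the bit width for zero,
       which is what the ARM clz instruction yields (32 for 32-bit int). */
    return n ? __builtin_clz(n) : (int)(sizeof n * CHAR_BIT);
}

/* Hypothetical test function, mirroring test_clz from the question. */
int test_clz_int(int n)
{
    return clz_int((unsigned int)n);
}
```

Because the value stays an int from start to finish, there is no narrowing to short at the return and so nothing for the compiler to extend. Which instructions a given compiler version actually emits should still be checked in the disassembly.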

old_timer
  • Hi dwelch, thanks for a very long and descriptive reply. I clearly understand all your points, and I've done all that where I really needed it. At this point all I want is to get GCC not to generate SXTH in places where I know it's not needed (basically, I don't need to hear about the options; I probably know all of them). All I want is to get that proper behavior with my compiler only, and then I'm done (even if a future compiler adds that SXTH again, it's ok). – Pavel P Jun 03 '12 at 14:50
  • In short, I expect some kind of gcc specific "hack" to cast types or some other inline asm modifiers maybe to get that behavior. I tried different casts and modifiers and nothing helped. – Pavel P Jun 03 '12 at 15:03

If it's speed you're after, and not code size, you can try this:

static __inline short CLZ(int n)
{
    short ret;
#ifdef __GNUC__
    __asm__("clz %0, %1\n"
            "bx lr"
            : "=r"(ret) : "r"(n));
#else
    __asm { clz ret, n; }
#endif
    return ret;
}

Updated to add: It seems to me that the gcc compiler is doing the right thing here. In C (as opposed to C++), there is no such thing as a function that returns a short -- it always gets automatically converted to int. So you have no option but to fool the compiler. What happens if you just change the filename to test.cpp?

TonyK
  • Tony, I'll try, but I'm not sure whether that inline bx lr might negatively affect the optimizer, or if it's possible at all. Obviously, if it *only* gets the example working, it's not a solution: that short version of CLZ is used all over the place, and wherever it's used SXTH shouldn't be generated. THERE IS such a thing as returning short, even though it uses the same 32-bit register. But gcc thinks that it needs to sign-extend, and I know that this is not necessary. Basically, SXTH makes sure that the top 16 bits are all zeros or all ones, but I know that the input to that function is already in the proper layout. – Pavel P Jun 03 '12 at 14:55
  • Also, I actually use C++ compilation to get some simple features like function overloading etc. But in this example it makes no difference whether it is C or C++ compilation. – Pavel P Jun 03 '12 at 14:57
  • I tried that extra bx lr; it seems that gcc doesn't do opcode-level optimization: after bx lr it adds the same pair again: SXTH + one more bx lr. In short, if test_clz had this body: CLZ(CLZ(a)), then the result would be bad: when inlined, GCC still adds the bx lr, which would inadvertently jump out of the calling function. Perhaps that could work with llvm/clang, if it does some opcode-level optimization and might detect that either the sxth isn't necessary or that the return is done from the inline asm itself. – Pavel P Jun 03 '12 at 15:08
  • @Pavel: after my "bx lr", any following instructions are NOT EXECUTED because the function has already returned. Why doesn't this solve your problem? As for negatively affecting the optimiser, the inline gcc assembler doesn't analyse the inline instructions at all, so you can forget that. But I'm surprised that the behaviour persists in `C++` -- are you sure? – TonyK Jun 03 '12 at 15:37
  • Tony, this is not going to work, isn't that clear? :) This bx lr sits in *inline* functions, which means that any function that calls it will most likely crash the app. My goal isn't to get that simple test working; the goal is to get CLZ itself working, and test_clz is used to verify that. Your "solution" questionably makes test_clz "work", but it will corrupt any program that uses that CLZ. – Pavel P Jun 03 '12 at 17:36
  • @Pavel, as far as I know, functions that contain inline asm blocks are never inlined. But check it out, and let me know if I'm wrong! – TonyK Jun 03 '12 at 20:29
  • Of course they are inlined. Even in this example CLZ is inlined into test_clz. – Pavel P Jun 04 '12 at 06:45