Loading a 16-bit (or bigger) immediate with Arm GCC inline assembly

Question

Note: The examples here are simplified for brevity, so they do not justify my intentions. If I were only writing to a memory location exactly as in the example, C would be the best approach. However, I'm doing things where I can't use C, even though in general it would be best to stay in C.

I'm trying to load registers with values, but I'm stuck with 8-bit immediates.

My code:

https://godbolt.org/z/8EE45Gerd

#include <cstdint>

void a(uint32_t value) {
    *(volatile uint32_t *)(0x21014) = value;
}

void b(uint32_t value) {
    asm (
        "push ip                                \n\t"
        "mov ip,       %[gpio_out_addr_high]    \n\t"
        "lsl ip,       ip,                   #8 \n\t"
        "add ip,       %[gpio_out_addr_low]     \n\t"
        "lsl ip,       ip,                   #2 \n\t"
        "str %[value], [ip]                     \n\t"
        "pop ip                                 \n\t"
        : 
        : [gpio_out_addr_low]  "I"((0x21014 >> 2)     & 0xff),
          [gpio_out_addr_high] "I"((0x21014 >> (2+8)) & 0xff),
          [value] "r"(value)
    );
}

// adding -march=ARMv7E-M will not allow 16-bit immediate
// void c(uint32_t value) {
//     asm (
//         "mov ip,       %[gpio_out_addr]    \n\t"
//         "str %[value], [ip]                     \n\t"
//         : 
//         : [gpio_out_addr]  "I"(0x1014),
//           [value] "r"(value)
//     );
// } 


int main() {
    a(20);
    b(20);
    return 0;
}

When I write it in C (see a()), Godbolt shows it compiled to:

a(unsigned char):
        mov     r3, #135168
        str     r0, [r3, #20]
        bx      lr

I think it uses MOV as a pseudo-instruction. To do the same in assembly, I could put the value into some memory location and load it with LDR. I think that's how the C code gets compiled when I use -march=ARMv7E-M (the MOV gets replaced with LDR); however, in many cases this will not be practical for me as I will be doing other things with.

In the case of the 0x21014 address, the lowest 2 bits are zero, so I can treat this 18-bit number as a 16-bit one when I shift it correctly. That's what I'm doing in b(), but I still have to pass it as 8-bit immediates. However, in the Keil documentation I noticed a mention of 16-bit immediates:

https://www.keil.com/support/man/docs/armasm/armasm_dom1359731146992.htm

https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm

In ARMv6T2 and later, both ARM and Thumb instruction sets include:
A MOV instruction that can load any value in the range 0x00000000 to 0x0000FFFF into a register.
A MOVT instruction that can load any value in the range 0x0000 to 0xFFFF into the most significant half of a register, without altering the contents of the least significant half.

I think my Cortex-M4 should be ARMv7E-M, so it should meet this "ARMv6T2 and later" requirement and be able to use 16-bit immediates.
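To make the arithmetic concrete, here is a minimal sketch of the two decompositions discussed above: the 8-bit pieces that b() passes in, and the 16-bit halves that a MOV/MOVT pair would carry. The names (kAddr, lowHalf, highHalf and so on) are made up for this illustration, and everything is checked at compile time, so it does not depend on the target.

```cpp
#include <cstdint>

// Address used in the question's b().
constexpr uint32_t kAddr = 0x21014;

// The two 8-bit pieces that b() passes as "I" immediates.
constexpr uint32_t kLow8  = (kAddr >> 2) & 0xff;        // bits 2..9
constexpr uint32_t kHigh8 = (kAddr >> (2 + 8)) & 0xff;  // bits 10..17

// What b() rebuilds in ip: ((high << 8) + low) << 2.
constexpr uint32_t rebuiltFromBytes() {
    return ((kHigh8 << 8) + kLow8) << 2;
}

// The 16-bit halves that a 16-bit MOV and a MOVT would carry.
constexpr uint16_t lowHalf(uint32_t v)  { return static_cast<uint16_t>(v & 0xffff); }
constexpr uint16_t highHalf(uint32_t v) { return static_cast<uint16_t>(v >> 16); }

// MOVT writes the top half and leaves the bottom half untouched.
constexpr uint32_t rebuiltFromHalves(uint32_t v) {
    return (static_cast<uint32_t>(highHalf(v)) << 16) | lowHalf(v);
}

static_assert(rebuiltFromBytes() == kAddr, "8-bit pieces must rebuild the address");
static_assert(rebuiltFromHalves(kAddr) == kAddr, "16-bit halves must rebuild the address");
```

The low half of 0x21014 is 0x1014, which is the value the commented-out c() tries to pass as a single immediate.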

However, I do not see any such mention in the GCC inline assembly documentation:

https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html

And when I enable the ARMv7E-M arch and uncomment c(), where I use the regular "I" immediate, I get a compilation error:

<source>: In function 'void c(uint8_t)':
<source>:29:6: warning: asm operand 0 probably doesn't match constraints
   29 |     );
      |      ^
<source>:29:6: error: impossible constraint in 'asm'

So I wonder: is there a way to use 16-bit immediates with GCC inline assembly, or am I missing something (that would make my question irrelevant)?

Side question: is it possible to disable these pseudo-instructions in Godbolt? I have seen them used with RISC-V assembly as well, but I would prefer to see the real disassembled machine code, to see exactly which instructions these pseudo/macro assembly instructions resulted in.

You can use the lower case `i` constraint. That will not do range checking but should be fine for your purposes. Also if you need a value in a register you might be able to use that register as input and leave the loading to the compiler. gcc will happily generate a `movw r3, #4116` for you if you use `"r"(0x1014)`. — Jester, May 30 '21 at 00:04
@Jester thank you very much, both of them work well: https://godbolt.org/z/84ME1hnar so I will be able to use whichever is more suitable, I think loading it in C and not forcing too much hand of the compiler might be best for most of my cases. — Anton Krug, May 30 '21 at 00:21
Another nice thing about using an input value and letting the compiler do it is that the compiler knows what value is in the register. If it needs to use it again, it doesn't have to re-load it. If you modify your godbolt link to do `d32(20); d32(20); d32(20);`, you get `movw r3, #4116 ; movs r2, #20 ; str r2, r3 ; str r2, r3 ; str r2, r3`, while if you do the `mov` in the asm, it always has to do the load. — David Wohlferd, May 30 '21 at 01:34
Replying to your comment on your now-deleted answer: future readers who are beginners with inline asm aren't going to realize how it was simplified / which essential part to put back in. GNU C inline asm is hard enough to learn without being misled by unsafe bad examples, so I downvote or edit any such SO answers that aren't safe. It's fully possible to write a small wrapper function that stores a byte, half-word, or word, to an MMIO register correctly with inline asm. If you made the small changes to fix the bugs, I'd remove my downvote and maybe upvote. — Peter Cordes, May 30 '21 at 02:08
Of course, as I said, this seems pointless vs. writing `*(volatile uint8_t*)addr = value` and letting the compiler generate asm. https://gcc.gnu.org/wiki/DontUseInlineAsm — Peter Cordes, May 30 '21 at 02:09
You don't need to push/pop `ip` in your asm, just list `"ip"` as a clobber as part of the asm statement, like `asm("..." : : ... inputs : "ip");`. Also, I'm skeptical that you can't do this in C. You probably can, with at most some inline asm for a special instruction like `asm("dsb sy" ::: "memory")` as a memory barrier if needed. (Or `dsb ish`, depending on what strength of barrier is needed for your MMIO.) — Peter Cordes, May 30 '21 at 03:21
Also note that `lsl ip, ip, #8` is pretty inefficient: ARM immediates are already encoded with a rotate count, so for example you could `mov ip, %[gpio_out_addr_high]<<8`. Any 8-bit value with an even shift-count works as an operand for `mov`, if you really insist on using `mov` inside an asm statement instead of letting the compiler construct a value in a register for you. — Peter Cordes, May 30 '21 at 03:23
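For reference, the rule from the last comment (an 8-bit value rotated by an even amount) can be checked at compile time. This is only a sketch of the classic A32 encoding; Thumb-2 uses a different modified-immediate scheme, and the helper names rotl32 and isArmRotatedImm8 are made up for this illustration. It needs C++14 for the loop inside a constexpr function.

```cpp
#include <cstdint>

// Rotate left by n bits (n in 0..31); this undoes a rotate-right by n.
constexpr uint32_t rotl32(uint32_t v, unsigned n) {
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

// True if v can be written as an 8-bit value rotated right by an even
// amount, i.e. the classic A32 data-processing immediate.
constexpr bool isArmRotatedImm8(uint32_t v) {
    for (unsigned rot = 0; rot < 32; rot += 2) {
        if (rotl32(v, rot) <= 0xff) {
            return true;
        }
    }
    return false;
}

static_assert(isArmRotatedImm8(135168), "0x21000 is 0x21 rotated, so a plain mov can hold it");
static_assert(!isArmRotatedImm8(0x1014), "0x1014 spans more than 8 bits");
static_assert(!isArmRotatedImm8(0x21014), "the full address does not fit either");
```

Under this rule the `mov r3, #135168` in the compiler output for a() is an ordinary instruction, while 0x1014 needs either a 16-bit move or a load.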

Anton Krug · Accepted Answer · 2021-05-30T13:42:12.073

In the comments, @Jester recommended either using the "i" constraint to pass larger immediates, or using a real C variable initialized with the desired value and letting the inline assembly take it as a register input. The second sounds like the best solution: the less time spent in inline assembly the better. People wanting better performance often underestimate how powerful the C/C++ toolchain can be at optimizing when given correct code, and for many the answer is rewriting the C/C++ code rather than redoing everything in assembly. @Peter Cordes suggested not using inline assembly at all, and I concur. However, in this case the exact timing of some instructions was critical, and I couldn't risk the toolchain optimizing them slightly differently.
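A minimal sketch of the two options from the comments, assuming GCC or Clang targeting 32-bit Arm with ARMv6T2 or later (so the assembler can use a 16-bit move). The function names and the ADDR template parameter are hypothetical. The host fallback is only there so the sketch builds off-target; it is not part of either technique.

```cpp
#include <cstdint>

// Option 1: hand the address over in a register ("r") and let the
// compiler decide how to materialise the constant (movw, ldr, ...).
inline void storeWordReg(volatile uint32_t *addr, uint32_t value) {
#if defined(__arm__)
    asm volatile(
        "str %[value], [%[addr]] \n\t"
        :
        : [addr] "r"(addr), [value] "r"(value)
        : "memory");
#else
    *addr = value;  // host fallback, assumption for off-target builds only
#endif
}

#if defined(__arm__)
// Option 2: pass the constant with the lower-case "i" constraint, which
// does no range checking, and load it inside the asm.
template <uint32_t ADDR>
inline void storeWordImm(uint32_t value) {
    static_assert(ADDR <= 0xffff, "a single 16-bit move only reaches 0..0xFFFF");
    asm volatile(
        "mov ip, %[addr]    \n\t"
        "str %[value], [ip] \n\t"
        :
        : [addr] "i"(ADDR), [value] "r"(value)
        : "ip", "memory");  // ip is overwritten, so declare it instead of push/pop
}
#endif
```

On the target, the question's c() would then be roughly `storeWordImm<0x1014>(value)`. With option 1 the compiler also knows which value sits in the register and can reuse it across calls, as noted in the comments.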

Bit-banging protocols is not ideal, and in most cases the answer is to avoid it. However, in my case it's not that simple, and other approaches didn't work:

SPI couldn't be used to stream the data, as I needed to drive more signals and support arbitrary lengths, while my HW supported only 8-bit/16-bit transfers.
I tried DMA2GPIO and had issues with jitter.
I tried an IRQ handler, which has too much overhead, and my performance dropped (as you can see below there are only 2 nops, so there is not much room to do anything in the free time).
I tried pre-baking a stream of bits (including the timing); however, for 1 byte of real data I ended up storing 64 bytes of stream data, and overall reading that much from memory was much slower.
Pre-baking functions for each write value (with a lookup table of functions, one per value) worked very well, actually too fast: the toolchain now had compile-time known values and was able to optimize very well, so my TCK was above 40MHz. The problem was that I had to add a lot of delays to slow it down to the desired speed (8MHz), and this had to be done for each input value. When the length was 8 bits or less it was fine, but for a 32-bit length it was not possible to fit into the flash memory (2^32 = 4294967296 functions), and splitting a single 32-bit access into four 8-bit accesses introduced a lot of jitter on the TCK signal.
Implementing this peripheral in FPGA fabric would put me in control of everything, and typically this is the correct answer, but I wanted to try implementing it on a device that has no fabric.

Long story short: bit-banging is bad and there are usually better ways around it, and using inline assembly unnecessarily might produce worse results without you knowing, but in my case I needed it. In my previous code I was trying to focus on a simple question about the immediates and not go off on tangents or into an X-Y problem discussion.

So, back to the topic of 'passing bigger immediates to the assembly', here is an implementation of a much more real-world example:

https://godbolt.org/z/5vbb7PPP5

#include <cstdint>

const uint8_t TCK = 2;
const uint8_t TMS = 3;
const uint8_t TDI = 4;
const uint8_t TDO = 5;

template<uint8_t number>
constexpr uint8_t powerOfTwo() {
    static_assert(number <8, "Output would overflow, the JTAG pins are close to base of the register and you shouldn't need PIN8 or above anyway");
    int ret = 1;
    for (int i=0; i<number; i++) {
        ret *= 2;
    }
    return ret;
}

template<uint8_t WHAT_SIGNAL>
__attribute__((optimize("-Ofast")))
uint32_t shiftAsm(const uint32_t length, uint32_t write_value) {
    uint32_t addressWrite = 0x40021014; // ODR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)
    uint32_t addressRead  = 0x40021010; // IDR register of GPIO port E (normally not hardcoded, but just for godbolt example it's like this)

    uint32_t count     = 0;
    uint32_t shift_out = 0;
    uint32_t shift_in  = 0;
    uint32_t ret_value = 0;

    asm volatile (
    "cpsid if                                                  \n\t"  // Disable IRQ
    "repeatForEachBit%=:                                       \n\t"

    // Low part of the TCK
    "and.w %[shift_out],   %[write_value],    #1               \n\t"  // shift_out = write_value & 1
    "lsls  %[shift_out],   %[shift_out],      %[write_shift]   \n\t"  // shift_out = shift_out << pin_shift
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out

    // On the first cycle this is redundant, as it processes the shift_in from the previous iteration.
    // First iteration is safe to do extraneously as it's just doing zeros
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Prepare things that are needed toward the end of the loop, but can be done now
    "orr.w %[shift_out],   %[shift_out],      %[clock_mask]    \n\t"  // shift_out = shift_out | (1 << TCK)
    "lsr   %[write_value], %[write_value],    #1               \n\t"  // write_value = write_value >> 1
    "adds  %[count],       #1                                  \n\t"  // count++
    "cmp   %[count],       %[length]                           \n\t"  // if (count != length) then ....

    // High part of the TCK + sample
    "str   %[shift_out],   [%[gpio_out_addr]]                  \n\t"  // GPIO = shift_out
    "nop                                                       \n\t"
    "nop                                                       \n\t"
    "ldr   %[shift_in],    [%[gpio_in_addr]]                   \n\t"  // shift_in = GPIO
    "bne.n repeatForEachBit%=                                  \n\t"  // if (count != length) then  repeatForEachBit

    "cpsie if                                                  \n\t"  // Enable IRQ - the critical part finished

    // Process the last shift_in, which would normally be handled in the next iteration of the loop
    "lsr   %[shift_in],    %[shift_in],       %[read_shift]    \n\t"  // shift_in = shift_in >> TDO
    "and.w %[shift_in],    %[shift_in],       #1               \n\t"  // shift_in = shift_in & 1
    "lsl   %[ret_value],   #1                                  \n\t"  // ret = ret << 1
    "orr.w %[ret_value],   %[ret_value],      %[shift_in]      \n\t"  // ret = ret | shift_in

    // Outputs (read-write: all of these are modified inside the asm)
    : [ret_value]       "+r"(ret_value),
      [count]           "+r"(count),
      [shift_out]       "+r"(shift_out),
      [shift_in]        "+r"(shift_in),
      [write_value]     "+r"(write_value)   // shifted right every iteration, so it cannot be a plain input

    // Inputs
    : [gpio_out_addr]   "r"(addressWrite),
      [gpio_in_addr]    "r"(addressRead),
      [length]          "r"(length),
      [write_shift]     "M"(WHAT_SIGNAL),
      [read_shift]      "M"(TDO),
      [clock_mask]      "I"(powerOfTwo<TCK>())

    // Clobbers
    : "cc", "memory"                        // lsls/adds/cmp change the condition flags
    );

    return ret_value;
}

int main() {
    shiftAsm<TMS>(7,  0xff);                  // reset the target TAP controller
    shiftAsm<TMS>(3,  0x12);                  // go to some arbitrary TAP state
    shiftAsm<TDI>(32, 0xdeadbeef);            // write to target

    auto ret = shiftAsm<TDI>(16, 0x0000);     // read from the target

    return 0;
}
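For readers who want to follow the logic without the timing details, here is a plain C++ model of what the asm loop computes. It is a sketch only: it is not cycle-accurate, the GPIO access is replaced by two callables (writeGpio and readGpio, both hypothetical), and it assumes length >= 1, as the asm loop does. The names shiftModel and loopbackExample are made up for this illustration.

```cpp
#include <cstdint>

// Model of the bit loop: present one data bit with TCK low, raise TCK,
// sample the input pin, and collect the sampled bits (first sample ends
// up in the highest position of the result).
template <typename WriteFn, typename ReadFn>
uint32_t shiftModel(uint32_t length, uint32_t writeValue,
                    uint8_t dataPin, uint8_t clockPin, uint8_t readPin,
                    WriteFn writeGpio, ReadFn readGpio) {
    uint32_t ret = 0;
    for (uint32_t count = 0; count < length; ++count) {
        const uint32_t out = (writeValue & 1u) << dataPin;
        writeGpio(out);                      // low part of TCK, data bit set up
        writeValue >>= 1;                    // next bit, LSB first
        writeGpio(out | (1u << clockPin));   // high part of TCK
        const uint32_t in = readGpio();      // sample while TCK is high
        ret = (ret << 1) | ((in >> readPin) & 1u);
    }
    return ret;
}

// Usage sketch: loop the data pin (4) back to the read pin (5) and shift
// one byte through, with the clock on pin 2.
inline uint32_t loopbackExample() {
    uint32_t last = 0;
    auto wr = [&](uint32_t v) { last = v; };
    auto rd = [&]() { return ((last >> 4) & 1u) << 5; };
    return shiftModel(8, 0xB1, /*dataPin=*/4, /*clockPin=*/2, /*readPin=*/5, wr, rd);
}
```

Because bits go out LSB first and come back in MSB first, the loopback returns the bit-reversed byte (0xB1 becomes 0x8D). The asm version does the same work but defers processing each sample to the next iteration, so the instructions between the two stores can be used to balance the duty cycle.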

As @David Wohlferd commented, keeping less in the assembly gives the toolchain more chances to optimize the 'loading of addresses into registers'. When the function is inlined, it shouldn't need to load the addresses again (they are loaded only once even though there are multiple invocations of reads/writes). Here it is with inlining enabled:

https://godbolt.org/z/K8GYYqrbq

And the question: was it worth it? I think yes. My TCK is spot on at 8MHz and my duty cycle is close to 50%, and I have more confidence that the duty cycle will stay as it is. The sampling is done exactly when I expect it, and I don't have to worry about it getting optimized differently with different toolchain settings.

*trying to focus on a simple questions about the imediates and not go into the tangents, but looks like that is not desired here.* - That wasn't the reason for my downvote on your previous answer, it was that your previous answer was actually broken, e.g. writing a register without telling the compiler. Simple examples are fine and actually good for basic questions like how to use an immediate input operand at all, as long as the answer is actually a fully safe use of inline asm. — Peter Cordes, May 30 '21 at 05:55
When it's something you shouldn't actually do if your real use-case is that simple, one should mention that in the question and answer, for the benefit of future readers who are stuck in an X-Y problem where they think this is what they want, but it actually isn't. But that doesn't deserve a downvote, the reason your other answer was downvoted was for broken code that compiles to `str reg, reg` instead of `str reg, [reg]`, and lack of a clobber declaration, so it didn't achieve the minimal goal it aimed for. (I was expecting you to fix that answer, then I'd have reversed my vote.) — Peter Cordes, May 30 '21 at 05:58
This larger motivating example here makes sense as a reason for keeping everything in asm, BTW, and nicely shows that letting the compiler construct constants in registers for you is usually good. `volatile uint32_t addressWrite` makes no sense, though: the address value itself doesn't need to be stored and reloaded to memory, only the pointed-to MMIO register. You don't have other threads or interrupts modifying that C variable, so it shouldn't be `volatile`. — Peter Cordes, May 30 '21 at 06:02
If it was a *pointer*, you could tell the C compiler it was a pointer-to-volatile like `volatile uint32_t *aw = ...;`. Then it would even make sense to use two memory-output operands like `"=m"(*addressWrite)` and let the compiler pick an addressing mode, instead of manually forcing `[ %[reg] ]`. It might only put one constant in a reg and use `[r1]` for one and `[r1 + 4]` for the other one. — Peter Cordes, May 30 '21 at 06:04
Yes it was pointer. And about the X-Y problems, yes a lot of people request ridiculous solutions for very simple problems, however the question how to use bigger immediate I think is valid (and do not understand why even the question is voted to get closed for being "Seeking recommendations for books, tools, software libraries, and more"). And it feels to me as normal question no matter what my current use case is. — Anton Krug, May 30 '21 at 06:07
What's your point? I didn't say you shouldn't answer a simple question, and there's nothing wrong with simple direct answers to them as long as they're *correct*. I didn't downvote the question or vote to close it; I was purely talking about the votes on your previous *answer*. The "opinion based" or "seeking resources / books" close-votes are nonsense, cast by people who are wrong, don't worry about them. — Peter Cordes, May 30 '21 at 06:09
"Of course, as I said, this seems pointless vs. writing *(volatile uint8_t*)addr = value and letting the compiler generate asm. gcc.gnu.org/wiki/DontUseInlineAsm" Yet it looked as pointless question? — Anton Krug, May 30 '21 at 06:11
No, I didn't say it was a pointless question. I said the example in your question was a pointless use of inline asm, implying that *if* your real use-case was like that, you shouldn't use inline asm at all. (I hadn't considered the timing aspect of using inline asm since I don't normally do embedded stuff.) That's why I commented with stuff about the actual inline asm first, and only added that as a note at the end. Don't lump me in with people that voted to close your question. — Peter Cordes, May 30 '21 at 06:14
And since your answer starts off mentioning my comments, you're misinterpreting what I said. I didn't mean that examples need to be full real-life complicated use-cases. I meant that when you write a simple example, it's critically important not to leave out any pieces that make that simple example safe. e.g. `asm ( "mov ip, %0\n" "str %0, [ip]" ::"i"(0x1014), "r"(value) :"ip");` would be ok as a way to do a 4-byte store, but without the `"ip"` clobber is very dangerous for future readers that copy it. *That's* what I meant by not simplifying away parts that are essential. — Peter Cordes, May 30 '21 at 06:20
So the problem isn't minimalist examples in answers, it's unsafe answers. — Peter Cordes, May 30 '21 at 06:22
@PeterCordes thank you, was googling it as I'm getting this often wrong and managed to do the mistake anyway — Anton Krug, May 30 '21 at 12:59
