AVR assembly - bit number to mask

Question

In my ATtiny84a AVR assembly program I end up with a bit number between 0 and 7 in a register, let's say r16. Now I need to create a mask with that bit set. To make it more complicated, the timing of the operation must be the same regardless of which bit is set.

For example if r16 = 5 the resulting mask will be 0x20 (bit 5 set).

So far I have shifted a bit into position with LSL, using r16 (the bit number) as a loop counter, and then, to keep the timing exact regardless of the bit number, run a dummy loop of NOPs 8-r16 times.

The assembly instruction SBR sets bit(s) in a register from a mask so it can't be used. The assembly instruction SBI sets a bit in an I/O register from bit number, but it is a constant, not a register (I could have used an I/O register as a temp register).

The mask is then used to clear a bit in a memory location, so if there is another solution to do that from a bit number in a register, then it's fine too.

I have another solution to try out (shift based, with carry), but I was hoping that someone has a more elegant solution than loops and shifts.

@Michael sub/ror approach below also fails gracefully on out of bounds input index, whereas lookup would probably not without additional checks or more flash. — bigjosh, Jun 25 '20 at 18:23

bigjosh · Answer 1 · 2020-06-28T21:05:12.157

I think your hunch with shifts and carries is an elegant solution. You'd basically decrement the index register, set the carry when the index was zero before the decrement, and then shift the carry into the output register.

You can use subtract to do the decrement, which will automatically set the carry bit when the index steps down from 0 (the subtraction borrows).

You can use a rotate right instead of the shift, since this lets you move the bits in the right direction to match the decrement.

Then you can get really tricky and use a sentinel bit in the output as a pseudo loop counter to terminate after 8 loop iterations.

So something like...

; Assume r16 is the index 0-7 of the bit to set in the output byte
; Assume r17 is the output byte
; r17 output will be 0 if r16 input is out of bounds
; r16 is clobbered in the process (ends up as r16-8)

ldi r17, 0b10000000 ; Sort of a pseudo-counter. When we see this
                    ; marker bit fall off the right end
                    ; then we know we did 8 bits of rotations

loop:
subi r16,1  ; decrement index by 1, carry will be set if r16 was 0
ror r17     ; rotate output right, carry into the high bit
brcc loop   ; continue until we see our marker bit come out

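I count 4 words (8 bytes) of storage and 32 cycles for this operation on all AVRs (1 for the ldi, 4 for each of the 7 iterations where the brcc is taken, and 3 for the last one), so I think this is the winner on size, surprisingly (even to me!) beating out the strong field of lookup-table-based entries.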

It also features sensible handling of out-of-bounds conditions, and no registers are changed besides the input and output. The repetitive rotates will also help prevent carbon deposit buildup in the ALU shifter gates.
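The behaviour of the loop above can be checked off-target with a small register-level model. This is only a sketch in Python, not AVR code: the function name `bit_to_mask_ror` and the returned tuple are made up for illustration, and the model tracks just the two registers and the carry flag.

```python
def bit_to_mask_ror(index):
    """Model of the ldi / subi / ror / brcc loop on 8-bit registers.

    Returns (r17, r16, iterations) after the loop has finished.
    """
    r16 = index & 0xFF
    r17 = 0x80                      # ldi r17, 0b10000000 (sentinel bit)
    iterations = 0
    while True:
        carry = 1 if r16 == 0 else 0        # subi r16,1 borrows only from 0
        r16 = (r16 - 1) & 0xFF
        carry_out = r17 & 1                 # ror r17: bit 0 goes to carry,
        r17 = (carry << 7) | (r17 >> 1)     # old carry enters at bit 7
        iterations += 1
        if carry_out:                       # brcc falls through once C is set
            break
    return r17, r16, iterations


if __name__ == "__main__":
    for n in range(10):
        mask, leftover, count = bit_to_mask_ror(n)
        print(n, format(mask, "08b"), leftover, count)
```

In this model the loop always runs 8 times, whatever the input, which is the property the constant-timing requirement relies on.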

Many thanks to @ReAl and @PeterCordes, whose guidance and inspiration made this code possible! :)

answered Jun 25 '20 at 17:42


Would it be worth considering a computed `ijmp` into a sequence of *just* `ror` or `rol` instructions? If you start with carry set, the first one will shift in a `1`, and the rest will just shift it. Worst-case time is probably similar, and that's probably what matters for most applications, though. – Peter Cordes Jun 25 '20 at 20:01
@PeterCordes Interesting idea! Unfortunately I think this approach would tie with the lookup table for space but always lose to it for time, since it requires all the same steps to set up and restore the Z register, and the IJMP replaces the LPM. Is there any way to jump into a table on AVR without the Z reg? – bigjosh Jun 26 '20 at 00:41
I don't know AVR very well, just what I happen to see in flipping through an instruction table like [this](http://atmel-studio-doc.s3-website-us-east-1.amazonaws.com/webhelp/GUID-0B644D8F-67E7-49E6-82C9-1B2B9ABE6A0D-en-US-1/index.html). Good point that if you're set for an `ijmp` you might as well load data from a table instead. You don't necessarily have to save/restore Z (or X or Y), though; just clobber it and let the following code set it to something if it wants it. – Peter Cordes Jun 26 '20 at 00:48
The first two commands should be `clr r17 $ inc r17` – ReAl Jun 26 '20 at 16:13
@ReAI Fixed the `clr` to `r17`, thx! I think the `inc` should be `r16` since this is meant to increment the input so we can then decrement it to set the carry flag. Make sense? – bigjosh Jun 27 '20 at 18:22
Can you use regular `subi` for the first one instead of `clc` / `sbci r16,1`? Or perhaps `cmp r16, 1` to set Carry if the input is 0 without modifying it? – Peter Cordes Jun 27 '20 at 18:36
This is a very creative solution for setting a bit, I like it! – Max Kielland Jun 28 '20 at 11:54
Ah yes, @PeterCordes that is much better! I had assumed that `subi` did not update the carry bit but the datasheets say otherwise. This also lets me change all the other `sbci`s and thereby get rid of the `clr r17` at the top since now it does not matter what gets rotated into carry by the `ror`s. Thanks! – bigjosh Jun 28 '20 at 20:24
On most ISAs that have separate add / adc instructions, a BigInt operation looks like `add / adc / adc / ...` or `sub / sbb / sbb / ...`, with `add/sub` writing flags normally but not reading flags as an input. That's also how `sub` / `cmp` set flags for conditional branches to test; e.g. `brlo` is just testing if the carry flag is set. Hopefully that helps you grok why it's designed this way. – Peter Cordes Jun 28 '20 at 20:42
@PeterCordes Totally get it - just when I saw the instruction was called "subtract without carry" I foolishly read that to mean that it did not use the carry bit at all! Always check the datasheets! Anyway, this exchange ultimately led to a much better algorithm, so thanks again! :) – bigjosh Jun 28 '20 at 21:07
Neat idea for a size-optimized loop, nice change to replace the unrolled version that turned out to have no advantages over ReAl's clever sbrc / swap. But yeah, I thought "subtract without carry" sounded like a funny description. I had been playing around with AVR GCC on https://godbolt.org/ to see how GCC did things like incrementing an `int` or pointer, so I wasn't misled. I guess working with wider integers is so common on an 8-bit machine that it's the exception, not the rule? e.g. rotates are rotate-through-carry (unlike on x86 where `rcr` is rotate-through-carry and `ror` isn't.) – Peter Cordes Jun 28 '20 at 21:21

score 3 · Answer 2 · answered Jun 26 '20 at 16:18

9 words, 9 cycles

ldi r17, 1

; 4
sbrc    r16, 2  ; if n >= 4
swap    r17     ; 00000001 -> 00010000, effectively shift left by 4

; 2
sbrc    r16, 1
lsl     r17
sbrc    r16, 1
lsl     r17

; 1
sbrc    r16, 0
lsl     r17
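Both the result and the "9 cycles" figure of the sequence above can be checked with a small model. This is a sketch in Python rather than AVR code; it assumes the usual AVR timing where `sbrc` costs 1 cycle when it does not skip and 2 cycles when it skips a one-word instruction, and the helper names (`swap`, `lsl`, `skip_chain`) are made up for illustration.

```python
def swap(x):
    """swap: exchange the two nibbles of an 8-bit value."""
    return ((x << 4) | (x >> 4)) & 0xFF


def lsl(x):
    """lsl: logical shift left by one, truncated to 8 bits."""
    return (x << 1) & 0xFF


def skip_chain(n):
    """Model of ldi + four sbrc/op pairs; returns (r17, cycles)."""
    r17, cycles = 1, 1                              # ldi r17, 1
    pairs = [(2, swap), (1, lsl), (1, lsl), (0, lsl)]
    for bit, op in pairs:
        if (n >> bit) & 1:
            r17 = op(r17)
            cycles += 1 + 1     # sbrc does not skip (1) + the operation (1)
        else:
            cycles += 2         # sbrc skips a one-word instruction (2)
    return r17, cycles


if __name__ == "__main__":
    for n in range(8):
        print(n, skip_chain(n))
```

Each `sbrc`/operation pair costs 2 cycles on either path under that assumption, which is why the total does not depend on the input.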

ReAl

That's lovely! I've never seen `SWAP` used productively before! – bigjosh Jun 27 '20 at 16:51
If you're looking for performance and space savings I would say this is good, but it's not very intuitive as to what the goal is. – LarryBud Feb 20 '22 at 16:38

score 2 · Answer 3 · answered Jun 27 '20 at 08:30

Since your output has only 8 variants, you can use a lookup table. It performs exactly the same operations whatever the input is, and so has exactly the same execution time.

  ldi r30, low(shl_lookup_table * 2) // Load the table address into register Z
  ldi r31, high(shl_lookup_table * 2)

  clr r1 // Make zero

  add r30, r16 // Add our r16 to the address
  adc r31, r1  // Add zero with carry to the upper half of Z

  lpm r17, Z // Load a byte from program memory into r17

  ret // assuming we are in a routine, i.e. call/rcall was performed

...

shl_lookup_table:
  .db 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
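The address arithmetic in the routine above can be sketched in Python. This is an illustration, not AVR code: it assumes the label is a word address (hence the `* 2` to get the byte address that `lpm` expects), and the function name `z_for_entry` and the address `0x07E` are made up for the example.

```python
def z_for_entry(table_word_addr, index):
    """Model of ldi/ldi/add/adc: returns the 16-bit Z pointer."""
    byte_addr = table_word_addr * 2          # low()/high() of (label * 2)
    zl = byte_addr & 0xFF                    # ldi r30, low(...)
    zh = (byte_addr >> 8) & 0xFF             # ldi r31, high(...)
    total = zl + (index & 0xFF)              # add r30, r16
    zl, carry = total & 0xFF, total >> 8
    zh = (zh + 0 + carry) & 0xFF             # adc r31, r1 (r1 holds zero)
    return (zh << 8) | zl


if __name__ == "__main__":
    # Hypothetical table at word address 0x07E, i.e. byte address 0x0FC:
    # entries 4..7 sit past a 256-byte boundary, so the adc matters.
    for i in range(8):
        print(i, hex(z_for_entry(0x07E, i)))
```

The `adc` only changes anything when the table straddles a 256-byte boundary, as in this made-up placement.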

AterLux

Can't you align your lookup table to avoid `clr`/`adc`? (Or really just make sure it doesn't cross a 256-byte boundary, but aligning by 8 is an easy way to do that.) Even better, if you align by 256, you can drop the `ldi r30`. Or if you arrange for the value to be in `r30` in the first place, you can `add r30, low(shl_lookup_table * 2)`, I think. – Peter Cordes Jun 27 '20 at 08:50
Posted an answer with that idea: 5 cycles, 7 words total storage including the table. – Peter Cordes Jun 27 '20 at 10:31

Peter Cordes · Answer 4 · 2020-06-27T11:59:45.383

An 8-byte aligned lookup table simplifies indexing and should be good for AVR chips that support lpm - Load from Program Memory. (Optimized from @AterLux's answer.) Aligning the table by 8 means all 8 entries have the same high byte of their address, and there is no wrapping of the low 3 bits, so we can use ori instead of having to negate the address for subi. (adiw only works for 0..63 so might not be able to represent an address.)

I'm showing the best-case scenario where you can conveniently generate the input in r30 (low half of Z) in the first place, otherwise you need a mov. Also, this becomes too short to be worth calling a function so I'm not showing a ret, just a code fragment.

Assumes input is valid (in 0..7); consider @ReAl's answer if you need to ignore high bits, or just do andi r30, 0x7 first.

If you can easily reload Z after this, or didn't need it preserved anyway, this is great. If clobbering Z sucks, or if your AVR doesn't support lpm, you could consider building the table in RAM during initial startup (with a loop) so you could use X or Y for the pointer with a data load instead of lpm.

## gas / clang syntax
### Input:    r30 = 0..7 bit position
### Clobbers: r31.  (addr of a 256-byte chunk of program memory where you might have other tables)
### Result:   r17 = 1 << r30

  ldi   r31, hi8(shl_lookup_table)    // Same high byte for all table elements.  Could be hoisted out of a loop
  ori   r30, lo8(shl_lookup_table)    // Z = table | bitpos  = &table[bitpos] because alignment

  lpm   r17, Z

.section .rodata
.p2align 3        // 8-byte alignment so low 3 bits of addresses match the input.
           // ideally place it where it will be aligned by 256, and drop the ORI
           // but .p2align 8 could waste up to 255 bytes of space!  Use carefully
shl_lookup_table:
  .byte 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80

If you can locate the table at a 256-byte alignment boundary, then lo8(table) = 0, so you can drop the ori and just use r30 directly as the low byte of the address.
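Why the `ori` can replace an addition only for an aligned table can be seen with a few lines of Python. This is just an illustrative sketch: the function name `entry_address_or` and the table addresses are made up, and addresses are treated as plain 16-bit byte addresses.

```python
def entry_address_or(table_addr, bitpos):
    """Model of ldi r31, hi8(table) / ori r30, lo8(table): returns Z."""
    zh = (table_addr >> 8) & 0xFF            # ldi r31, hi8(table)
    zl = (bitpos | (table_addr & 0xFF))      # ori r30, lo8(table)
    return (zh << 8) | (zl & 0xFF)


if __name__ == "__main__":
    aligned = 0x1238       # hypothetical address, multiple of 8
    unaligned = 0x123C     # hypothetical address, not a multiple of 8
    for i in range(8):
        print(i,
              hex(entry_address_or(aligned, i)), hex(aligned + i),
              hex(entry_address_or(unaligned, i)), hex(unaligned + i))
```

With the 8-byte aligned address the low 3 bits of `lo8(table)` are zero, so OR and ADD agree for every index 0..7; with the unaligned one they diverge as soon as an index bit overlaps a set address bit.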

Costs for the version with ori, not including reloading Z with something after, or worse saving/restoring Z. (If Z is precious at the point you need this, consider a different strategy).

size = 3 words code + 8 bytes (4 words) data = 7 words. (Plus up to 7 bytes of padding for alignment if you aren't careful about layout of program memory)
cycles = 1(ldi) + 1(ori) + 3(lpm) = 5 cycles

In a loop, or if you need other data in the same 256B chunk of program memory, the ldi r31, hi8 can be hoisted / done only once.

If you can align the table by 256, that saves a word of code and a cycle of time. If you also hoist the ldi out of the loop, that leaves just the 3-cycle lpm.

(Untested, I don't have an AVR toolchain other than clang -target avr. I think GAS / clang want just normal symbol references, and handle the symbol * 2 internally. This does assemble successfully with clang -c -target avr -mmcu=atmega128 shl.s, but disassembling the .o crashes llvm-objdump -d 10.0.0.)

This is excellent. I use look up tables A LOT and have always been a bit frustrated by needing to use `add` then `adc` to calculate the offset but have never considered this solution which is really obvious with hindsight. — Andy Preston, May 23 '21 at 07:54

Max Kielland · Accepted Answer · 2020-07-16T11:55:30.990

Thank you all for your creative answers, but I went with the lookup table as a macro. I find this to be the most flexible solution because I can easily have different lookup tables for various purposes at a fixed 7 cycles.

; @0 mask table
; @1 bit register
; @2 result register
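.MACRO GetMask
    ldi     ZL,low(@0*2)    ; byte address = word address * 2 (lpm wants bytes)
    ldi     ZH,high(@0*2)
    add     ZL,@1           ; add the bit number to the low byte
    adc     ZH,ZERO         ; propagate the carry (ZERO is a register holding 0)
    lpm     @2,Z            ; load the mask from program memory
.ENDM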

bitmask_lookup:
    .DB 0x01,0x02,0x04,0x08,0x10,0x20,0x40,0x80
inverse_lookup:
    .DB ~0x01,~0x02,~0x04,~0x08,~0x10,~0x20,~0x40,~0x80
lrl2_lookup:
    .DB 0x04,0x08,0x10,0x20,0x40,0x80,0x01,0x02

ldi r16,2
GetMask bitmask_lookup, r16, r1 ; gives r1 = 0b00000100
GetMask inverse_lookup, r16, r2 ; gives r2 = 0b11111011
GetMask lrl2_lookup,    r16, r3 ; gives r3 = 0b00010000 (left rotate by 2)
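The values quoted in the comments above can be cross-checked by rebuilding the three tables in Python. This is only a sketch: the names `BITMASK`, `INVERTED`, `ROTL2` and `get_mask` are made up for the illustration and model the table contents, not the macro's timing.

```python
# The three tables, derived from the plain bit masks.
BITMASK = [1 << n for n in range(8)]
INVERTED = [~m & 0xFF for m in BITMASK]                    # ~mask, kept to 8 bits
ROTL2 = [((m << 2) | (m >> 6)) & 0xFF for m in BITMASK]    # rotate left by 2


def get_mask(table, bit_number):
    """Model of the GetMask macro: plain indexed load from a table."""
    return table[bit_number]


if __name__ == "__main__":
    for name, table in (("bitmask", BITMASK),
                        ("inverse", INVERTED),
                        ("lrl2", ROTL2)):
        print(name, format(get_mask(table, 2), "08b"))
```

For a bit number of 2 the model gives the same three values as the comments on the macro invocations.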

Space is not so much of an issue, but speed is. However, I think this is a good compromise and I'm not forced to align data on quadwords. 7 vs 5 cycles is the price to pay.

I already have one "ZERO" register reserved throughout the whole program, so it costs me nothing extra to do the 16-bit addition.

answered Jun 28 '20 at 11:24


Aligning your table would save the `adc` instruction, as shown in my answer. You can also save space for the `lrl2` version by making it overlap the normal table. (i.e. just append 2 more bytes). If you use an addressing mode of `lpm r1, Z+2`, you can still just do `add ZL,@1` for the low part, with address calculation handling possible carry into the high byte of Z. Note that the required data alignment isn't word, it's quadword (8 bytes) for my answer, not just "data word alignment". – Peter Cordes Jun 28 '20 at 11:55
Yes, good point, I edited word to quadword alignment. The other tables were just examples of different uses, but if I do use the rotate, overlapping tables would indeed save some bytes. – Max Kielland Jun 28 '20 at 12:02
Terminology: "invert*ed*_lookup" would be a clearer name. "inverse" lookup sounds like you're mapping `1< – Peter Cordes Jun 28 '20 at 12:02
Also, if you use your `ldi`/`add` instead of `ori`, you're fine to drop the `adc` if the table simply doesn't span a 256-byte boundary. IDK in practical terms how easy it is to make some kind of build-time `assert` to check that. – Peter Cordes Jun 28 '20 at 12:08
Could you avoid needing a ZERO register here by using sub/sbci from the *end* of the array, with tables in reverse order? There's no `adci` but there is `sbci ZH, 0`. I assume a zero reg is useful to have for other cases, but perhaps other future readers could benefit. (Although as I said, aligning the tables seems like an even better method.) – Peter Cordes Jun 28 '20 at 20:45

score 1 · Answer 6 · answered Oct 24 '22 at 10:08

It's also possible without a lookup table in 7 instructions / 7 ticks. The output must be in an upper register. The register pressure is lower, no precious Z register needed, and it doesn't need a register containing zero:

;; R22 = 1 << (R11 & 7)
ldi  R22, 1
sbrc R11, 1
ldi  R22, 4
sbrc R11, 2
swap R22
sbrc R11, 0
lsl  R22

This can easily be adjusted for right-shifts R22 = 0x80 >> (R11 & 7) or for the complement R22 = 0xff ^ (1 << (R11 & 7)).
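A Python model of the sequence makes it easy to check all 256 input values, including the masking of the high bits. This is a sketch, not AVR code; `mask_left` follows the instructions above one by one, while `mask_right` is my own guess at the right-shift adjustment (constants 0x80 and 0x20, `lsr` instead of `lsl`) and is not taken from the answer.

```python
def swap(x):
    """swap: exchange the two nibbles of an 8-bit value."""
    return ((x << 4) | (x >> 4)) & 0xFF


def mask_left(r11):
    """Model of the 7-instruction sequence: 1 << (r11 & 7)."""
    r22 = 1                     # ldi  R22, 1
    if r11 & 0b010:             # sbrc R11, 1
        r22 = 4                 # ldi  R22, 4
    if r11 & 0b100:             # sbrc R11, 2
        r22 = swap(r22)         # swap R22
    if r11 & 0b001:             # sbrc R11, 0
        r22 = (r22 << 1) & 0xFF     # lsl  R22
    return r22


def mask_right(r11):
    """Assumed right-shift variant: 0x80 >> (r11 & 7)."""
    r22 = 0x80
    if r11 & 0b010:
        r22 = 0x20
    if r11 & 0b100:
        r22 = swap(r22)
    if r11 & 0b001:
        r22 >>= 1               # lsr instead of lsl
    return r22


if __name__ == "__main__":
    for n in range(8):
        print(n, format(mask_left(n), "08b"), format(mask_right(n), "08b"))
```

Only bits 0..2 of the input are ever tested, which is where the `& 7` in the header comment comes from.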

The version that loads from a lookup-table can be 1 instruction shorter if the shift offset is already in R30, and there is no need for a register that contains zero:

;; Offset in R30 = ZL
clr   ZH
subi  ZL, low(-(table))   ;; GNU-Syntax: lo8(-(table))
sbci  ZH, high(-(table))  ;; GNU-Syntax: hi8(-(table))
lpm   R22, Z
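Subtracting the negated table address is the same as adding it, modulo 2^16, which is why no zero register is needed here. The following Python sketch models the `subi`/`sbci` pair; the function name `z_via_subi_sbci` and the addresses are made up for the illustration, and `table_addr` is assumed to already be a byte address.

```python
def z_via_subi_sbci(table_addr, offset):
    """Model of clr ZH / subi ZL, lo(-table) / sbci ZH, hi(-table)."""
    neg = (-table_addr) & 0xFFFF             # 16-bit two's complement
    neg_lo, neg_hi = neg & 0xFF, neg >> 8
    zl, zh = offset & 0xFF, 0                # offset in ZL, clr ZH
    diff = zl - neg_lo                       # subi ZL, lo(-table)
    borrow = 1 if diff < 0 else 0
    zl = diff & 0xFF
    zh = (zh - neg_hi - borrow) & 0xFF       # sbci ZH, hi(-table)
    return (zh << 8) | zl


if __name__ == "__main__":
    # Hypothetical table at byte address 0x01FC, crossing a 256-byte boundary.
    for i in range(8):
        print(i, hex(z_via_subi_sbci(0x01FC, i)))
```

The borrow from the low byte plays the role that the carry plays in the `add`/`adc` version, so the table may sit anywhere in program memory.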

If you align your table by 8, `low+0..7` can't carry-out into ZH. (`.balign 8` before the table). If you align the table by 256, the `low(-(table))` is zero so that happens for free as well, but you can't do that for too many different arrays / LUTs. Oh, turns out I'd already posted an answer about that when the question was new. — Peter Cordes, Oct 24 '22 at 13:36
