0

I made an Interruptroutine with asm that is already faster than a C-Version.

Now i wonder if there could be a faster way for it (little tweaks). Any suggestions would be much appreciated.

Times (Atmega328):

  • My asm: 65 clocks
  • Atmel Studio: 108 clocks
  • the code in the answer: 48 clocks

Measured from first instruction of Interrrupt to reti.

ADC_vect:                       
push r18
in r18, SREG-0x20

push r24
push r25

push YL
push YH
push ZL
push ZH

ldi YL, lo8(srcPos)
ldi YH, hi8(srcPos)         ; get address of index

ld r24, Y+
ld r25, Y                   ; read value of index into registers

add r24, r24
adc r25, r25                ; value descripes index of an int (1 int = 2 bytes) array, so we double it

ldi r30, ((SRC_ARR_SIZE*2) & 0x00ff)
ldi r31, ( (SRC_ARR_SIZE*2) >> 8 )  ; load max arraySize in bytes

cp r24, r30
cpc r25, r31                ; compare if actual index is lower than array size

BRLO noZeroing
ldi r24, 0x0
ldi r25, 0x0                ; if not lower, then we start again at 0

noZeroing:

ldi ZL, lo8(srcArray)
ldi ZH, hi8(srcArray)       ; get address of array

add ZL, r24
adc ZH, r25                 ; add address of array with offsetvalue in Z-registers

clc                         ; clear any c-flag that might be set for ROR
ROR r25
ROR r24                     ; divide by two because it was int and we store index and ...

adiw r24, 0x01              ; ... increment index and then ...

st Y, r25                   ; ... store back the index. (r24/25 is free to use from here on)
st -Y, r24

lds r24, ADCL
lds r25, ADCH               ; read adc value

st Z+, r24
st Z+, r25                  ; store value to array address pointed by Z

pop ZH
pop ZL
pop YH
pop YL

pop r25
pop r24
out SREG-0x20, r18
pop r18
reti

The c equivalent:

ISR(ADC_vect){
    srcArray[srcPos] = ADCL | (ADCH << 8);
    srcPos++;
    if(srcPos >= SRC_ARR_SIZE)
        srcPos = 0;
}  

With the answer from below i now created this version (now only 42 clocks), with using only an arraysize of below 256, since i otherwise would have a shortcoming of codeexecution outside of the interrupt (filling more than 256 values in a fraction of a millisecond):

.org 0x00

srcArray:   .space (SRC_ARR_SIZE*2)
srcArrPtr:  .space 2

ADC_vect:
push r18
in r18, SREG-0x20
push YL
push YH
push ZL
push ZH

ldi YL, lo8(srcArrPtr)      ; get address of ptr (+2 for predecrement)
ldi YH, hi8(srcArrPtr)      ; YH is constant

ld ZL, Y+                   ; read the pointer to Z
ld ZH, Y                    ; Y now is on the highbyte of ptr

lds YL, ADCL                ; reuse YH to load adc value
st Z+, YL                   ; to *ptr++
lds YL, ADCH
st Z+, YL

ldi YL, lo8(srcArrPtr)      ; this saved 1 push and 1 pop with the use of YL above

cp ZL, YL

BRLO noReset
ldi ZL, lo8(srcArray)       ; reset next address to write

noReset:
st Y, ZL                    ; write back the ptr low btye ( the highbyte stays constant)

pop ZH
pop ZL
pop YH
pop YL
out SREG-0x20, r18
pop r18
reti
Tobey66
  • 73
  • 9
  • does the index really need to be 16bit? Or is maybe 8bit or 9bit sufficient? – yar Sep 16 '17 at 16:11
  • 1
    I'm voting to close this question as off-topic because code review questions are off-topic for Stack Overflow. You may be able to get help on https://codereview.stackexchange.com/ – Ross Ridge Sep 16 '17 at 17:07
  • how is 9bit supposed to make a difference? Isnt handling an additional bit as handling an additional byte? @yar – Tobey66 Sep 16 '17 at 18:05
  • You don't need to fully handle the additional bit, you can just branch for the high byte. If it's 1, just add 256 to your array addres and do the rest of the operations with srcPos with only 1 byte. – yar Sep 16 '17 at 18:13
  • @yar i dont really see how i could improve the timing with this, as i have to check, branch and save also this single bit in some places. – Tobey66 Sep 16 '17 at 18:28
  • I haven't programmed it, it's just an idea that might save one or two cyckles. You're right, it's not something that will save you a lot. – yar Sep 16 '17 at 19:01
  • Your assembly code looks a lot different than the C code, or at least it's in a different order, which makes it harder than necessary to understand what you are doing. I also suspect you should just stick to C and not bother with assembly; you never said why it's important for this ISR to be super fast and why a few extra cycles might matter. – David Grayson Sep 16 '17 at 21:45

2 Answers2

2

Use the c-equivalent code of

ISR(){
  *ptr++=lo + hi*256;
  if (ptr==end) ptr=begin;
}

This should convert to perhaps half of your current assembly. Additional optimization can be done with careful placing of the variables -- e.g. placing the ptr at the end reduces the number of constants/addresses.

ADC_vect:
push r18
push r19
in r18, SREG-0x20

push YL
push YH
push ZL
push ZH

ldi YL, lo8(ptr + 2)
ldi YH, hi8(ptr + 2)       ; get address of ptr (+2 for predecrement)

ld ZH, -Y                  ; read the pointer to Z
ld ZL, -Y                  ; leaving Y==end

lds r19, ADCL              ; reuse r19 to load adc value
st Z+, r19                 ; to *ptr++
lds r19, ADCH
st Z+, r19

cp ZL, YL
cpc ZH, YH                 ; compare if actual index is lower than array size

BRLO noReset
ldi ZL, lo8(srcArray)      ; reset next address to write
ldi ZH, hi8(srcArray)      ; to the beginning of srcArray

noReset:
st Y+, ZL                  ; write back the ptr
st Y+, ZH

pop ZH
pop ZL
pop YH
pop YL

out SREG-0x20, r18
pop r19
pop r18
reti
Aki Suihkonen
  • 19,144
  • 1
  • 36
  • 57
  • I dont understand how i should implement your suggestion into my code. What is lo and hi? More variables would generate more assembly code i guess... – Tobey66 Sep 16 '17 at 18:11
  • 1
    The point is that you store a pointer to the next position you are going to write to. That is smarter than the original code because then you don't have to perform 16-bit addition in the ISR to compute the address to write to, and you will also have fewer registers in the context that gets saved. And `lo` and `hi` would be the ADC result registers. – David Grayson Sep 16 '17 at 21:38
  • The `begin` and `end` are not variables but constant literals (either at fixed position or relocatable). Begin is actually `srcArray` and end corresponds to `srcArray + ARRAY_SIZE`. – Aki Suihkonen Sep 17 '17 at 05:30
  • The inner logic is now reduced from 26 instructions to 14 -- almost 50% reduction. (And one variable is saved from the prologue/epilogue). – Aki Suihkonen Sep 17 '17 at 12:21
  • Very smart! Thats pure awesomness! – Tobey66 Sep 17 '17 at 21:18
  • This is how i placed ther variables (its mixed c/asm): `.section .data .org 0x00 destArray: .byte (DEST_ARR_SIZE*2) .org (DEST_ARR_SIZE*2) srcArray: .byte (SRC_ARR_SIZE*2) .org (DEST_ARR_SIZE*2)+(SRC_ARR_SIZE*2) srcArrPtr: .byte 2` But it sometimes gives me the error: `value 0x100 truncated to 0x0` – Tobey66 Sep 17 '17 at 22:41
  • @Tobey66: Looks like your assembler doesn't understand that .byte 0x100 is an array of 0x100 bytes. This can be confirmed if you omit the `org xx` directives, because you shouldn't need to calculate those manually in the first place. Some assemblers use `.space 0x100` to allocate 256 bytes. Also, if your array actually is located on the same 256 byte page, you don't need even store or compare YH and you can save two to three instructions. And further, ptr mod 256 would be completely branch free -- just never store the high byte! – Aki Suihkonen Sep 18 '17 at 05:49
  • @Aki Suihkonen If i omit the .org directive, then the placing of those variables gets wrong sometimes. But i can omit the size after .byte / .space (both work). With page you mean from 0x100 to 0x1ff ? I know nothing about pages and i cant find a good explanation. – Tobey66 Sep 18 '17 at 07:16
  • @Tobey66: Yes, from 0x100 to 0x1ff would be a page in this mixed 8/16-bit architecture. If one works within a page, then the high parts of the addresses are constant. – Aki Suihkonen Sep 18 '17 at 07:37
  • If i use `.byte`, then the variables are messed up. `.space` seems to work fine and does not give this value truncated errror! – Tobey66 Sep 18 '17 at 09:52
0

Since you are always incrementing by 1 (or 2, in the ASM code), instead of

if(srcPos >= SRC_ARR_SIZE)
    srcPos = 0;

you could do

if(srcPos == SRC_ARR_SIZE)
    srcPos = 0;

If you set SRC_ARR_SIZE as power of 2, this statement becomes the same as

srcPos &= ~SRC_ARR_SIZE;

So basicly its just clearing one bit! Since you use int, I expect that SRC_ARR_SIZE > 255, so the bit must be cleared in the high bythe. So it is sufficient to do something like

andi r25, ~((SRC_ARR_SIZE*2) >> 8)
yar
  • 1,855
  • 13
  • 26
  • I thought this would not improve the outcome. Because i tested it, now i know. With Atmel Studio: my C-code takes 108 clocks, your suggestion takes 109 clocks. My assembler is 65 clocks. – Tobey66 Sep 16 '17 at 18:01
  • I meant not to write it as C, but translate this additional step to ASM. replacing `cp r24, r30; cpc r25, r31; BRLO noZeroing; ldi r24, 0x0; ldi r25, 0x0;` by 1 ASM instruction (AND-ing the high byte with the high byte of ~SRC_ARR_SIZE) – yar Sep 16 '17 at 18:14
  • Why only the High byte if there is also information on the low byte? – Tobey66 Sep 16 '17 at 19:22
  • If `SRC_ARR_SIZE` is a power of two and `SRC_ARR_SIZE > 255`, it has only zeros in the low byte, so AND-ing with the low byte of `~SRC_ARR_SIZE` is a nop. – yar Sep 16 '17 at 19:34
  • i guess you are wrong or i am missunderstanding. As im counting below 256 there i have to be active. The low byte of binary ~ is only 1 in the lower if > 255? – Tobey66 Sep 16 '17 at 19:40
  • I worked out this: `cp r24, r30 cpc r25, r31 BRLO noZeroing ldi r24, 0x0 ldi r25, 0x0 noZeroing:` is replaced by: `com r30 com r31 and r24, r30 and r25, r31` It just has the same clocks now, but at least it is always the same cycle length now because there is no branching anymore. – Tobey66 Sep 16 '17 at 19:48
  • I edited the answer, I think it should be clearer now. – yar Sep 16 '17 at 22:00