4x4 matrix multiplication: Exception 4: Unaligned Address in inst/data fetch: 0x100100bb

Question

I'm trying to do a 4x4 matrix multiplication in MIPS assembly using a simulator (QtMips). QtMips gives me `Exception 4: Unaligned Address in inst/data fetch: 0x100100bb`.

This is where I get the error when I single step.

    [00400070] c52b0000  lwc1 $f11, 0($9) ; 80: lwc1 $f11 0($t1) #load float from array1

The error happens when the counter k = 2, i.e. on the third pass through the loop. I assume something is wrong with the 32-bit alignment of my third `lwc1` load.

Here's what I tried/read that didn't work:

One post suggests putting `.align 2` or `.align 4` before my array (matrix) declarations in `.data`. That didn't work.
Another suggests the problem could be the `size` value (defined after array3). But I load it into `$s1` with `lw $s1 size`, so I don't see this being a real issue for me.

I'm very lost on what to do. Please impart me some wisdom.

Below is my whole code:

    # here's our array data, two args and a result
    .data
    .globl array1
    .globl array2
    .globl array3

    .align 5 #align the data set
array1: .float 1.00, 0.00, 3.14, 2.72, 2.72, 1.00, 0.00, 3.14, 1.00, 1.00, 1.00, 1.00, 1.00, 2.00, 3.00, 4.00
    .align 5 #align the data set
array2: .float 1.00, 1.00, 0.00, 3.14, 0.00, 1.00, 3.14, 2.72, 0.00, 1.00, 1.00, 0.00, 4.00, 3.00, 2.00, 1.00
    .align 5 #align the data set
array3: .float 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00

size: .word 4 #store float in s2

    .text
    .globl main
main:
    sw $31 saved_ret_pc

    .data
lb_:    .asciiz "Vector Multiplication\n"
lbd_:   .byte 1, -1, 0, 128
lbd1_:  .word 0x76543210, 0xfedcba98
    .text
    li $v0 4    # syscall 4 (print_str)
    la $a0 lb_
    syscall

# main program: multiply matrix 1 and 2, store in array3

la $t1 array1
la $t2 array2
la $t3 array3 ###load arrrays to registers


li $t4 4 # i loop counter    -> I changed addi to li
li $t5 4 # j loop counter
li $t6 4 # k loop counter

lw $s1 size # load matrix(array) size


i_loop:
    j j_loop
j_loop:
    j k_loop
k_loop:
    #f0 and f1 - float func return values
    #f10 - multiplication return values
    #f4, f5 - register to store addr offset

    lwc1 $f11 0($t1) #load float from array1
    lwc1 $f12 0($t2) #load float from array2
    lwc1 $f13 0($t3) #load float from result array3
    nop 
    mul.s $f10 $f11 $f12 #multiply floats, store result as temp in $f10
    nop

    add.s $f13 $f13 $f10 #add to multiplication result to resulting array3

    swc1 $f13 0($t3) #store the resulting float in array3

#call index_of_A
    move $s0 $ra    #save return address into s0
    nop
    jal index_of_A  #get addr offset for array1
    nop
    move $ra $s0    #restore return address that was saved into s0

#call index_of_B
    move $s0 $ra    #save return address into s0
    nop
    jal index_of_B  #get addr offset for array2
    nop
    move $ra $s0    #restore return address that was saved into s0

    add $t1 $t1 $s2 # next address in the array1
    add $t2 $t2 $s3 # next address in the array2
    addi $t3 $t3 4 # next address in the array3

    addi $t6 $t6 -1 #decrease k counter
    bne $t6 $0 k_loop #repeat k_loop

    addi $t5 $t5 -1 #decrease j counter
    bne $t5 $0 j_loop #repeat j_loop

    addi $t4 $t4 -1 #decrease i counter
    bne $t4 $0 i_loop #repeat i_loop

#used regs: f0-f5, f10-13
index_of_A: #function for array1 addr offset    #may need to convert all to float first
    #size*i + k #$f20*i + k
    mul $s2 $s1 $t4 # 4*i, 
    add $s2 $s2 $t6 # + k, store in $s2
    jr $ra #jump back to the caller


index_of_B: #function for array2 addr offset
    #4*k + j
    mul $s3 $s1 $t6 # 4*k, 
    add $s3 $s3 $t5 # + j, store in $s3
    jr $ra #jump back to the caller


# Done multiplying...
    .data
sm: .asciiz "Done multiplying\n"
    .text
print_and_end:
    li $v0 4    # syscall 4 (print_str)
    la $a0 sm
    syscall

# Done with the program!
    lw $31 saved_ret_pc
    jr $31      # Return from main

#Terminate the program
    li $v0, 10
    syscall

.end main

But I don't understand what's wrong, since the exact same code works in another example of mine here:

I don't speak MIPS, but as a rule-of-thumb, I'd expect floats to be at least 32bits. Such being the case, your `.align 2` would be insufficient. Also, 0x100100bb is an odd number (literally, the 1 bit is set). That can't be a good thing for a system that requires aligned reads. I'd check the math that you use to increment this pointer to your array. — David Wohlferd, Jul 03 '18 at 04:00
@DavidWohlferd Just realized that error right after I post this so I manipulated it to .align 5 and also fixed the size of my array to 4. But still doesn't work. :( — Leonard, Jul 03 '18 at 04:11
I expect the right answer is `.align 4`. However, that isn't your only problem. If your array is located at say 0x100, then you can load that value into a register and use it to read a value from your array, since it's nicely aligned. But if you increment it by 1 (0x101), that's not aligned anymore. You need to be incrementing your register by the size of the elements in your array (which I suspect is 4). — David Wohlferd, Jul 03 '18 at 04:19
@Leonard: `.align` in GAS-like syntax either takes a power of 2, or an exponent. If your assembler didn't complain about 5, then it treats `.align` as a synonym `.p2align`, so you were aligning to a 2^5 = 32 byte boundary. — Peter Cordes, Jul 03 '18 at 04:27
@DavidWohlferd Tried .align 4 and didn't work. As you mentioned, perhaps the way I access the next element in the first matrix is a bit off. I'll try to check the address increment and modify my function index_of_A — Leonard, Jul 03 '18 at 04:46
@PeterCordes Okay so since floating point number is 32-bit aka 8 byte, I need to make it 2^4? I don't understand your comment since I don't know what GAS like means and the difference between .align and .p2align. (Googled it but the documentation explanations just fly over my head...) — Leonard, Jul 03 '18 at 04:54
`.align 4` should be fine. `1<<4 = 2^4 = 16`. 32 bits is *4* bytes. More alignment than necessary isn't going to hurt. — Peter Cordes, Jul 03 '18 at 05:01
Since you are running this under a debugger, it should be a simple matter. From the comments, you are using `add $t1 $t1 $s2 # next address in the array1` (which I interpret to mean t1 = t1 + s2) to go to the next element. What does t1 contain before and after this instruction? Are they both evenly divisible by 4? And what's in s2? I don't see anything that explicitly assigns a value to it. Does this somehow happen implicitly on MIPS? — David Wohlferd, Jul 03 '18 at 05:14
@DavidWohlferd Yes, just before your comment I found a HUGE error related to it. I forgot that I was counting my loop counter DOWN from 4 to 0, instead of going from 0 to 4 as in a C or Python for-loop. This created the problem when I was accessing elements in array1 and array2. Also yes, $s2 is the size of the matrix, but it shouldn't be there. I fixed it, so it is now `add $t1 $t1 4`. Array2 is trickier since I need to move from one row to another. — Leonard, Jul 03 '18 at 05:19
In your case even `.align 2` (2^2=4) should be enough, BTW how-to debug: if the load of first element works, then you know you have your array aligned right, because the alignment can't change between elements, as they use consecutive words (32 bit) chunks of memory. But you should have verified that your code would access `array1+4` address for second element, and `array1+(size*4)` for first element of next row, i.e. run your element address calculation in head or in small extra piece of code, and verify the offsets values (part of address added to base `array1` address) are as expected. — Ped7g, Jul 03 '18 at 05:44
@DavidWohlferd I think I almost hunted these bugs down but ran into `Bad address in data/stack read: 0x00000000` now.. Could you please take a look at this? https://codeshare.io/5ovKDb Thanks so much for helping out. — Leonard, Jul 03 '18 at 05:44
In MARS it fails inside loop at `lwc1 $f12 0($t2) #load float from array2`, when `t2 = 0`, so it tries literally to read from address 0x00000000. Your code doesn't make sense, you load `array2` into `t2` at init, but then you clear `t2` by moving uninitialized `s4` inside the loop. EDIT: Are your matrices of fixed 4x4 size, or is the code expected to work over generic sizes? — Ped7g, Jul 03 '18 at 05:50
@Ped7g Right I realized that it is not initializing $s4 in my QtSpim as well. I have the move $t2 $s4 instruction however, and I'm really confused as to why it doesn't get executed. Do you have any idea? I'm also fixing the index accessing. — Leonard, Jul 03 '18 at 05:52
everything gets executed up till the exception happens, so I'm not sure what you mean. Try to single-step over it in debugger or in head, your code logic is a bit broken, like you probably expected `s4` to contain something meaningful, what is maybe put there LATER, etc... you need to be very precise and sequential in assembler, watching how it operates in debugger while single stepping often helps a lot to get into that mood. And maybe draw some algorithm notes on paper, so you can easily verify the basic assumptions while inside loop/etc. The ASM itself is so simple it sometime hurts a bit;) — Ped7g, Jul 03 '18 at 05:56
BTW, why don't you use commas in instructions, like `lwc1 $f11 0($t1)` -> `lwc1 $f11, 0($t1)`? The official syntax requires a comma between operands; it's just a SPIM/MARS assembler quirk that it also accepts arguments separated only by a space. So while your way works for you, it makes the source a bit harder on my eyes, because my mind reports a syntax error on almost every line and I have to remind myself that it's actually OK in the SPIM/MARS simulators (but not in other MIPS assemblers, like gas; btw GAS = "GNU Assembler"). — Ped7g, Jul 03 '18 at 06:00
@Ped7g Thank you. I'm re-working on writing/drawing my pseudocode now. I didn't have commas just because my professor didn't use it in his example. Thanks for letting me know! — Leonard, Jul 03 '18 at 06:03
@Ped7g I'm just trying to make this work for the 4x4 size for now. I'm trying to do something like this: `#How to access next elements in array 1 # 4*i + k #incr_size = 4 #addr_of_array1[0] + incr_size*4*i + k` Basically, I'm trying to access the elements in the array by 1) storing the original address of array1[0], 2) calculating the offset, and 3) adding the offset to that address to reach the right element. I'm now fixing my logic for calculating the offset; however, I was surprised that I couldn't store the original address earlier with `move $t2 $s4`. — Leonard, Jul 03 '18 at 06:08
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174211/discussion-between-leonard-and-ped7g). — Leonard, Jul 03 '18 at 06:15
@Ped7g Pinging you and some others here. Before jumping into the more complicated matrix multiplication, I instead wrote a code that accesses the elements in the matrix correctly in column-major order. https://codereview.stackexchange.com/questions/197857/mips-assembly-program-to-access-elements-in-the-4x4-matrix-row-or-column-major — Leonard, Jul 05 '18 at 05:58
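
To make the pointer arithmetic discussed in the comments above concrete, here is a minimal Python sketch (not MIPS, and not part of the original thread) that traces how `$t1` moves in the question's `k_loop`. It assumes `$t4` and `$t6` start at 4 and `$s1` holds 4, as in the posted code, and it treats the result of `index_of_A` as a raw byte count, because that is how `add $t1 $t1 $s2` uses it. The names `trace_t1` and `align_bytes` are made up for illustration.

```python
WORD = 4   # lwc1 needs an address that is a multiple of 4
SIZE = 4   # value loaded from `size` into $s1

def align_bytes(n):
    """Boundary in bytes for `.align n` when n is treated as a power-of-two exponent."""
    return 1 << n

def trace_t1(i=4, k_start=4):
    """Return (k, offset from array1, aligned?) for each lwc1 in the first k_loop pass."""
    offset = 0
    steps = []
    for k in range(k_start, 0, -1):       # k counts down: 4, 3, 2, 1
        steps.append((k, offset, offset % WORD == 0))
        offset += SIZE * i + k            # index_of_A result, added unscaled
    return steps

for k, offset, aligned in trace_t1():
    print(f"k={k} offset={offset} aligned={aligned}")

print("0x100100bb % 4 =", 0x100100bb % WORD)
print(".align 2 / 4 / 5 ->", align_bytes(2), align_bytes(4), align_bytes(5), "bytes")
```

Under these assumptions the first two loads sit at offsets 0 and 20, and the third lands at offset 39, which is not a multiple of 4. That is consistent with the exception showing up when k = 2, and it suggests why changing `.align` made no difference: the base address was fine, the increments were not.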

score 2 · Accepted Answer · answered Jul 08 '18 at 03:36

4x4 Matrix multiplication

Okay, so I figured it out, so I am answering my own question.

I learned many things along the way, including:

`.align` is not necessary to run this code; it works without it. Perhaps I just didn't need it in this specific situation.
`$f12` is the register the print-float syscall (`$v0 = 2`) reads from. If the float is in some other register, it won't print.
The offset of the first element is 0, so the offset has to be calculated and added to the base address at the top of the loop, before the load, instead of at the end. That's what was causing all the trouble.
Be sure to calculate your index correctly. Look at my code comments to see what I do :)
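
The index calculation from the last point can be sketched in Python as follows. This is only an illustration, assuming row-major 4x4 matrices of 4-byte floats as in the code in this answer; `offset_a` and `offset_b` are hypothetical helper names, not labels from the program.

```python
MATRIX_SIZE = 4   # rows and columns
FLOAT_SIZE = 4    # bytes per single-precision float

def offset_a(i, k):
    """Byte offset of A[i][k] from the base address of A (row-major)."""
    return (MATRIX_SIZE * i + k) * FLOAT_SIZE

def offset_b(k, j):
    """Byte offset of B[k][j] from the base address of B (row-major)."""
    return (MATRIX_SIZE * k + j) * FLOAT_SIZE

# Every offset is a multiple of 4, so base + offset stays word-aligned
# as long as the base address itself is aligned.
offsets = [offset_a(i, k) for i in range(MATRIX_SIZE) for k in range(MATRIX_SIZE)]
print(offsets)
print(all(off % FLOAT_SIZE == 0 for off in offsets))
```

The key difference from the code in the question is the final multiplication by the element size, and that the offset is added to a saved base address instead of being accumulated into the running pointer.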

Here is the final version of my code, which works! You can see my GitHub for the matrix multiplication in Python, C, and Assembly: https://github.com/leochoo/cmpa

.data

#define matrices
    .globl A
    .globl B
    .globl R

    .align 4 #align the data set
    A: .float 1.00, 0.00, 3.14, 2.72, 2.72, 1.00, 0.00, 3.14, 1.00, 1.00, 1.00, 1.00, 1.00, 2.00, 3.00, 4.00
    .align 4 
    B: .float 1.00, 1.00, 0.00, 3.14, 0.00, 1.00, 3.14, 2.72, 0.00, 1.00, 1.00, 0.00, 4.00, 3.00, 2.00, 1.00
    .align 4 
    R: .float 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00


    matrix_size: .word 4 #row and column size
    float_size: .word 4 #a float is 4 bytes (32 bits) in MIPS,
                        #so consecutive elements are 4 bytes apart:
                        #array[0] at base, array[1] at base + 4,
                        #array[2] at base + 8, and so on.

    tempSum: .float 0.00 #initialize tempSum as 0

    lineBrk: .asciiz "\n"


    #For debugging
    arr_1: .asciiz "A: "
    arr_2: .asciiz " B: "
    arr_3: .asciiz " R: "
    i_:     .asciiz " i:"
    j_:     .asciiz "j:"
    k_:     .asciiz "k:"
    space_: .asciiz " "
    bar_:   .asciiz " | "





#TEXT (MAIN) SECTION - multiply matrices A and B, store the result in R
.text
    .globl main
main:

    #print title
        .data
    lb_:    .asciiz "Vector Multiplication\n"
    lbd_:   .byte 1, -1, 0, 128
    lbd1_:  .word 0x76543210, 0xfedcba98
        .text
    li $v0 4    # syscall 4 (print_str)
    la $a0 lb_
    syscall

#load matrices
la $t1 A
la $t2 B
la $t3 R

#load variables
li $s0 0 # scratch register, later holds the element offset into A and B
lw $s1 matrix_size # $s1 = matrix_size
lw $s2 float_size # $s2 = float_size
l.s $f5 tempSum #tempSum 


#store base addresses
move $s6 $t1 # $s6 = base address of matrix A stored
move $s7 $t2 # $s7 = base address of matrix B stored



#for i in 0...4:
    #for j in 0...4:
        #for k in 0...4:
    li $t4 0 # i counter
i_loop:
        li $t5 0 # j counter
    j_loop:
            li $t6 0 # k counter
        k_loop:
            #update index of A[i:t4][k:t6]
                # $s0 = offset result
                # $s1 = matrix_size: 4
                # $s2 = float_size: 4
                # $s6 = base address of A

                #calculate offset
                mul $s0 $s1 $t4 # s0 = matrix_size*i
                add $s0 $s0 $t6 # s0 = s0 + k
                mul $s0 $s0 $s2 # s0 = float_size*s0

                #increase by offset
                add $t1 $s6 $s0 # new index = base_addr + offset  ##first loop initialization will always be zero... oh..

            #update index of B[k:t6][j:t5]
                # $s0 = offset result
                # $s1 = matrix_size: 4
                # $s2 = float_size: 4
                # $s7 = base address of B

                #calculate offset
                mul $s0 $s1 $t6 # s0 = matrix_size*k
                add $s0 $s0 $t5 # s0 = s0 + j
                mul $s0 $s0 $s2 # s0 = float_size*s0

                #increase by offset
                add $t2 $s7 $s0 # new index = base_addr + offset


            #load matrix A and B
            lwc1 $f1 0($t1) #load float from matrix A
            lwc1 $f2 0($t2) #load float from matrix B
            nop
                #print i, j, k

                li $v0 4        
                la $a0 i_
                syscall         # "i"

                li $v0 1 
                move $a0 $t4
                syscall         # value of i

                li $v0 4        
                la $a0 j_
                syscall         # "j"

                li $v0 1 
                move $a0 $t5
                syscall         # value of j


                li $v0 4        
                la $a0 k_
                syscall         # "k"

                li $v0 1 
                move $a0 $t6
                syscall         # value of k

                li $v0 4        # " | "
                la $a0 bar_
                syscall 

                #print A and B
                li $v0 4    
                la $a0 arr_1
                syscall

                lwc1 $f12 0($t1) #A
                li $v0 2
                syscall

                li $v0 4    
                la $a0 arr_2
                syscall

                lwc1 $f12 0($t2) #B
                li $v0 2
                syscall


            #Break down: R[i][j] += A[i][k] * B[k][j]
            #### first result: (1*1)+(0*0)+(3.14*0)+(2.72*4)

            # (A * B)
            nop
            mul.s $f0 $f1 $f2 # (a*b)
            nop
            #tempSum:$f5 = tempSum + (A * B)
            add.s $f5 $f5 $f0
            nop
                ####1st = (A*B)
                ####2nd = (A*B) + (A*B)         


            #DON'T UPDATE index of R here
            #you only need to update it 16 times, hence in j_loop

        #k_loop end condition
        addi $t6 $t6 1 # k++
        bne $t6 $s1 k_loop #if k != 4, repeat k_loop


    #store R[i][j] = tempSum:$f5
    swc1 $f5 0($t3) #store the resulting float in R
    nop

    #reset tempSum = 0
    l.s $f5 tempSum

    #load and print element in R
    li $v0 4    
    la $a0 arr_3 # " R "
    syscall     

    lwc1 $f12 0($t3)
    li $v0 2
    syscall 

    li $v0 4
    la $a0 lineBrk #print( '\n' )
    syscall

    #update index of R[i][j] - same as updating index of A
    add $t3 $t3 $s2


    #j_loop end condition
    addi $t5 $t5 1 
    bne $t5 $s1 j_loop 

#i_loop end condition
addi $t4 $t4 1 
bne $t4 $s1 i_loop


# Done multiplying...
    .data
sm: .asciiz "Done multiplying\n"
    .text
print_and_end:
    li $v0 4    # syscall 4 (print_str)
    la $a0 sm
    syscall

#Terminate the program
    li $v0, 10
    syscall

.end main
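
For checking the printed R values, a plain Python reference of the same triple loop may help. This is a sketch that reuses the A and B data from the `.data` section above; it is not the Python version from the linked repository, and `matmul4` is a name chosen here for illustration. Python floats are double precision, so trailing digits can differ slightly from the single-precision MIPS output.

```python
N = 4

# Same data as the A and B matrices in the .data section, stored row-major.
A = [1.00, 0.00, 3.14, 2.72,
     2.72, 1.00, 0.00, 3.14,
     1.00, 1.00, 1.00, 1.00,
     1.00, 2.00, 3.00, 4.00]
B = [1.00, 1.00, 0.00, 3.14,
     0.00, 1.00, 3.14, 2.72,
     0.00, 1.00, 1.00, 0.00,
     4.00, 3.00, 2.00, 1.00]

def matmul4(a, b):
    """Multiply two flat row-major N x N matrices; returns a flat row-major list."""
    r = [0.0] * (N * N)
    for i in range(N):
        for j in range(N):
            temp_sum = 0.0                      # plays the role of tempSum / $f5
            for k in range(N):
                temp_sum += a[N * i + k] * b[N * k + j]
            r[N * i + j] = temp_sum             # one store per (i, j), as in j_loop
    return r

R = matmul4(A, B)
for i in range(N):
    print(R[N * i : N * i + N])
```

The first entry works out to (1*1)+(0*0)+(3.14*0)+(2.72*4), the same breakdown given in the assembly comments.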

You don't need integer `mul` for index calculation. You can do `pointer += stride` inside the loop instead of `i * stride`. You also don't need to *separately* multiply by the matrix size and the element size. First of all, `float_size` is a power of 2, so you can just left shift. And it's a constant, so you can use it as an immediate `sll $s1, $s1, 2`. Second, you can do `matrix_size << 2` and get a single scale factor or byte stride. As a general rule, hoist as much calculation as possible out of loops, so it only runs once ahead of the loop. — Peter Cordes, Jul 08 '18 at 03:45
@PeterCordes You are absolutely right. I'm sure there is a way to improve it a lot better. I'll get back to this to improve it when I can. Thanks! — Leonard, Jul 08 '18 at 03:49
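
The suggestion in the comment above (bump a pointer by a fixed stride instead of recomputing `i * stride` with `mul`) can be sketched in Python like this. It is only an illustration of the idea, assuming the 4x4 row-major layout of 4-byte floats used in the answer; the names `ELEM_SHIFT`, `ROW_STRIDE`, `offsets_by_multiply` and `offsets_by_stride` are invented here.

```python
MATRIX_SIZE = 4
ELEM_SHIFT = 2                          # log2 of the 4-byte float size; a shift by 2
ROW_STRIDE = MATRIX_SIZE << ELEM_SHIFT  # 16 bytes per row, computed once

def offsets_by_multiply(i, j):
    """Byte offsets of A[i][k] and B[k][j], recomputed from scratch for every k."""
    return [((MATRIX_SIZE * i + k) << ELEM_SHIFT,
             (MATRIX_SIZE * k + j) << ELEM_SHIFT)
            for k in range(MATRIX_SIZE)]

def offsets_by_stride(i, j):
    """Same offsets, obtained by bumping two running offsets inside the k loop."""
    a = i * ROW_STRIDE        # start of row i in A, set up before the k loop
    b = j << ELEM_SHIFT       # top of column j in B, set up before the k loop
    pairs = []
    for _ in range(MATRIX_SIZE):
        pairs.append((a, b))
        a += 1 << ELEM_SHIFT  # next element in the same row of A
        b += ROW_STRIDE       # same column, next row of B
    return pairs

same = all(offsets_by_multiply(i, j) == offsets_by_stride(i, j)
           for i in range(MATRIX_SIZE) for j in range(MATRIX_SIZE))
print(same)
```

In the stride version the inner loop contains only additions; the one remaining multiply sits outside the k loop and could itself be replaced by adding `ROW_STRIDE` once per i iteration.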
