
I'm trying to write a real mode bootloader and I'm having trouble enabling the A20 line. Here's my code so far; I'm assembling with NASM:

[bits 16]

[global _start]

jmp _start

bios_print:
 lodsb
 test al, al
 jz bios_print_done
 mov ah, 0x0E
 mov bh, 0
 int 0x10
 jmp bios_print

bios_print_done:
 ret

a20_is_enabled:
 push ds
 push si
 push es
 push di

 xor ax, ax
 mov ds, ax
 mov si, BOOT_ID_OFFS

 mov ax, BOOT_ID_OFFS_PLUS_1MB_SEGM
 mov es, ax
 mov di, BOOT_ID_OFFS_PLUS_1MB_OFFS

 cmp word [es:di], BOOT_ID

 mov ax, 1
 jne a20_is_enabled_done

 mov ax, word [ds:si]
 xor ax, ax
 mov [ds:si], ax

 cmp word [es:di], BOOT_ID

 push ax
 xor ax, ax
 mov [ds:si], ax
 pop ax

 mov ax, 1
 jne a20_is_enabled_done

 mov ax, 0

a20_is_enabled_done:
 pop di
 pos es
 pop si
 pop ds

 ret

a20_enable_bios:
 mov ax, 0x2403
 int 0x15
 jc a20_enable_bios_failure
 test ah, ah
 jnz a20_enable_bios_failure

 mov ax, 0x2401
 int 0x15
 jc a20_enable_bios_failure
 test ah, ah
 jnz a20_enable_bios_failure

 mov ax, 1
 jmp a20_enable_bios_done

a20_enable_bios_failure:
 mov ax, 0

a20_enable_bios_done:
 ret

a20_enable:

 push si

 mov si, word MSG_A20_TRY_BIOS
 call bios_print

 pop si

 call a20_enable_bios

 test ax, ax
 jz a20_enable_failure

 call a20_is_enabled

 test ax, ax
 jnz a20_enable_success

a20_enable_failure:

 push si

 mov si, word MSG_A20_FAILURE
 call bios_print

 pop si

 mov ax, 0
 jmp a20_enable_done

a20_enable_success:

 push si

 mov si, word MSG_A20_SUCCESS
 call bios_print

 pop si

 mov ax, 1

a20_enable_done:
 ret

_start:
 xor ax, ax
 mov ds, ax

 cld

 cli

 push si

 mov si, word MSG_GREETING
 call bios_print

 pop si

 call a20_enable

 test ax, ax
 jz boot_error

 ; TODO

boot_error:
 jmp boot_error

BOOT_ID equ 0xAA55
BOOT_ID_OFFS equ 0x7DFE
BOOT_ID_OFFS_PLUS_1MB_SEGM equ 0xFFFF
BOOT_ID_OFFS_PLUS_1MB_OFFS equ BOOT_ID_OFFS + (0x1 << 20) - (BOOT_ID_OFFS_PLUS_1MB_SEGM << 4)

MSG_GREETING db 'Hello from the bootloader', 0xA, 0xD, 0
MSG_A20_TRY_BIOS db 'Trying to enable A20 line via BIOS interrupt', 0xA, 0xD, 0
MSG_A20_SUCCESS db 'Successfully enabled A20 line', 0xA, 0xD, 0
MSG_A20_FAILURE db 'Failed to enable A20 line', 0xA, 0xD, 0

times 510-($-$$) db 0
dw BOOT_ID

The problem is the function a20_is_enabled, which is supposed to check whether the A20 line is enabled after a20_enable_bios has activated it via a BIOS interrupt (I know this is not foolproof; more code will follow here). When I debug the code, everything seems fine until call a20_is_enabled. The processor does perform a near call to the correct address, but no return address is pushed onto the stack (which I have verified with gdb). So when ret is executed in a20_is_enabled, the instruction pointer is set to some garbage address. Why is this?

EDIT: note that there is no ORG 0x7C00 at the beginning of my assembly code. This is because I first create an ELF file so that I can debug my code using gdb, and that doesn't play well with ORG. So I actually do this:

nasm -f elf32 -g -F dwarf boot.asm -o boot.o
ld -Ttext=0x7c00 -melf_i386 boot.o -o boot.elf
objcopy -O binary boot.elf boot.bin
Michael Petch
Peter
  • 2
    I highly recommend BOCHS for real mode debugging of things like bootloaders and other real mode code. Unlike GDB, BOCHs has proper understanding of real mode segment:offset addressing. BOCHs only has limited symbolic debugging but this is usually not a big deal for stepping through things as small as a bootloader. – Michael Petch Nov 15 '20 at 13:35
  • @MichaelPetch: I have not used BOCHs before but that sounds sensible. As to org 0x7c00: that is sort of related to the fact that I use gdb, I will edit the question. – Peter Nov 15 '20 at 13:54
  • @MichaelPetch: I have updated the question, the ORG line is missing because I create an ELF file with debug information and that doesn't seem to work with ORG. Maybe the command I use to link is not appropriate. – Peter Nov 15 '20 at 13:59

1 Answer


Normally one might close this question as caused by a typographical error, but the error isn't necessarily obvious at first. One has to pay close attention in a debugger to the instructions that are actually being executed.

This had me scratching my head since when I looked in the debugger the sequence:

 push ds
 push si
 push es
 push di

 ; Snip other code

 pop di
 pos es
 pop si
 pop ds
 ret

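only showed the processor executing 3 POPs and a ret, when there are clearly 4 POP instructions in the source. Because the processor executes one POP too few, the stack pointer is not back at the return address when ret runs. The return address was pushed by the call, but ret pops the saved DS value instead, returns to the wrong part of memory and causes unexpected behavior.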

The problem is rather trivial: by a stroke of bad luck, the typo assembles without error but does not produce the instruction you want. If you look closely, this is the culprit:

 pos es

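There is a typo: POS should be POP. My brain didn't catch it at first. NASM treats pos as a label (labels don't require a trailing colon), and es is a segment override prefix, which can appear on a line by itself. The prefix then attaches to the following instruction, so the assembler produces es pop si instead of pop es followed by pop si.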
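To make the parse explicit, here is a minimal sketch of the two readings side by side. The byte values in the comments are the standard 8086 encodings (0x26 for the ES override prefix, 0x5E for pop si). The label pos2 is a hypothetical name introduced only so that the file assembles without a duplicate symbol.

```nasm
[bits 16]

; As written in the question. NASM takes "pos" as a label and "es"
; as a lone segment override prefix.
    pop di          ; 5F
    pos es          ; 26     (prefix byte only, nothing is popped)
    pop si          ; 5E     (decoded together with the prefix: es pop si)
    pop ds          ; 1F
    ret             ; C3     (pops the word saved by "push ds")

; The same thing written the way the assembler understood it.
    pop di          ; 5F
pos2:
    es pop si       ; 26 5E  (SI receives the saved ES value)
    pop ds          ; 1F     (DS receives the saved SI value)
    ret             ; C3
```

Both halves should assemble to the same five bytes, which is why neither NASM nor the debugger complains. Disassembling the flat binary (for example with ndisasm -b 16, or with objdump as mentioned in the comments) is a quick way to compare what was encoded against what was intended.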

Clearly the fix is to change it to:

 pop es
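For reference, a sketch of the function frame with the typo fixed, assuming the body between the pushes and pops stays exactly as in the question. The trailing colons on the labels are only a style convention that makes labels easier to tell apart from mnemonics; NASM does not require them.

```nasm
[bits 16]

a20_is_enabled:
    push ds
    push si
    push es
    push di

    ; ... comparison logic from the question goes here ...

a20_is_enabled_done:
    pop di          ; restore in reverse order of the pushes
    pop es          ; was "pos es"
    pop si
    pop ds
    ret             ; SP is back at the return address pushed by call
```

With four pushes matched by four pops, ret finds the caller's return address on top of the stack again.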
Michael Petch
  • 2
    Oh my God, I have looked at this for minutes and didn't see it. Great catch. – Peter Nov 15 '20 at 14:44
  • 4
    Nice spotting. I've wanted to get an optional warning for this into NASM for years. Refer to [Add label-no-colon warning](https://bugzilla.nasm.us/show_bug.cgi?id=3392632). – ecm Nov 15 '20 at 18:06
  • 2
    @ecm : I agree. The warning about a label with no colon on a line with an instruction would have been handy here lol. This was one of the first questions after waking up and I didn't realize the typo after looking at the original code repeatedly. BOCHs didn't do me any favors except to tell me there were 3 pop instructions. It remained a mystery to me until I did an objdump of the code and objdump was kind enough to point out the instruction was `es pop si`. It would have been handy for BOCHs to tell me that was what was encoded. Once I saw that and finally looked close the problem was obvious – Michael Petch Nov 15 '20 at 18:24
  • 3
    @ecm : what is funny is that when I got to figuring it out I was about to leave a comment and point out the problem and close it as a simple typographical error and thought hell I spent enough time on this dumb problem that maybe it is worthy of an answer for the future. I am not a fan of labels without colons. MASM didn't help since there are rules about the colon being in data and code that I never liked. Anyway, I'd like to see BOCHs show us a better decoding and NASM with more warnings and I think your patch is a good idea for that and the other cases. – Michael Petch Nov 15 '20 at 18:27
  • 3
    100% agreed. If NASM doesn't warn you about this by default, that's bad. If it can't warn you about this *at all*, that's terrible. `label: prefix` on a line by itself seems really unlikely, and if you do want that you can do it with a colon to make it clear. `label prefix` should IMO never appear on a line by itself in sane source code, even if the label isn't easily mistakeable for an instruction. An option to simply disable accepting non-`:` labels could also or instead be useful, @ecm, to let the assembler enforce standard NASM style. – Peter Cordes Nov 16 '20 at 00:24
  • @Peter Cordes: If label-no-colon was accepted as a warning, you could specify it as a warning that should be treated as an error. That would serve the same purpose as to "disable accepting non-`:` labels". – ecm Nov 16 '20 at 03:51
  • 1
    If NASM maintainers decide to accept `:` as an optional label terminator, I advocate tolerating this terminator both when the symbol is *defined* and when it is *referred* to, see https://euroassembler.eu/eadoc/#SymbolName – vitsoft Nov 16 '20 at 07:30