
I'm learning the Linux kernel source code.

I already have a basic understanding of assembly language, such as the usage of general instructions (mov, add, jmp, call...) and the difference between AT&T syntax and Intel syntax.

So for now, it isn't a big problem for me to understand the rough idea of what this asm code is doing. But the directives like .text and .data that appear at the head and tail of the following code confuse me a lot.

So, my direct question is: what is the meaning of the .text pair and the .data pair? My root question is: which asm dialect is this syntax based on? I think it is a variant of Intel syntax, as there is no '$' before constants. But why is there a '#', and '_start' instead of 'main'? Where can I find a complete introduction to all of this asm grammar?

Help me, please!

Thanks a lot!

.globl begtext, begdata, begbss, endtext, enddata, endbss
.text
begtext:
.data
begdata:
.bss
begbss:
.text

BOOTSEG  = 0x07c0           ! original address of boot-sector
INITSEG  = 0x9000           ! we move boot here - out of the way
SETUPSEG = 0x9020           ! setup starts here

entry _start
_start:
    mov ah,#0x03        
    xor bh,bh
    int 0x10
... ...

.text
endtext:
.data
enddata:
.bss
endbss:
Peter Cordes

2 Answers


You will find a complete introduction to all of this asm grammar in the documentation of the assembler. Unlike machine instructions such as mov, add, jmp, call, the many directives and pseudo-instructions are not standardized; they depend on the preferences and experience of the authors of each assembler.

The statement .globl begtext, begdata, begbss, endtext, enddata, endbss declares that these labels (defined later in the source text) are GLOBAL, alias PUBLIC, i.e. they can be accessed from other program modules linked with the kernel.

The names .text, .data, .bss are directives which tell the assembler to redirect its output (emitted code and data) to a particular segment. An executable program file contains one segment with machine instructions (.text alias .code) and segments for data (.data, .rodata, .bss), but the source text doesn't have to be written in this order. Imagine that Linus (or whoever wrote the source) is the boss who dictates the source code to his secretary (the assembler). He tells it to emit three instructions which call BIOS service INT 0x10 with AH=0x03 to read the current cursor position:

.text
 mov ah,#0x03        
 xor bh,bh
 int 0x10

The secretary will write (emit) those instructions on a sheet of paper labeled .text. Then Linus decides to define a message, so he gives the secretary the directive .data, followed by the message definition:

.data    
 Message: .ascii "Kernel is starting, please wait."

The secretary will grab another sheet of paper, label it .data, and write the Message definition on it. When Linus decides to dictate more machine code, the secretary takes back the .text sheet and continues writing at the spot (origin) where it was interrupted, just below the instruction INT 0x10. In this way they may alternate between the output segments as often as they like and keep the data near the code which manipulates it (this is good for the readability of the program). Finally, all paper sheets will be stapled (linked) together, so all machine instructions end up next to each other in the .text segment, and similarly all data in the .data segment.
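The sheet-of-paper picture above can be sketched as a toy model in Python. This is only an illustration of section switching, not how any real assembler is implemented: the class ToySectionAssembler and its methods are hypothetical names, offsets are counted in emitted items rather than bytes, and the final section order is an assumption.

```python
class ToySectionAssembler:
    """Toy model of section switching; offsets count emitted items, not bytes."""

    def __init__(self):
        self.sheets = {}      # section name -> list of emitted items
        self.current = None   # name of the section currently being written
        self.labels = {}      # label name -> (section name, offset in items)

    def section(self, name):
        # Like .text/.data/.bss: pick up that sheet again, creating it if new.
        self.sheets.setdefault(name, [])
        self.current = name

    def label(self, name):
        # A label records the current position within the current section.
        self.labels[name] = (self.current, len(self.sheets[self.current]))

    def emit(self, item):
        # Append to the end of the current sheet, wherever it was left off.
        self.sheets[self.current].append(item)

    def link(self, order=(".text", ".data", ".bss")):
        # "Staple" the sheets together, one whole section after another.
        image = []
        for name in order:
            image.extend(self.sheets.get(name, []))
        return image


asm = ToySectionAssembler()
asm.section(".text")
asm.label("begtext")
asm.section(".data")
asm.label("begdata")
asm.section(".text")
for insn in ("mov ah,#0x03", "xor bh,bh", "int 0x10"):
    asm.emit(insn)
asm.section(".data")
asm.emit("message bytes")
asm.section(".text")
asm.emit("next instruction")
asm.label("endtext")
asm.section(".data")
asm.label("enddata")

image = asm.link()
text_size = asm.labels["endtext"][1] - asm.labels["begtext"][1]
print(image)
print("items in .text:", text_size)
```

Even though the source alternates between .text and .data, the linked image keeps all .text items together, and a label at the start and end of a section is enough to compute how much was emitted into it.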

vitsoft
  • Thanks a lot! Most of my confusion is solved now. But about the directives, you mentioned that they are not standardized and can vary from author to author. So I wanna ask, how does the assembler recognize and process so many different styles of directives? – WilliamAllwaysWin Jun 18 '21 at 14:45
  • Assemblers **do not recognize** directives which belong to other assemblers; they either report an error or misinterpret the foreign directives. When you want to switch to the `.text` segment/section in [NASM](https://www.nasm.us/xdoc/2.15.05/html/nasmdoc7.html#section-7.3), you have to use `SECTION .text`. In [MASM](https://learn.microsoft.com/en-us/cpp/assembler/masm/dot-code?view=msvc-160) it is `.CODE .text`. In [€ASM](https://euroassembler.eu/eadoc/#Sections) it is `[.text]`. – vitsoft Jun 18 '21 at 18:48

You are looking at something different: boot code.

The boot code must be 16-bit, and it is special. It also requires a different assembler (16 bit, real-mode).

We are in 16-bit real mode, so we have segments. The .text section is for the CS segment (so offsets in .text are relative to CS). The .data section is expected to be in the DS segment, but this can be changed (and you can tell the assembler, so that it knows how to calculate the offsets).

Note: the boot code is also special because the BIOS loads it at 07C0h:0000h and runs it, but the code then moves itself and runs its second part from the new location, so the same code executes with different CS values (you may find a far jump which seems unnecessary). It then loads the rest of the boot code and does the setup work at BIOS and hardware level (e.g. enabling the A20 line). Next it prepares the memory tables for protected 32-bit mode (at this point only the first MB of memory is available, and some of it is used by the BIOS), so that we can switch modes, flush caches, and execute in 32-bit. The rest of the setup follows.

EDIT: add more explanation.

We are in 16-bit real mode, so memory addresses are calculated from two components: segment and offset (both are 16 bits long; the segment is shifted left by 4 bits before the offset is added). So the current code is at CS:IP, and we usually read data from the DS segment (with a few exceptions, or we can explicitly give the segment).

When you assemble the code, you get different parts in different segments.
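As a minimal sketch of the segment:offset calculation, the Python below applies it to the three constants from the question's code. The helper name physical_address is made up for this illustration, and address wrap-around at 1 MB is deliberately ignored.

```python
# Segment values taken from the question's source listing.
BOOTSEG = 0x07C0
INITSEG = 0x9000
SETUPSEG = 0x9020


def physical_address(segment, offset):
    """Real-mode address: segment shifted left by 4 bits, plus the offset.

    Hypothetical helper for illustration; wrap-around at 1 MB is ignored.
    """
    return (segment << 4) + offset


for name, seg in (("BOOTSEG", BOOTSEG), ("INITSEG", INITSEG), ("SETUPSEG", SETUPSEG)):
    print(f"{name}:0000 -> {physical_address(seg, 0):#07x}")

# Different segment:offset pairs can name the same byte, so segments overlap.
same_byte = physical_address(0x07C0, 0x0000) == physical_address(0x0000, 0x7C00)
print("07C0:0000 and 0000:7C00 are the same byte:", same_byte)

# Distance in bytes between the start of INITSEG and the start of SETUPSEG.
gap = physical_address(SETUPSEG, 0) - physical_address(INITSEG, 0)
print("SETUPSEG starts", gap, "bytes after INITSEG")
```

The gap between INITSEG and SETUPSEG works out to 512 bytes, the size of one boot sector, which is consistent with the "setup starts here" comment in the listing.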

.text
begtext:

The first line declares that we are now in the .text segment (text is usually for code). On the next line we declare a label, begtext, which I assume marks the beginning of our text segment.

The same is done for the other segments (.data and .bss). The reason is that we want to know the initial offset, so that we can move the data (we do not know whether the assembler or the loader will put us at a different offset, so it is safe to define a label at the beginning and use it if we need to move the code). You see the same at the end of the file (e.g. endtext), to mark the last point and so give the size of the used segment (endtext - begtext).

Note: we cannot simply move an entire segment, because segments may overlap, or CS may be the same as DS, just with a different offset, so we risk overwriting the wrong part (in case we move code and data in a different order).

entry defines the entry point: you want the loader to start at _start. But this depends on the format of the resulting file (and so on the assembler's directives). Sometimes you have org instead, where you explicitly tell the assembler the current position (e.g. DOS .com files used org 100h, IIRC, and that was the entry point called by DOS).

There are also definitions of some constants (e.g. BOOTSEG = 0x07c0 is the segment where the BIOS on a PC loads and executes boot sectors).

.globl just declares those symbols as "global", i.e. exported and available to other modules if you want to "link" several files.

Giacomo Catenazzi
  • Thank you Giacomo, I've read your answer carefully. I think you are stating the rough idea of the boot code. But for now, what I really wanna know is just the meaning of these directives, so that I can read and understand the bootsect.s code on my own. Could you explain more about the directives, especially about the different styles? – WilliamAllwaysWin Jun 18 '21 at 14:56