
I am trying to write a strlen function in assembly using 64-bit GAS. I need to get an input string from the user, and print its length. This is my code:

.lcomm d2, 255
.data
pstring1:  .ascii "%s\0\n"

.text
.globl main
main:
    movq %rsp, %rbp 

    subq $8, %rsp   
    movq  $d2, %rsi
    movq  %rsi,%rbx          
    movq  $pstring1, %rdi
    movq  $0,%rax
    call scanf

    movq   $1, %rax
    movq   $d2, %rsi
    movq   $pstring1, %rdi
    call  printf #print to check if scanf worked write

    add   $8, %rsp

    movq 8(%rsp), %rcx
    movq %rcx, d2
    call pstrlen
    popq %rbx   
    ret

    ##########
pstrlen:  

    movq %rsp, %rbx
    movq 16(%rbp),%rdx
    xor %rax, %rax        
    jmp if

then:
    incq %rax
    movq $length,%rax
if:
    movq %rdx, %rcx
    cmp 0, %rcx
    jne then
end:
    pop %rbp
    ret

If someone could explain giving an example of how to work with strings and pass parameters to functions in 64-bit GAS assembly it would be ideal, since I can't find anything suitable online.

  • As a courtesy, please fix your formatting if you see it comes out messed up. Click the [edit](http://stackoverflow.com/posts/41452817/edit) link under the post and use the "code sample" button in the toolbar. As for your problem, comment your code better, describe actual and expected behavior, and learn to use a debugger. – Jester Jan 03 '17 at 21:42
  • I didn't check myself how good it is, but just to make your claim *"nothing suitable online"* ridiculous, for example: https://www.youtube.com/playlist?list=PLKK11Ligqiti8g3gWRtMjMgf1KoKDOvME (one whole part is dedicated to gas, and you should try to watch others probably too, from the titles it looks like good basics introductory to things you should understand before even trying to learn x86-64 gas syntax). – Ped7g Jan 03 '17 at 21:52
  • Sorry! I tried to add comments, but they are just gone, and somehow I can't make any more edits for now (some errors). – Nana Mordov Drugobitski Jan 03 '17 at 22:01
  • Unfortunately, I need some help to fix this one because I just dont have enough time, but thank you both very very much! – Nana Mordov Drugobitski Jan 03 '17 at 22:03
  • I have looked at the link. Sorry but the examples are not helping me at all right now. there is no examples of what I need in this Assembly, if you know some other, more relevant links, I would be very glad as I spent last two weeks searching for relevant information, But thanks again very much for your time!! – Nana Mordov Drugobitski Jan 03 '17 at 22:13
  • well, it's not clear, what you need, because the source shows lack of everything (and understanding of basics), so you need everything and that's hard to fix with simple comment/answer. How do you compile the binary out of it? I'm not sure how to use gas, normally I use nasm. And how do you debug it? Your code doesn't even exit correctly, right? (from a quick try in my head it looks like it destroys the stack in incorrect way) – Ped7g Jan 04 '17 at 00:39

1 Answer


In principle, you are using .lcomm d2, 255 to reserve 255 bytes for the string data. One byte is 8 bits, and each bit is either 0 or 1, so the maximum value of one byte is 2^8-1 = 255 when treated as an unsigned binary value. That is how I usually think about bytes (as a number 0..255), but those 8 bits can also represent other values: sometimes a signed 8-bit interpretation is used (-128..+127), or individual bits are given a specific meaning by the code accessing them. (This part is good.)

Then you use scanf with the format "%s\0\n" (it assembles to the bytes '%', 's', 0, 10; I'm not sure what the 10 is good for after the null terminator). I would use .asciz "%254s" instead, to prevent a malicious user from entering more than 254 bytes of input into the reserved d2 space, which needs one more byte for the terminator. (Note that the gas directive is .asciz, with a z at the end, so it adds the zero byte on its own.)
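To make the effect of the field width concrete, here is a minimal sketch in C rather than assembly, since the format string is handled by the C library either way. It uses sscanf on an in-memory string instead of scanf on stdin so that it can be tried without typing input; the names read_word and BUF_SIZE are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>

enum { BUF_SIZE = 255 };   /* same size as the d2 buffer in the question */

/* Read one whitespace-delimited word from `input` into `buf`.
   "%254s" stores at most 254 characters plus the terminating zero byte,
   so a 255-byte buffer cannot overflow. Returns 1 on success, 0 otherwise. */
int read_word(const char *input, char buf[BUF_SIZE]) {
    assert(input != NULL && buf != NULL);
    return sscanf(input, "%254s", buf) == 1;
}
```

Feeding it 399 non-space characters stores only the first 254 of them, followed by the zero terminator in the last byte of the buffer.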

Then you use printf. I would provide a separate format string for the output, this time something like formatOut: .asciz "%s\n".

Finally you want strlen.

Which brings me back to the input. If you are running on a normal 64-bit OS (Linux), your input string is very likely UTF-8 encoded (unless your OS is set to some other locale, in which case I'm not sure which locale scanf will pick up).

UTF-8 is a variable-length encoding, so you should decide whether your strlen returns the number of characters or the number of bytes occupied.

For simplicity I will assume the number of bytes (not chars) is enough for you. If your input strings consist only of basic 7-bit ASCII characters ([0-9A-Za-z !@#$%^&*,.;'\<>?:"|{}] etc.; check any ASCII table; no accented chars like á, which would produce a multi-byte UTF-8 code), then the number of bytes also equals the number of characters (UTF-8 is backward compatible with 7-bit ASCII).

That means that, for example, if the buffer at d2 holds the string "Hell 1234", the memory there contains these values (hexadecimal): 48 65 6C 6C 20 31 32 33 34 00. (scanf with %s stops at whitespace, so typing that line would store only "Hell"; the full string is used here just to show the space byte.) If you check an ASCII table, you will see that byte 0x20 is the space character, and so on. The string is "nul terminated": the last value, zero, is part of the string but is not displayed; instead it is used by various C functions as the "end of string" marker.
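If you want to see those byte values for yourself without a debugger, a small C sketch like the following can print them. The function name hex_bytes is hypothetical, and the caller is assumed to provide an output buffer that is large enough.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Write the bytes of the C string `s`, including its zero terminator,
   into `out` as space-separated two-digit hex values.
   `out` must have room for 3 * (length of s + 1) bytes. */
void hex_bytes(const char *s, char *out) {
    assert(s != NULL && out != NULL);
    size_t i = 0;
    for (;;) {
        unsigned char b = (unsigned char)s[i];
        if (i > 0) {
            *out++ = ' ';                    /* separator between bytes */
        }
        out += sprintf(out, "%02X", (unsigned)b);
        if (b == 0) {
            break;                           /* terminator was just written */
        }
        i++;
    }
}
```

For the string "Hell 1234" this produces the ten values listed above, ending with the 00 terminator.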

So what you want to do in strlen is load some register, let's say rdi, with the address of d2, and then scan byte by byte (byte, because ASCII works in a "1 char = 1 byte" way, and we are ignoring UTF-8 variable-length codes) until you reach a zero value in memory, counting how many bytes it took to reach it. If you ponder this idea a bit to make it "short" for the CPU, and use SCASB for the scanning (you can also write it "manually" with ordinary mov/cmp/inc/jne/jnz if you wish), you may end up with this:

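rdi = d2 address
rdx = rdi  ; (copy of d2 address)
ecx = 255  ; maximum length of string (writing ecx zero-extends into rcx)
al  = 0    ; value to test against
repne scasb  ; repeat SCASB until a zero byte is found or rcx reaches 0
; if a zero was found, rdi points one byte past it
; (if the zero terminator is missing, rdi is d2+255 and the zero flag is clear)
rdi -= rdx ; rdi = number of bytes scanned, including the zero byte
rdi -= 1   ; rdi = length of string (only valid if a zero was found)
; return result as you wish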
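The same bounded scan can be sketched in C, which may help to pin down the expected result before writing the instructions. This is only an illustration: memchr stands in for the hardware scan, and the name bounded_strlen and the choice to return the limit when no terminator is found are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length in bytes of the string at `buf`, looking at no more than `max` bytes.
   memchr searches for the first byte equal to 0, like the scan above.
   Returns `max` if no zero terminator is found within the buffer. */
size_t bounded_strlen(const char *buf, size_t max) {
    assert(buf != NULL);
    const char *zero = memchr(buf, 0, max);
    return zero ? (size_t)(zero - buf) : max;
}
```

With a 255-byte buffer holding "Hell 1234" the result is 9; with a buffer that has no zero byte at all, the result is simply the limit that was passed in.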

So first you need to understand correctly what values you are manipulating, where they are, what their bit/byte size is, and what structure they have.

Then you can write instructions that perform any reasonable calculation on that data.

In your case the calculation is "length_of_string = number of non-zero bytes before the first zero byte in the 7-bit ASCII string stored in memory at address d2" (I mean after the scanf part of the code has succeeded).

Judging by your source, it looks to me like you don't understand what x86 CPU instructions do, and are just copying them from some examples. That will get you into trouble soon.

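For example, cmp 0, %rcx is presumably meant to check whether rcx (an 8-byte "wide" value) is equal to zero. (As written, without the $, AT&T syntax treats the 0 as a memory address, so it compares rcx with the 8 bytes at address 0 and will crash; the immediate form is cmp $0, %rcx.) And you loaded rcx with the value from rdx, which was something from the stack (maybe the d2 address), so rcx will never be zero.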

And even if you did load the character values from memory into rcx, you would load 8 of them at once, so you would miss the 0 value, as it would be only a single byte among garbage, like 0xCCCCCCCC00343332 (I'm using 0xCC for the memory after the terminator just as an example; any value may be there).
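A small C sketch can model the difference between the two loads. It assumes a little-endian machine such as x86-64, and the helper names load_qword and load_byte are invented for this illustration; the instructions named in the comments are what the helpers are meant to imitate.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load 8 bytes at once, roughly what `movq (%rdx), %rcx` does. */
uint64_t load_qword(const unsigned char *p) {
    assert(p != NULL);
    uint64_t word;
    memcpy(&word, p, sizeof word);   /* safe way to read 8 raw bytes in C */
    return word;
}

/* Load one byte, zero-extended, roughly what `movzbl (%rdx), %ecx` does. */
unsigned load_byte(const unsigned char *p) {
    assert(p != NULL);
    return *p;
}
```

For memory holding the bytes 32 33 34 00 CC CC CC CC, load_qword on the first byte returns a non-zero value (0xCCCCCCCC00343332 on little-endian hardware), while load_byte on the fourth byte returns 0, which is the comparison the loop actually needs.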

So that code doesn't make any sense. If you at least partially understand what CPU registers are and what instructions like mov/inc/cmp/... do, then you have some chance of producing working code simply by using a debugger a lot: verify after almost every 1-2 new instructions that the code manipulates the correct values, and fix it until you get it right.

That requires you to have a clear idea of what the "correct behaviour" is first (in this case: fetching values byte by byte from the d2 address, one after another, incrementing a "length" counter, and looking for the zero byte), so that you can tell whether the code does what you need or not.
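Written out in C, that "correct behaviour" is only a few lines, and each line corresponds to one or two instructions in the assembly loop. This is a sketch of the plain byte-counting algorithm, with the hypothetical name byte_strlen, not a replacement for the library strlen.

```c
#include <assert.h>
#include <stddef.h>

/* Count the bytes before the first zero byte, one byte at a time. */
size_t byte_strlen(const char *s) {
    assert(s != NULL);
    size_t length = 0;             /* the "length" counter */
    while (s[length] != 0) {       /* fetch one byte and compare it with zero */
        length++;                  /* not the terminator: count it and move on */
    }
    return length;
}
```

For "Hell 1234" this returns 9. For a string holding only á in UTF-8 (the two bytes C3 A1) it returns 2, which is the bytes-versus-characters difference mentioned earlier.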


What I wanted to point out with this answer is that instructions themselves, while important, are less important than your vision of the data, structures and algorithm used. Your question sounds like you have no idea what a "C string" is in x86 assembly, or which algorithm to use. That makes it impossible for you to just "guess" some instructions into the source and then verify whether you guessed right, because you can't tell what you want the code to do. That's why I said you should also check non-gas x86 assembly resources for the very basics (what a bit, a byte, computer memory, etc. are) until you somewhat understand which numeric values are manipulated to create, for example, "strings".

Once you have a good idea of what the code should do, it will be easy for you to catch things like swapped arguments in the debugger (for example: movq %rcx, d2; why do you put 8 bytes from rcx into memory at address d2? That will overwrite the input string). So you don't actually need to understand the instructions and gas syntax 100%, just well enough to produce something and then "fix" it over several iterations: checking the register and memory views, realizing that rcx didn't change but the string data got damaged, and trying it the other way.


Oh, and I completely forgot: you need to find the ABI documentation for your 64-bit platform, so you know the correct way to pass arguments to C functions.

For example, on Linux these tutorials may help: http://cs.lmu.edu/~ray/notes/gasexamples/

And search here for word "ABI" for further resources: https://stackoverflow.com/tags/x86/info

Ped7g
  • thank you very much for this answer! Just one small question, how I do scanf for the whole string? I figured out how do get single char or int, but still got problem with scanf string as more than one char. – Nana Mordov Drugobitski Jan 04 '17 at 19:25
  • as I understand, the most common way is to use .asciz "%c\n", but still it gets only the first character, and not the whole string. Is there any other option, because I need to allocate for the string 255 bytes without using dynamic allocation – Nana Mordov Drugobitski Jan 04 '17 at 19:36
  • The format string `"%254s"` tells [`scanf`](http://www.cplusplus.com/reference/cstdio/scanf/) to read a whole single word (using at most 255 bytes of buffer: 254 + nul terminator). To get a whole line you can use `gets`, or rather [`fgets`](http://www.cplusplus.com/reference/cstdio/fgets/), so you can specify the maximum length of the input (`gets` can be exploited by a malicious user). Please use an ordinary C (C++) reference guide for any problems with C functions. `%c` is a single char in the format string. – Ped7g Jan 04 '17 at 19:41
  • You can also call `scanf` with `"%c"` multiple times providing it with incremented buffer pointer to store value (starting at `d2`, then `d2+1`, ...). This is not suggestion (`fgets` is), I'm just trying to illustrate how there always several ways how to calculate the same result of particular calculation. Any medium sized calculation task you can write with thousands, millions, or lot more of possible code variants, which will produce the identical result. So again understanding your data and what result you are calculating is more important than particular instructions sequence. – Ped7g Jan 04 '17 at 19:46
  • And one more note ... :) *"the most common way"* (judging by Stack Overflow questions, and quality of some tutorials) is usually the wrong/inefficient one. If you think the common solution is weird, and you can do it better, try it! You may be actually right. If you are capable to reason about your task on "high" level first, simplifying all the formulas and dependencies, reusing intermediate results, buffers, values in registers and well designed subroutines -> you may often produce shorter, simpler and easier to understand instruction code, than "common way" example at Internet will show. – Ped7g Jan 04 '17 at 19:52
  • Thank you! The point is it is an assignment for c.s course, so there are some instructions. I need to do it using only scanf. So what I thought is to call scanf and put each time a single char into buffer multiple times in loop until it gets to \0. Is this a right way to approach this problem? – Nana Mordov Drugobitski Jan 04 '17 at 19:54
  • what I mean is refer to this string as a array of chars like in C – Nana Mordov Drugobitski Jan 04 '17 at 19:56
  • `char[]` in C is formed by consecutive bytes in memory, and your `d2` can work like that (my whole answer expects to work with it in this way, and also shows an example of how such a string looks as a sequence of byte values). Reading a single char at a "moving" pointer (d2, d2+1, d2+2, d2+3, ...) will thus build the string up from individual bytes. Just don't expect to receive `0`; there's no such char on the keyboard. Such inputs are usually ended by `enter`, so test for the value `10` or `13` instead, and write `0` into the buffer to terminate the string. Then the buffer at address `d2` will contain the whole "line". – Ped7g Jan 04 '17 at 20:00
  • Thank you so much! I have learned a lot from your answers, and really appreciate the effort. Hope one day my programming knowledge will reach yours :) as for what you have said, the loop will break at \n and then I just add a \o, all in ascii of course. Thanks again. I am going to try it – Nana Mordov Drugobitski Jan 04 '17 at 20:27
  • Yes... while the `"\n"` vs *"in ascii"* makes some sense (defined as value `10` = new line), the `"\0"` feels a bit weird to my ears. That escape with number means, that it is compiled as that number, so in C `"\0"` will compile as byte of value `0`. In `nasm` you have little reason to use escaped codes (especially as they don't work with `'` or `"`, only inside "backticks"), when you can write `0` as `0` directly. It's actually not that hard to reach my *knowledge* of programming (already shrinking by forgetting), but my *experience/practice* will be harder to beat, I'm still progressing ;) – Ped7g Jan 04 '17 at 20:44
  • Wait, you use `gas`, not `nasm`.. Then escaped chars probably work I think. – Ped7g Jan 04 '17 at 20:46
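
The character-by-character approach discussed in the comments (read with `"%c"`, stop at `10` or `13`, then write the `0` yourself) might look like this when sketched in C. The function name `read_line_chars` is hypothetical, and it reads from a `FILE *` with `fscanf` rather than from stdin with `scanf` only so that it can be tried on any stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Read one line from `in`, one character at a time, using "%c".
   Stops at '\n' (10), '\r' (13), end of input, or when cap - 1 bytes
   have been stored, then writes the terminating 0 byte.
   Returns the number of bytes stored, not counting the terminator. */
size_t read_line_chars(FILE *in, char *buf, size_t cap) {
    assert(in != NULL && buf != NULL && cap > 0);
    size_t n = 0;
    char c;
    while (n + 1 < cap && fscanf(in, "%c", &c) == 1) {
        if (c == '\n' || c == '\r') {
            break;                 /* end of line: do not store it */
        }
        buf[n++] = c;              /* store at the "moving" position */
    }
    buf[n] = '\0';                 /* terminate the string ourselves */
    return n;
}
```

The return value is already the string length in bytes, so with this approach the length falls out of the reading loop for free; a separate strlen is only needed if the assignment asks for one.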