-5

I compiled the following C code with GCC on Windows 10 (MinGW-w64):

#include <stdio.h>
int main(){
    printf("Hello World!");
    return 0;
}

with the command

gcc.exe -o test test.c

It works: when I run the resulting file I do get Hello World! in the console. However, I am surprised that when I open test.exe in Notepad++ it is 220 lines long, with some readable text in it such as

Address %p has no image-section VirtualQuery failed for %d bytes at address %p

and also

Unknown pseudo relocation protocol version %d. Unknown pseudo relocation bit size %d.

However, when I open the same file in Sublime Text 3, I get over 3300 lines of seemingly random numbers and letters such as:

4d5a 9000 0300 0000 0400 0000 ffff 0000
b800 0000 0000 0000 4000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 8000 0000
0e1f ba0e 00b4 09cd 21b8 014c cd21 5468
6973 2070 726f 6772 616d 2063 616e 6e6f
7420 6265 2072 756e 2069 6e20 444f 5320
6d6f 6465 2e0d 0d0a 2400 0000 0000 0000
5045 0000 6486 0f00 5aca 455d 0068 0000
9304 0000 f000 2700 0b02 021e 001e 0000
0038 0000 000a 0000 e014 0000 0010 0000
0000 4000 0000 0000 0010 0000 0002 0000
0400 0000 0000 0000 0500 0200 0000 0000
0020 0100 0004 0000 0e3e 0100 0300 0000
0000 2000 0000 0000 0010 0000 0000 0000
0000 1000 0000 0000 0010 0000 0000 0000
0000 0000 1000 0000 0000 0000 0000 0000
0080 0000 6c07 0000 0000 0000 0000 0000
0050 0000 7002 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

I also generated the assembly version, and this one looks the same in Notepad++ and Sublime:

    .file   "test.c"
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section .rdata,"dr"
.LC0:
    .ascii "Hello World!\0"
    .section    .text.startup,"x"
    .p2align 4,,15
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    subq    $40, %rsp    #,
    .seh_stackalloc 40
    .seh_endprologue
 # test.c:2: int main(){
    call    __main   #
 # test.c:3:    printf("Hello World!");
    leaq    .LC0(%rip), %rcx     #,
    call    printf   #
 # test.c:5: }
    xorl    %eax, %eax   #
    addq    $40, %rsp    #,
    ret 
    .seh_endproc
    .ident  "GCC: (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0"
    .def    printf; .scl    2;  .type   32; .endef

First question:

Why is the output different in Sublime Text and Notepad++?

Second question:

Where are the 0s and 1s? I thought machine code was only 0s and 1s.

Third question:

How come it's 3300 lines for just a simple hello world? Doesn't that sound grossly inefficient?

Thanks for any insight!

Mister Fresh
  • 1
    You should use xxd to understand this: 0s and 1s are binary, 0 to f is hex. For binary you can run `xxd -b test.exe`, and for hex `xxd test.exe`; you will then see what is inside the file in binary or hex, along with the ASCII output (what you see in Notepad++). – cpatricio Aug 04 '19 at 16:38

3 Answers

5

An .exe file is a binary file. Most of it is non-printable, non-human readable bytes. So your question actually boils down to, why are these two text editors doing two different things with a non-text file which they're not even designed to manipulate in the first place?

Buried within a binary file may be some human-readable strings. First of all, some fraction of the bytes in a binary file will be, by chance, in the printable set. Also, computer programs that contain text strings like "Can't open file" will typically end up containing those strings embedded, literally, in their binaries.

Typically, a text editor displays a binary file as garbage. It shows the printable characters it knows about, indiscriminately intermixed with "funny" representations of the nonprintable characters. (On Windows platforms, at least, it's not unusual for the nonprinting characters to be displayed using a mapping to the old MS-DOS character set, which did have special graphics characters in many of the nonprintable positions.) It looks like that's what Notepad++ is doing.

It looks like Sublime is noticing that the file is binary, and converting every byte in it to hexadecimal. That means you can't immediately see the printing characters, but you can uniformly see (as hexadecimal) all the characters, the printable and the nonprintable, side by side.

To make this more clear, let's look at a slightly different case. Consider this program:

#include <stdio.h>

int main()
{
    char binary[] = "\1\2\3Hello\4\5\6World\x1E\x1F\x20\x21";
    fwrite(binary, 1, sizeof(binary), stdout);
}

This program prints a mixture of text and binary characters to its standard output. If you compile and run this program and redirect its output to a file, you'll end up with a file with a mixture of text and binary characters in it, just like (in this respect) your .exe file.

If I print the output of this program in my normal environment, I get:

HelloWorld !

We can see the printable strings Hello and World as we might have expected, and a ! character as we might not have expected. In my normal environment, the unprintable characters print as nothing at all.

If I printed the output of this program in an MS-DOS environment (where, as I mentioned, a lot of those theoretically "unprintable" characters did have graphic representations), we might see something like

☺☻♥Hello♦♣♠World▲▼ !

If I run this program through a program that converts every byte to its hexadecimal representation, I get

01020348656C6C6F040506576F726C641E1F202100

Let's look at this carefully. It starts with hex 010203, which clearly corresponds to the leading "\1\2\3" of the string. Next comes 48656C6C6F, which if you look them up are the hexadecimal ASCII codes for the string "Hello". Next comes 040506, which corresponds to the "\4\5\6" part. Next comes 576F726C64, which is, you guessed it, "World". Next comes 1E1F2021, which is of course the final "\x1E\x1F\x20\x21". Finally, at the very end, there's 00, which is the '\0' character which the compiler automatically appended to the end of the string in the binary array.

You've probably figured this out, but hex 20 and 21 are the ASCII codes (hexadecimal) for the space and ! characters, so that's what those were doing in the output.
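A minimal sketch of such a byte-to-hex converter, in C. The function name bytes_to_hex is made up for this illustration, and it is not the tool used to produce the output above; it only shows that each byte becomes two hex digits, one for each group of 4 bits.

```c
#include <stddef.h>

/* Write the two-digit uppercase hex form of each of the n bytes in src
   into dst, followed by a terminating '\0'.
   dst must have room for at least 2*n + 1 chars.
   (bytes_to_hex is a hypothetical helper, named for this example only.) */
void bytes_to_hex(const unsigned char *src, size_t n, char *dst)
{
    static const char digits[] = "0123456789ABCDEF";
    for (size_t i = 0; i < n; i++) {
        dst[2 * i]     = digits[src[i] >> 4];    /* high 4 bits */
        dst[2 * i + 1] = digits[src[i] & 0x0F];  /* low 4 bits  */
    }
    dst[2 * n] = '\0';
}
```

Called on the 21-byte binary array from the program above (with n set to sizeof(binary)), it should produce the same 42 hex digits, in uppercase.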

If I run the output through the Unix/Linux command cat -v, which makes the nonprintable characters visible using a "control character" representation ^X, I get:

^A^B^CHello^D^E^FWorld^^^_ !^@
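For the curious, here is a rough sketch in C of how that caret notation can be produced. It is a simplified imitation of what cat -v does, not its actual source, and the name caret_notation is invented for this example: control characters get 64 added to their code and a ^ in front, which is why byte 1 shows up as ^A and byte 0 as ^@.

```c
#include <stddef.h>

/* Render n bytes in a caret notation similar to `cat -v`:
   control characters become ^X, DEL becomes ^?, and bytes with the
   high bit set get an "M-" prefix. Tab and newline pass through.
   dst must have room for at least 4*n + 1 chars.
   Returns the number of chars written, not counting the '\0'.
   (caret_notation is a hypothetical helper, named for this example only.) */
size_t caret_notation(const unsigned char *src, size_t n, char *dst)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c == '\t' || c == '\n') {   /* left alone, as cat -v does */
            dst[len++] = (char)c;
            continue;
        }
        if (c >= 128) {                 /* high bit set: prefix, then treat as 7-bit */
            dst[len++] = 'M';
            dst[len++] = '-';
            c -= 128;
        }
        if (c < 32) {                   /* control character: ^@ through ^_ */
            dst[len++] = '^';
            dst[len++] = (char)(c + 64);
        } else if (c == 127) {          /* DEL */
            dst[len++] = '^';
            dst[len++] = '?';
        } else {                        /* ordinary printable character */
            dst[len++] = (char)c;
        }
    }
    dst[len] = '\0';
    return len;
}
```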

Finally, here's one more representation of the output, run through a "hex dump" program which shows both the hexadecimal and text representations, side by side, but with nonprintable characters replaced by dots:

01 02 03 48 65 6c 6c 6f  04 05 06 57 6f 72 6c 64   ...Hello...World
1e 1f 20 21 00                                     .. !.           
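
A hex dump line like this is easy to build yourself. The following is a sketch in C under my own assumptions: the function name dump_line and the exact column spacing are made up here and do not exactly match the spacing of the tool used above. It formats up to 16 bytes as hex on the left and as text on the right, with a dot for anything nonprintable.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Format up to 16 bytes as one dump line: hex pairs on the left,
   then the same bytes as text with nonprintable ones shown as '.'.
   dst must have room for at least 66 chars (48 hex + 1 space + 16 text + '\0').
   (dump_line is a hypothetical helper, named for this example only.) */
void dump_line(const unsigned char *src, size_t n, char *dst)
{
    size_t pos = 0;
    if (n > 16)
        n = 16;
    for (size_t i = 0; i < 16; i++) {
        if (i < n)
            pos += (size_t)sprintf(dst + pos, "%02x ", src[i]);
        else
            pos += (size_t)sprintf(dst + pos, "   ");  /* pad short lines */
    }
    dst[pos++] = ' ';                                  /* gap before the text column */
    for (size_t i = 0; i < n; i++)
        dst[pos++] = isprint(src[i]) ? (char)src[i] : '.';
    dst[pos] = '\0';
}
```

Calling it once per 16-byte chunk of a file gives a dump in the same spirit as the one above.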
Steve Summit
2

Why does the Output differ?

Edit: I read that wrong... The first output is a hex dump of the raw machine code in the executable, the second is the human-readable assembly version of your main function. Assembly is just a readable notation for the same kind of instructions.

Where are the 0s and 1s?

They are right there - you are just not seeing them. To your computer, everything is 0s and 1s already. For a human being, that is just unreadable, so the hex dump shows you the 0s and 1s in hexadecimal chunks (https://en.wikipedia.org/wiki/Hexadecimal). This is just another numerical representation: each hex digit stands for 4 bits, so ffff for example translates to 1111111111111111 in binary. The previously mentioned assembly file also (for the purpose of this short explanation) translates directly into 0s and 1s. Assembly language is what programmers use to write and reverse-engineer actual machine code.
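
If you want to see the 0s and 1s for yourself, here is a small sketch in C. The function name byte_to_bits is made up for this illustration. It writes out the 8 bits of one byte, so you can check for example that the first two bytes of your dump, 4d and 5a, are 01001101 and 01011010.

```c
/* Write the 8 bits of one byte, most significant bit first, as '0'/'1'
   characters followed by '\0'. dst must have room for 9 chars.
   (byte_to_bits is a hypothetical helper, named for this example only.) */
void byte_to_bits(unsigned char byte, char *dst)
{
    for (int bit = 7; bit >= 0; bit--)
        *dst++ = ((byte >> bit) & 1) ? '1' : '0';
    *dst = '\0';
}
```

Running every byte of test.exe through something like this would give you the "real" binary view, which is eight times longer than the file and much harder to read than hex.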

Why is my program so long?

It's not. Your actual program is this:

main:
subq    $40, %rsp    
call    __main  
leaq    .LC0(%rip), %rcx  
call    printf 
xorl    %eax, %eax 
addq    $40, %rsp   
ret 

I suspect this question was asked out of curiosity (nothing wrong with that!), but you need to catch up on a lot of things before diving into disassembly and writing your own assembly-code. Try researching this for a start:

  • How computers represent data (ints, floats, chars, pointers) and why hexadecimal notation is helpful
  • The basics of computer-architecture
  • How data is stored (long-term, RAM, registers)
  • The purpose and function of the ISA (Instruction Set Architecture)
  • flag-registers, jumps and conditional jumps
  • function calls
  • the instruction-, stack- and base-pointer
  • calling-conventions
  • arithmetic and logical instructions
  • ...

It's a huge field and there is a lot to learn. This is not a complete study-guide, but I hope these pointers help you to start piecing the puzzle together - it's lots of fun :-)

3ch0
  • Thanks for the answer. Indeed most of this stuff I will probably not make much use of in my current field (web development) but I'm interested in having a better understanding of what happens at a lower level. I want to do a simple program in C using pointers and memory management and learn this way, but if you want to add some links that you find helpful I'll take a look. – Mister Fresh Aug 04 '19 at 17:27
1

The seemingly random numbers shown in Sublime are your program. Every four digits are 16 bits of your executable written in hexadecimal. That's how your computer sees the program. Sublime shows it this way because the .exe file opened as plain text would not be readable at all. Unfortunately, I don't know what Notepad++ shows you.

The assembly listing, on the other hand, is plain text, so it's shown in the same way in both Sublime and Notepad++.

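Regarding the size of the file: the executable contains more than your main function. Because the program includes stdio.h and calls printf, it is linked against the C library, and the linker also adds the executable headers and the MinGW-w64 startup code (the readable strings you saw in Notepad++ most likely come from that runtime code). Try compiling something simpler, which doesn't use any libraries, and compare the file size.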

And the size isn't that big. It's about 3300 lines, with 8 groups of four hex digits on each line, and each group is 16 bits. 3300 * 8 * 16 = 422 400 bits = 52 800 B ≈ 51.6 KiB. The file weighs about that, doesn't it?
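
The same arithmetic as a small C sketch, in case you want to check it against the real file. The helper names and the BYTES_PER_LINE constant are made up for this example, and it assumes the dump shows 16 bytes per line, as in the excerpt in the question.

```c
/* Assumption: each dump line shows 8 groups of 4 hex digits = 16 bytes. */
enum { BYTES_PER_LINE = 16 };

/* Number of bytes shown by `lines` full dump lines. */
unsigned long dump_bytes(unsigned long lines)
{
    return lines * BYTES_PER_LINE;
}

/* Number of dump lines needed for `bytes` bytes; the last line may be partial. */
unsigned long dump_lines(unsigned long bytes)
{
    return (bytes + BYTES_PER_LINE - 1) / BYTES_PER_LINE;
}
```

With these, dump_bytes(3300) gives 52800, and the exact file size of 54024 bytes reported in the comments works out to dump_lines(54024) = 3377 lines, the last one holding only 8 bytes.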

Hoxmot
  • 2
    This is wrong, each hex digit is 4 bits, 4 hex digits are 2 bytes (16 bits), 3300*8*2 = 52800 bytes (422400 bits) or 51.56 KiB. – cpatricio Aug 04 '19 at 16:50
  • @cpatricio In Windows Explorer the exe file is shown to be 54024 bytes (57344 on disk), with the exact line count being 3377 and the last line only half full. – Mister Fresh Aug 04 '19 at 16:58
  • Thanks for the answer. How would you output something to the console without including the standard library though ? – Mister Fresh Aug 04 '19 at 17:20
  • @MisterFresh I think it would require including some assembly in your code. However, I meant something which doesn't produce any output, like a simple `int a = 2 + 2;`. I know it doesn't make much sense and may be optimised away by the compiler, but it's just so you can see the difference in the file size. – Hoxmot Aug 04 '19 at 17:31