How are strings embedded in binary files?

Question

I'm writing my own bytecode and virtual machine (on .NET), and one thing I can't figure out is how to embed strings into my bytecode. Any ideas on how I should do it?

It's called bytecode in Java. In .NET, it's CIL, so I've updated your tags. Have you consulted a CIL reference? — Steven Sudit, Sep 19 '09 at 05:11
No, I'm writing a program on the .NET platform that interprets an array of bytes as my own custom flavor of bytecode. — RCIX, Sep 19 '09 at 05:12
What I can't figure out is how to embed anything other than numbers into that array. — RCIX, Sep 19 '09 at 05:14

score 1 · Accepted Answer · answered Sep 19 '09 at 05:31

Apparently you're defining your very own bytecode. This has nothing to do with the syntax/grammar of .NET CIL, right?

If so, and if your concern is how to encode strings (as opposed to other instructions such as jumps, loops, etc.), you can just invent your own "instruction" for it.

For example, the hex code "01xx" could introduce a string: opcode 01, then a length byte xx (0-255), then xx bytes of string data. Your interpreter would then be taught to store this string on the stack (or wherever) and resume decoding at the instruction located xx bytes further down the bytecode stream.
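A minimal sketch of that length-prefixed idea in C, in case it helps. The opcode values and the names emit_push_str and run are invented for illustration, and "storing the string" is reduced to copying it into a caller-supplied buffer rather than a real VM stack:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_HALT = 0x00, OP_PUSH_STR = 0x01 };

/* Append "OP_PUSH_STR, length, bytes..." to code at offset pos.
   The caller must make sure code has room for 2 + strlen(s) bytes.
   Returns the offset just past the emitted instruction. */
size_t emit_push_str(uint8_t *code, size_t pos, const char *s)
{
    size_t len = strlen(s);
    assert(len <= 255);              /* the length must fit in one byte */
    code[pos++] = OP_PUSH_STR;
    code[pos++] = (uint8_t)len;
    memcpy(code + pos, s, len);      /* raw bytes, no terminator stored */
    return pos + len;
}

/* Walk the bytecode. Each string instruction copies its payload into out
   (NUL-terminated), so out ends up holding the last string seen.
   Returns the number of string instructions, or -1 on malformed input. */
int run(const uint8_t *code, size_t size, char *out, size_t out_size)
{
    size_t pc = 0;
    int strings = 0;

    while (pc < size) {
        uint8_t op = code[pc++];
        if (op == OP_HALT)
            break;
        if (op != OP_PUSH_STR)
            return -1;               /* unknown opcode */
        if (pc >= size)
            return -1;               /* length byte is missing */
        size_t len = code[pc++];
        if (pc + len > size || len + 1 > out_size)
            return -1;               /* payload truncated or out too small */
        memcpy(out, code + pc, len);
        out[len] = '\0';
        pc += len;                   /* skip the payload to the next opcode */
        strings++;
    }
    return strings;
}
```

Emitting "hi" and then "vm!" this way produces the bytes 01 02 68 69 01 03 76 6D 21, and the interpreter never mistakes string data for opcodes because it always skips exactly the announced number of bytes.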

If your concern is how to mix character data and numeric data in whatever storage you have for the bytecode, please provide specifics and maybe someone can help...

mjv

Correct, I'm making my own. I kind of get what you're saying, but each instruction in my bytecode consists of 4 separate bytes (1 for the opcode and 3 others whose purpose varies with the instruction), and I'd like to avoid having variable-length instructions. It could be safely achieved by encoding the length of the data in the instruction itself, but that would make it much more complex... – RCIX Sep 19 '09 at 05:39
I see the advantages of having the bytecode with a fixed length and format. In that case the strings may just be implemented as an instruction for variable declaration (which you may already have designed), whereby the operand is the index (be it address, offset, subscript...) of the place where the actual string is stored. The difference from a regular variable is that the storage where the string resides is initialized with the string value. Indeed, with 3 operand bytes you may find yourself limited for other types than just strings (say, how do you encode a numeric value bigger than 8 million?). – mjv Sep 19 '09 at 05:51
That's another thing I'm a bit puzzled about as well... But I may just go ahead and do that. Thanks! – RCIX Sep 21 '09 at 05:47
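The fixed-width variant discussed in these comments could look roughly like the sketch below: every instruction stays 4 bytes, and the 3 operand bytes hold an index into a separate string table instead of the string itself. The opcode values, the big-endian operand layout, and the names encode, operand_of and run_fixed are all assumptions made for illustration, not anything the thread specifies:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes, invented for this sketch. */
enum { OP_HALT = 0x00, OP_LOAD_STR = 0x02 };

/* Pack an opcode and a 24-bit operand into one 4-byte instruction
   (operand stored most significant byte first, an arbitrary choice). */
void encode(uint8_t *insn, uint8_t op, uint32_t operand)
{
    assert(operand <= 0xFFFFFF);     /* only 3 bytes are available */
    insn[0] = op;
    insn[1] = (uint8_t)(operand >> 16);
    insn[2] = (uint8_t)(operand >> 8);
    insn[3] = (uint8_t)operand;
}

/* Recover the 24-bit operand from a 4-byte instruction. */
uint32_t operand_of(const uint8_t *insn)
{
    return ((uint32_t)insn[1] << 16) | ((uint32_t)insn[2] << 8) | insn[3];
}

/* Execute count fixed-width instructions. OP_LOAD_STR looks its operand up
   in the string table; the function returns the last string loaded, or
   NULL if nothing was loaded or the bytecode is malformed. */
const char *run_fixed(const uint8_t *code, size_t count,
                      const char *const *strings, size_t nstrings)
{
    const char *top = NULL;

    for (size_t i = 0; i < count; i++) {
        const uint8_t *insn = code + 4 * i;
        if (insn[0] == OP_HALT)
            break;
        if (insn[0] != OP_LOAD_STR)
            return NULL;             /* unknown opcode */
        uint32_t idx = operand_of(insn);
        if (idx >= nstrings)
            return NULL;             /* index outside the string table */
        top = strings[idx];
    }
    return top;
}
```

The string table would be serialized next to the code (for example as length-prefixed entries in a separate section of the file) and loaded before execution starts. The same trick covers large numbers: a constant that does not fit in 24 bits can live in a constant table and be referenced by index.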

score 0 · Answer 2 · answered Sep 19 '09 at 05:37

If you can store numbers in an array, then you can store ASCII data in the same array. Ignoring the idea of a string as a class, a simple string is just a character array anyway -- and in C, a byte with a value of 0 indicates the end of the string.

As a simple proof-of-concept in C:

#include <stdio.h>

int main(void)
{
    /* Each call writes one character, given by its ASCII code. */
    putchar(104); // h
    putchar(101); // e
    putchar(108); // l
    putchar(108); // l
    putchar(111); // o
    putchar(10);  // \n
    return 0;
}

Output:

$ ./a.out
hello
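To connect this to a byte array that also holds other values, here is a small sketch, assuming the text is stored as ASCII codes ending in a 0 byte. The array contents and the helper name string_at are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two arbitrary numeric bytes, then the text "hello" and its terminating 0. */
const uint8_t sample[] = { 42, 7, 104, 101, 108, 108, 111, 0 };

/* Return a pointer to the NUL-terminated text starting at offset, or NULL
   if the offset is out of range or no terminator exists before the end. */
const char *string_at(const uint8_t *buf, size_t size, size_t offset)
{
    assert(buf != NULL);
    if (offset >= size)
        return NULL;
    if (memchr(buf + offset, 0, size - offset) == NULL)
        return NULL;                 /* unterminated: unsafe to treat as a string */
    return (const char *)(buf + offset);
}
```

Calling string_at(sample, sizeof sample, 2) yields a pointer that can be passed straight to puts or printf("%s"). Something still has to tell the reader that the text starts at offset 2, which is the part a string instruction or string table has to supply.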

Maybe a reference on character arrays as strings would help?

It's not quite that simple. I'm trying to embed strings with other bytes (which happen to be instructions in my own custom format) and I'm not sure how to do that. — RCIX, Sep 19 '09 at 05:52
