For old-school BASIC programs, there was a difference between compilation and tokenization. Compilation converted the BASIC code to machine code, and the result was usually stored with a file extension indicating that it should be run directly rather than interpreted; often, this extension was some variation of “.BIN”. On personal computers at least, compilation usually required third-party software to convert the BASIC statements to machine code.
While BASIC programs could usually be saved as straight ASCII, with BASIC statements fully represented by their text, most BASICs tokenized saved programs by default. Tokenized files were usually saved with some variation of a .BAS file extension.
Tokenization was generally a one-to-one translation of BASIC statement/function to the one- or two-byte code for that statement or function. This saved space on the system; both disk space and RAM were limited on older personal computers. But it also made it much easier for the system to run the code on the fly—interpret it—and made the interpretation much faster.
Without tokenization, the difference between RESET and RESTORE in the Radio Shack Color Computer’s Extended Color BASIC, for example, won’t show up until comparing the fourth character. With tokenization, the difference shows up on comparing the first byte: 9D vs. 8F.
For example, this archiveteam.org page lists the tokenization numbers for GW-BASIC.
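As a rough illustration of the one-to-one translation described above, here is a minimal Python sketch of a tokenizer for a single line body. The two token values are the ones quoted for Extended Color BASIC; the names TOKENS and tokenize_body are made up for this example, and the sketch ignores string literals, REM comments, and two-byte tokens, all of which a real tokenizer would need to handle.

```python
# Token values for RESET and RESTORE are the ones quoted in the text;
# a real table would hold every keyword of the dialect.
TOKENS = {"RESET": 0x9D, "RESTORE": 0x8F}


def tokenize_body(text: str) -> bytes:
    """Replace known keywords with their one-byte token; copy the rest as ASCII."""
    out = bytearray()
    i = 0
    while i < len(text):
        # Try longer keywords first so a long keyword is never split
        # by a shorter one that happens to be its prefix.
        for word in sorted(TOKENS, key=len, reverse=True):
            if text.startswith(word, i):
                out.append(TOKENS[word])
                i += len(word)
                break
        else:
            # Not a keyword: store the character's ASCII value unchanged.
            out.append(ord(text[i]))
            i += 1
    return bytes(out)


print(tokenize_body("RESET(14,15)").hex(" "))  # 9d 28 31 34 2c 31 35 29
print(tokenize_body("RESTORE").hex(" "))       # 8f
```

After this step the interpreter can tell the two statements apart by looking at a single byte instead of comparing text.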
Detokenization, or conversion from the tokens to the textual representation of the statement, simply reverses the process. This reversal would have been performed every time the user listed the program. On a modern computer, a detokenizer should be easy to write in just about any modern scripting language. As long as you know the format, detokenization is just a matter of going through the tokenized file byte by byte and converting tokens back to their equivalent BASIC statement or function.
For example, bascat claims to detokenize GW-BASIC.
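The byte-by-byte conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table holds just the two Extended Color BASIC tokens mentioned in this answer, bytes below 0x80 are assumed to be plain ASCII, and WORDS and detokenize_body are names invented for the example. Quoted strings and two-byte tokens are not handled.

```python
# Reverse lookup: token byte -> keyword. Only the two tokens quoted
# in the text are included; a real table would cover the whole dialect.
WORDS = {0x9D: "RESET", 0x8F: "RESTORE"}


def detokenize_body(body: bytes) -> str:
    """Turn the bytes of one line body back into BASIC text."""
    parts = []
    for b in body:
        if b in WORDS:
            parts.append(WORDS[b])      # known token: emit the keyword
        elif b < 0x80:
            parts.append(chr(b))        # assumed to be plain ASCII
        else:
            parts.append(f"<{b:02X}>")  # unknown token: show a placeholder
    return "".join(parts)


print(detokenize_body(bytes.fromhex("9d2831342c313529")))  # RESET(14,15)
```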
Here’s an example of tokenization; I’m using the TRS-80 Color Computer’s Extended Color BASIC because I have the tools easily available to tokenize it, but the basic idea will be the same for most old-school BASICs.
The (somewhat nonsensical) BASIC program:
10 RESET(14,15)
20 RESTORE
A hex dump of the tokenized file:
00000000 26 0b 00 0a 9d 28 31 34 2c 31 35 29 00 26 11 00
00000010 14 8f 00 00 00
00000015
The first two bytes, 260B, are the address of the next line; when detokenizing from a file you would probably ignore these addresses, if your particular BASIC dialect uses them at all. (They’re mainly for running the code: as an example, if you have a GOTO 60 in a line, the interpreter can find line 60 without having to interpret tokens to get there.)
The second two bytes are the line number: 000A is 10. The next byte, 9D, is the tokenization of RESET. Then, 28 is the ASCII value of an open parenthesis, 31 is “1” and 34 is “4” (i.e., the 14 as the first parameter to RESET); 2C is a comma, 31 and 35 are the 1 and 5 of the second parameter to RESET, 29 is the close parenthesis, and 00 is the end of the line.
The next two bytes, 2611, are the address of the next line, and 0014 is the second line’s line number: 14 is hex for 20. Finally, 8F is the tokenization of RESTORE, a 00 ends the line, and the final two zero bytes end the program.
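Putting the walk-through together, a file-level detokenizer for this particular dump might look like the following Python sketch. It assumes the layout described above (a two-byte next-line address, a two-byte big-endian line number, the tokenized body, and a 00 terminator, with a zero address marking the end of the program) and knows only the two tokens used here. DUMP, WORDS, and list_program are names made up for the example; other BASICs may use a different byte order or a file header.

```python
import struct

# The 21 bytes from the hex dump above.
DUMP = bytes.fromhex(
    "260b000a9d2831342c31352900"  # line 10
    "261100148f00"                # line 20
    "0000"                        # end of program
)

# Only the two tokens discussed in the text.
WORDS = {0x9D: "RESET", 0x8F: "RESTORE"}


def list_program(data: bytes) -> list[str]:
    """Return the program as a list of text lines."""
    lines = []
    pos = 0
    while pos + 2 <= len(data):
        # Next-line address: only checked for the end marker, otherwise ignored.
        (next_addr,) = struct.unpack_from(">H", data, pos)
        if next_addr == 0:
            break
        (line_no,) = struct.unpack_from(">H", data, pos + 2)
        end = data.index(0, pos + 4)  # the 00 that ends this line
        body = data[pos + 4:end]
        # Known tokens become keywords; every other byte is kept as a character.
        text = "".join(WORDS.get(b, chr(b)) for b in body)
        lines.append(f"{line_no} {text}")
        pos = end + 1
    return lines


for line in list_program(DUMP):
    print(line)
# 10 RESET(14,15)
# 20 RESTORE
```

Searching for the terminator from the start of the body, rather than from the start of the line, matters because the address and line-number fields can themselves contain zero bytes, as 000A and 0014 do here.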