
I have a legacy program that I assume was written in BASIC (a file with the ending .bas). It seems that it has been compiled, though! When opened in a hex editor, the strings and comments are readable, but the part where the calculations are done is not. AFAIK, BASIC was/is an interpreted language.

Question:

  • have there been, in the past, compilers or runtime environments for BASIC where the compiled output is stored as a .bas file?
  • is there a decompiler?
Thomas
    I believe some BASIC compilers did exist, but you're right that the output from such a compilation probably wouldn't use the .BAS extension. I think it's more likely that what you have there is some binary format of the source code where parts are stored in binary form, some not, but I'm obviously just guessing here. – 500 - Internal Server Error Jan 07 '21 at 10:38
    I know for a fact that GW-BASIC (the ancient text-and-simple-graphics BASIC that came with some MS-DOS versions) stored its source code in some internal encoding but still used `.BAS` extensions, so maybe that's what it is. I don't know whether QBasic (its successor) or QuickBasic (the "real developers" version of QBasic) continued this tradition, but it's possible. – Joachim Sauer Jan 07 '21 at 11:59
    GW BASIC - that's it! Thank you very much! – Thomas Jan 07 '21 at 14:13

1 Answer


For old-school BASIC programs, there was a difference between compilation and tokenization. Compilation converted the BASIC code to machine code, and the result would usually be stored with a file extension indicating that the code should be run directly rather than interpreted; often, this extension was some variation of “.BIN”. On personal computers, at least, compilation usually required third-party software to convert the BASIC statements to machine code.

While BASIC programs could usually be saved as straight ASCII, with BASIC statements fully represented by their text, most BASICs tokenized saved programs by default. Tokenized files were usually saved with some variation of a .BAS file extension.

Tokenization was generally a one-to-one translation of BASIC statement/function to the one- or two-byte code for that statement or function. This saved space on the system; both disk space and RAM were limited on older personal computers. But it also made it much easier for the system to run the code on the fly—interpret it—and made the interpretation much faster.
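To make the one-to-one idea concrete, here is a toy tokenizer sketch in Python. Only the two Extended Color BASIC token values given in this answer (9D for RESET, 8F for RESTORE) are filled in; a real tokenizer would carry the interpreter's full keyword table and also handle string literals and comments, which were typically stored untokenized:

```python
# Toy tokenizer sketch. KEYWORDS holds only the two Extended Color BASIC
# tokens mentioned in this answer; a real table would cover every keyword.
KEYWORDS = {"RESET": 0x9D, "RESTORE": 0x8F}

# Check longer keywords first so a keyword that happens to be a prefix of
# another one can never shadow it.
ORDERED = sorted(KEYWORDS, key=len, reverse=True)

def tokenize_line(text: str) -> bytes:
    out = bytearray()
    i = 0
    while i < len(text):
        for kw in ORDERED:
            if text.startswith(kw, i):
                out.append(KEYWORDS[kw])   # one byte replaces the whole keyword
                i += len(kw)
                break
        else:
            out.append(ord(text[i]))       # everything else stays plain ASCII
            i += 1
    return bytes(out)
```

Tokenizing `RESET(14,15)` this way yields the single byte 9D followed by the plain ASCII bytes for `(14,15)`: the keyword costs one byte instead of five, and the interpreter can dispatch on that one byte.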

Without tokenization, the difference between RESET and RESTORE in the Radio Shack Color Computer’s Extended Color BASIC, for example, wouldn’t show up until the fourth character is compared. With tokenization, the difference shows up on the first character: 9D vs. 8F.

For example, this archiveteam.org page lists the tokenization numbers for GW-BASIC.

Detokenization, or conversion from the tokens back to the textual representation of the statement, simply reverses the process. This reversal would have been performed every time the user listed the program. On a modern computer, a detokenization program can easily be written in just about any modern scripting language. As long as you know the format, detokenization is just a matter of going through the tokenized file byte by byte and converting tokens back to their equivalent BASIC statements or functions.
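As a sketch of that, here is a minimal detokenizer in Python. It assumes the file layout walked through at the end of this answer (a two-byte big-endian next-line address, a two-byte line number, token and ASCII bytes, a 00 terminator per line, and two zero bytes ending the program); the token table contains only the two tokens discussed here, so any other token byte is shown as a hex placeholder:

```python
# Minimal detokenizer sketch for one tokenized-BASIC layout. The token
# values for RESET/RESTORE are from Extended Color BASIC; the layout and
# byte order follow the example in this answer and vary between BASICs.
TOKENS = {0x9D: "RESET", 0x8F: "RESTORE"}

def detokenize(data: bytes) -> str:
    lines = []
    i = 0
    while i + 2 <= len(data):
        if data[i:i + 2] == b"\x00\x00":       # zero address: end of program
            break
        i += 2                                  # skip the next-line address
        line_no = int.from_bytes(data[i:i + 2], "big")
        i += 2
        text = []
        while data[i] != 0x00:                  # 00 terminates the line
            b = data[i]
            if b >= 0x80:                       # high bit set: a token
                text.append(TOKENS.get(b, f"<{b:02X}>"))
            else:                               # otherwise plain ASCII
                text.append(chr(b))
            i += 1
        i += 1                                  # step past the terminator
        lines.append(f"{line_no} {''.join(text)}")
    return "\n".join(lines)
```

Feeding it the 21 bytes of the hex dump shown below reproduces the two-line program exactly.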

For example, bascat claims to detokenize GW-BASIC.

Here’s an example of tokenization; I’m using the TRS-80 Color Computer’s Extended Color BASIC because I have the tools easily available to tokenize it, but the basic idea will be the same for most old-school BASICs.

The (somewhat nonsensical) BASIC program:

10 RESET(14,15)
20 RESTORE

A hex dump of the tokenized file:

00000000  26 0b 00 0a 9d 28 31 34  2c 31 35 29 00 26 11 00
00000010  14 8f 00 00 00                                  
00000015

The first two bytes are the address of the next line; when detokenizing from a file, you would probably ignore these addresses. (They’re mainly for running the code: as an example, if a line contains a GOTO 60, the interpreter can find line 60 without having to interpret tokens to get there.)

The next two bytes are the line number: 000A is 10. The next byte, 9D, is the tokenization of RESET. Then 28 is the ASCII value of an open parenthesis; 31 is “1” and 34 is “4” (the 14 that is the first parameter to RESET); 2C is a comma; 31 and 35 are the 1 and 5 of the second parameter to RESET; 29 is the close parenthesis; and 00 is the end of the line.

The next two bytes are the address of the next line, and 0014 is the second line’s line number: hex 14 is 20. Finally, 8F is the tokenization of RESTORE, a 00 ends the line, and the final two zero bytes end the program.
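As an aside, this line structure is what makes the next-line addresses useful: an interpreter can hop from line header to line header without decoding any tokens. Here is a hypothetical Python sketch of that hop over a file, building a map from line number to file offset (offsets rather than memory addresses, since we are reading from disk):

```python
# Sketch: walk the line headers to map line numbers to file offsets,
# loosely analogous to how the interpreter finds a GOTO target without
# interpreting the tokens in between. Layout as in the hex dump above.
def index_lines(data: bytes) -> dict:
    index = {}
    i = 0
    while i + 2 <= len(data) and data[i:i + 2] != b"\x00\x00":
        i += 2                                  # skip the next-line address
        line_no = int.from_bytes(data[i:i + 2], "big")
        i += 2
        index[line_no] = i                      # offset of the first token
        while data[i] != 0x00:                  # scan to the line terminator
            i += 1
        i += 1
    return index
```

For the dump above this yields {10: 4, 20: 17}: line 10’s tokens start at offset 4 (the 9D byte) and line 20’s at offset 17 (the 8F byte).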

Jerry Stratton