So given a token, you need to figure out if it's an instruction mnemonic. (If not, it could be a symbol declaration, or part of a macro).
Note that each mnemonic has multiple opcodes, and you need to choose one based on the operands (e.g. `mov r32, imm32` vs. `mov r32, r/m32` vs. `mov r/m32, imm32`). Sometimes there's a choice, and one encoding is shorter than another (e.g. a special opcode for shift/rotate with an immediate count of one, or when you can choose between `add r32, imm8` (sign-extended immediate) vs. `add r32, imm32`). Or, since this is just a toy assembler, keep the code simple and use YASM to generate more optimal code for actual use.
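To make the "shorter encoding" choice concrete, here is a minimal sketch of picking between the imm8 and imm32 forms of add with a 32-bit register destination. The function name `encode_add_r32_imm` and the register numbering parameter are made up for illustration. It assumes 32-bit operand size with a register-direct ModRM byte, and it ignores the separate short form that exists for `eax` with a 32-bit immediate.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: encode `add r32, imm` using the shortest of two forms.
// reg is the 3-bit register number (0 = eax, 1 = ecx, 2 = edx, 3 = ebx, ...).
std::vector<std::uint8_t> encode_add_r32_imm(unsigned reg, std::int32_t imm) {
    // ModRM: mod = 11 (register direct), reg field = /0 (selects ADD), rm = reg.
    const auto modrm = static_cast<std::uint8_t>(0xC0 | (reg & 7));

    if (imm >= -128 && imm <= 127) {
        // 83 /0 ib: the 8-bit immediate is sign-extended to 32 bits by the CPU.
        return {0x83, modrm, static_cast<std::uint8_t>(imm)};
    }

    // 81 /0 id: full 32-bit immediate, stored little-endian.
    std::vector<std::uint8_t> out = {0x81, modrm};
    const auto u = static_cast<std::uint32_t>(imm);
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<std::uint8_t>(u >> (8 * i)));
    }
    return out;
}
```

A toy assembler could just always emit the imm32 form and still produce correct code; the range check is the only extra work needed to get the 3-byte version.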
The standard choice for looking something up with a string as a key is a hash table. C++ has `std::unordered_map`. You're right that a linear search of a table of strings is a bad idea. Your idea of doing a `switch` on the first 4 chars is not bad, but it won't work well in practice because the set of sequences you want to recognize is very sparse. (Only a couple hundred insn mnemonics out of 2^32 possibilities, so a lookup table isn't viable.) This is why hashes exist.
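A minimal sketch of the hash-table lookup with `std::unordered_map` might look like the following. The `Mnemonic` enum, the handful of entries, and the `lookup_mnemonic` name are placeholders for illustration; a real table would have a couple hundred entries and probably map to richer per-instruction data. It assumes C++17 for `std::optional`.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical set of mnemonics; a real assembler would have many more.
enum class Mnemonic { Mov, Add, Sub, Jmp };

// Returns the mnemonic for a token, or std::nullopt if the token is not
// an instruction mnemonic (it might be a label, a directive, a macro, ...).
std::optional<Mnemonic> lookup_mnemonic(const std::string& token) {
    // Built once on first call; lookups are O(1) on average.
    static const std::unordered_map<std::string, Mnemonic> table = {
        {"mov", Mnemonic::Mov},
        {"add", Mnemonic::Add},
        {"sub", Mnemonic::Sub},
        {"jmp", Mnemonic::Jmp},
    };

    const auto it = table.find(token);
    if (it == table.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

The lookup is case-sensitive as written, so a caller that wants to accept `MOV` would need to lowercase the token first.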
One trick I've heard of is to keep keywords in the symbol table, with a flag that says they're a keyword. So you only have one hash-table lookup for a token, rather than looking for it as a mnemonic, then as a directive, then as a symbol.
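A rough sketch of that single-table trick, assuming a `kind` flag on each entry. All the names here (`SymKind`, `Symbol`, `SymbolTable`, the pre-loaded keywords, and the meaning of `value`) are invented for illustration, not taken from any particular assembler.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical tag saying what a name in the table refers to.
enum class SymKind { Mnemonic, Directive, Label };

struct Symbol {
    SymKind kind;
    std::uint32_t value;  // mnemonic id, directive id, or label address
};

class SymbolTable {
public:
    SymbolTable() {
        // Keywords are pre-loaded into the same table that user symbols go in.
        table_.emplace("mov", Symbol{SymKind::Mnemonic, 0});
        table_.emplace("add", Symbol{SymKind::Mnemonic, 1});
        table_.emplace("section", Symbol{SymKind::Directive, 0});
    }

    // One hash lookup per token; the caller switches on kind afterwards.
    const Symbol* find(const std::string& name) const {
        const auto it = table_.find(name);
        return it == table_.end() ? nullptr : &it->second;
    }

    // Returns false if the name is already taken, including by a keyword.
    bool define_label(const std::string& name, std::uint32_t address) {
        return table_.emplace(name, Symbol{SymKind::Label, address}).second;
    }

private:
    std::unordered_map<std::string, Symbol> table_;
};
```

A side effect of this layout is that a label named like a mnemonic gets rejected for free, since the insertion fails when the key already exists.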
There are many data structures for storing a dictionary that you can match strings against. A Trie or Radix Trie could be a good choice. Since you need to fetch associated data, a DAWG is probably not a good choice.
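For comparison, a bare-bones trie that carries associated data could be sketched like this. `MnemonicTrie` and its integer `id` payload are hypothetical, and to keep it short it only accepts lowercase `a` to `z`, which is a simplification: some real mnemonics contain digits.

```cpp
#include <array>
#include <memory>
#include <string_view>

// Hypothetical trie mapping lowercase words to a non-negative integer id.
class MnemonicTrie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> child;  // one slot per letter
        int id = -1;                                  // -1: no word ends here
    };
    Node root_;

    static int index(char c) {
        return (c >= 'a' && c <= 'z') ? c - 'a' : -1;
    }

public:
    // Returns false if the word contains an unsupported character.
    // (Nodes created before the bad character are left in place; they are
    // harmless because their id stays -1.)
    bool insert(std::string_view word, int id) {
        Node* n = &root_;
        for (char c : word) {
            const int i = index(c);
            if (i < 0) return false;
            if (!n->child[i]) n->child[i] = std::make_unique<Node>();
            n = n->child[i].get();
        }
        n->id = id;
        return true;
    }

    // Returns the stored id, or -1 if the word is not in the trie.
    int find(std::string_view word) const {
        const Node* n = &root_;
        for (char c : word) {
            const int i = index(c);
            if (i < 0 || !n->child[i]) return -1;
            n = n->child[i].get();
        }
        return n->id;
    }
};
```

Because the payload lives on the node where a word ends, `mov` and `movzx` can share a prefix and still map to different data, which is the property a DAWG gives up when it merges shared suffixes.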
There are data structures and algorithms for so many different things that you can usually expect to find something with the right search terms. "match string against a set of strings" doesn't actually come up with any obvious google hits about hash tables on the first page, though. I'm not sure what search terms would find hash tables if you didn't already know about their existence.