Efficiently comparing a relatively large number of strings with varying lengths

Question

For a school project I wrote an x86 disassembler and just so I have something a bit more usable I'd like to make a complementary assembler. The problem is that I'm not really sure how I can efficiently compare the opcode to a list of char*s.

Calling `strcmp` in excess would surely cause lag. For those with experience, what's the best thing to do? Should I switch on a dword of the first 4 characters and go on from there? Get a checksum of each? I suppose this could be seen as opinionated and controversial, but there's surely an accepted and efficient way to do something like this. I'm just not really sure how. I'm mainly concerned with efficiency because I want to be able to send it a whole file and have it write the output.

You mean insn mnemonic (like `add`), not opcode (one or more bytes), right? — Peter Cordes, Nov 28 '15 at 03:23
If I were you, I'd consider looking at the source-code for either NASM or FASM. Each of them have covered this ground already. You can find them here: http://www.nasm.us/pub/nasm/snapshots/latest/ and here: http://flatassembler.net/download.php — enhzflep, Nov 28 '15 at 03:28
I don't know what assembly language instruction set you are targeting, but you could always consider seeing if [`gperf`](https://www.gnu.org/software/gperf/manual/gperf.html) might produce a reasonable hash function — Michael Petch, Nov 28 '15 at 03:38
Thanks for good answers, I'll use unordered_table. Perfect and surprised before now I've yet to run across it. — LUPE, Nov 28 '15 at 03:42

Peter Cordes · Accepted Answer · 2015-11-28T10:22:07.470

So given a token, you need to figure out if it's an instruction mnemonic. (If not, it could be a symbol declaration, or part of a macro).

Note that each mnemonic has multiple opcodes, and you need to choose based on the operands (e.g. `mov r32, imm32` vs. `mov r32, r/m32` vs. `mov r/m32, imm32`). Sometimes there's a choice, and one encoding is shorter than another (e.g. a special opcode for shift/rotate with an immediate count of one, or when you can choose between `add r32, imm8` (sign-extended immediate) and `add r32, imm32`). Or, since this is just a toy assembler, keep the code simple and use YASM to generate more optimal code for actual use.
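To make the "choose the shorter encoding" point concrete, here is a minimal sketch for `add r32, imm` only. The function name `encode_add_r32_imm` is made up for illustration. It assumes 32-bit operand size, registers 0-7 (no REX prefix), and it ignores the `eax`-specific short form, so treat it as an illustration rather than a complete encoder.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode `add r32, imm`, preferring the shorter
// sign-extended imm8 form (opcode 0x83 /0 ib) when the immediate fits,
// and falling back to the imm32 form (opcode 0x81 /0 id) otherwise.
// `reg` is the 3-bit register number (0 = eax, 1 = ecx, ...).
std::vector<std::uint8_t> encode_add_r32_imm(unsigned reg, std::int32_t imm) {
    assert(reg < 8 && "registers needing a REX prefix are not handled here");

    // ModRM: mod = 11 (register direct), reg field = 0 (the /0 that selects ADD),
    // rm = destination register.
    const std::uint8_t modrm = static_cast<std::uint8_t>(0xC0 | reg);

    std::vector<std::uint8_t> bytes;
    if (imm >= -128 && imm <= 127) {
        bytes.push_back(0x83);
        bytes.push_back(modrm);
        bytes.push_back(static_cast<std::uint8_t>(imm));
    } else {
        bytes.push_back(0x81);
        bytes.push_back(modrm);
        const std::uint32_t u = static_cast<std::uint32_t>(imm);
        for (int i = 0; i < 4; ++i) {
            // x86 immediates are stored little-endian.
            bytes.push_back(static_cast<std::uint8_t>(u >> (8 * i)));
        }
    }
    return bytes;
}
```

With this sketch, `add eax, 5` comes out as 3 bytes and `add ecx, 0x12345678` as 6 bytes. A toy assembler could skip the first branch entirely and always emit the imm32 form.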

The standard choice for looking something up with a string as a key is a Hash Table. C++ has std::unordered_map. You're right that a linear search of a table of strings is a bad idea. Your idea of doing a switch on the first 4 chars is not bad, but it won't work well in practice because the set of sequences you want to recognize is very sparse. (Only a couple hundred insn mnemonics in 2^32 possibilities, so a lookup table isn't viable). This is why hashes exist.
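A rough sketch of the hash-table approach with `std::unordered_map`. The `MnemonicInfo` struct, its fields, and the three entries are placeholders I made up for illustration; a real table would have a few hundred entries and would point at a list of candidate encodings.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical per-mnemonic data. The fields are illustrative only.
struct MnemonicInfo {
    int id;             // arbitrary identifier, not a real opcode
    int operand_count;  // how many operands this sketch expects
};

// Built once on first use, then only read.
const std::unordered_map<std::string, MnemonicInfo>& mnemonic_table() {
    static const std::unordered_map<std::string, MnemonicInfo> table = {
        {"mov", {0, 2}},
        {"add", {1, 2}},
        {"ret", {2, 0}},
    };
    return table;
}

// Returns the entry for `token`, or nullptr if it is not a known mnemonic.
// Average cost is one hash of the token plus (usually) one string compare,
// regardless of how many mnemonics are in the table.
const MnemonicInfo* find_mnemonic(const std::string& token) {
    const auto& table = mnemonic_table();
    const auto it = table.find(token);
    return it == table.end() ? nullptr : &it->second;
}
```

This sketch is case-sensitive, so you would probably want to lowercase the token before the lookup if you accept `MOV` as well as `mov`.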

One trick I've heard of is to keep keywords in the symbol table, with a flag that says they're a keyword. So you only have one hash-table lookup for a token, rather than looking for it as a mnemonic, then as a directive, then as a symbol.
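Something like the following is what I have in mind for the single-table trick. All names here (`SymbolKind`, `Symbol`, `SymbolTable`, `define_label`) are invented for the sketch, and the pre-seeded keywords are just examples.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// The "flag" that says what a name is.
enum class SymbolKind { Mnemonic, Directive, Label };

struct Symbol {
    SymbolKind kind;
    std::uint32_t value;  // label address, or an id for keywords (illustrative)
};

class SymbolTable {
public:
    SymbolTable() {
        // Keywords are inserted up front, so they live in the same table as labels.
        entries_.emplace("mov", Symbol{SymbolKind::Mnemonic, 0});
        entries_.emplace("add", Symbol{SymbolKind::Mnemonic, 1});
        entries_.emplace("db", Symbol{SymbolKind::Directive, 0});
    }

    // Returns false if the name is already taken, including by a keyword.
    bool define_label(const std::string& name, std::uint32_t address) {
        return entries_.emplace(name, Symbol{SymbolKind::Label, address}).second;
    }

    // One lookup per token; the caller branches on Symbol::kind afterwards.
    const Symbol* lookup(const std::string& token) const {
        const auto it = entries_.find(token);
        return it == entries_.end() ? nullptr : &it->second;
    }

private:
    std::unordered_map<std::string, Symbol> entries_;
};
```

A side effect of sharing one table is that a label that collides with a keyword gets rejected for free, since the insert fails.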

There are many data structures for storing a dictionary that you can match strings against. A Trie or Radix Trie could be a good choice. Since you need to fetch associated data, a DAWG is probably not a good choice.
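For comparison, a bare-bones trie that maps lowercase mnemonics to an integer id could look like this. The class name `MnemonicTrie` is made up, it assumes an ASCII-like character set with keys made of `a`-`z` and `0`-`9` only, and it uses `-1` to mean "no value", so stored ids should be non-negative. It is not a radix trie (no path compression).

```cpp
#include <array>
#include <memory>
#include <string>

class MnemonicTrie {
public:
    // Returns false if the key contains a character outside a-z / 0-9.
    bool insert(const std::string& key, int value) {
        Node* node = &root_;
        for (char c : key) {
            const int i = slot(c);
            if (i < 0) return false;
            if (!node->next[i]) node->next[i].reset(new Node());
            node = node->next[i].get();
        }
        node->value = value;
        return true;
    }

    // Returns the stored value, or -1 if `key` is not a complete entry.
    // Walks one node per character, so cost depends on key length only.
    int find(const std::string& key) const {
        const Node* node = &root_;
        for (char c : key) {
            const int i = slot(c);
            if (i < 0 || !node->next[i]) return -1;
            node = node->next[i].get();
        }
        return node->value;
    }

private:
    struct Node {
        std::array<std::unique_ptr<Node>, 36> next;  // 26 letters + 10 digits
        int value = -1;                              // -1: no key ends here
    };

    // Map a character to a child index, or -1 if it is not allowed.
    static int slot(char c) {
        if (c >= 'a' && c <= 'z') return c - 'a';
        if (c >= '0' && c <= '9') return 26 + (c - '0');
        return -1;
    }

    Node root_;
};
```

After inserting `add`, `adc` and `addps`, a lookup of `ad` or `addp` returns -1 because those prefixes are not entries themselves. For a couple hundred short keys I'd still expect the hash table to be the simpler choice.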

There are data structures and algorithms for so many different things that you can usually expect to find something with the right search terms. "match string against a set of strings" doesn't actually come up with any obvious google hits about hash tables on the first page, though. I'm not sure what search terms would find hash tables if you didn't already know about their existence.
