How to retrieve tokens from a string that are keywords?

Question

For example, if the input is x+=5, the program should return an array of x, +=, 5. Notice that there is no space between x and +=, so splitting by spaces only probably won't work, because then you would have to iterate through it all over again to find the keywords.

How would I do something like this? Is there an efficient way to do this in C?

The most generic advice is to use a lexer. You might want to google that. If you want something lighter weight, you should probably just code it by hand. — Chris Beck, Sep 08 '15 at 16:49
Please post your attempt so far, or provide a minimal working example (MWE) so we can see what you've tried so far, and steer you in the right direction. — Cloud, Sep 08 '15 at 16:49
A parser would read the expression character by character and apply the rules for identifiers, operators, etc. to extract the desired array. Also note that a good parser would take the 'longest' token, so it would not separate '+=' into two tokens. In general, you want to perform some lexical analysis on the expression to extract the tokens. This kind of activity will quickly escalate into a lot of code, especially if you properly handle all the edge cases. — user3629249, Sep 08 '15 at 19:38

score 5 · Accepted Answer · edited Nov 17 '17 at 22:05

Lexing is not specific to C (in the sense that you'll use similar techniques in other programming languages). You could do it with hand-written code (using finite-automaton coding techniques), use a lexer generator like flex, or even use regular expressions, e.g. the regex.h functions on POSIX systems.
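
As an illustration of the hand-written approach, here is a minimal sketch of a lexer for input such as x+=5. The names (tok_kind, struct token, next_token, tokenize) and the operator list are assumptions made up for this example, not part of any library. Operators are tried longest first, so "+=" is never split into "+" and "=".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds for this sketch. */
enum tok_kind { TOK_IDENT, TOK_NUMBER, TOK_OP, TOK_END, TOK_ERROR };

struct token {
    enum tok_kind kind;
    const char *start;  /* points into the input string, not NUL-terminated */
    size_t len;         /* number of bytes in the token */
};

/* Assumed operator set, ordered longest first so "+=" wins over "+". */
static const char *const ops[] = { "+=", "-=", "==", "+", "-", "*", "/", "=" };

/* Scan one token starting at *p and advance *p past it. */
static struct token next_token(const char **p)
{
    struct token t;
    const char *s;
    size_t i;

    assert(p != NULL && *p != NULL);
    s = *p;
    while (isspace((unsigned char)*s))
        s++;                                /* skip leading whitespace */
    t.start = s;

    if (*s == '\0') {
        t.kind = TOK_END;
    } else if (isalpha((unsigned char)*s) || *s == '_') {
        while (isalnum((unsigned char)*s) || *s == '_')
            s++;
        t.kind = TOK_IDENT;
    } else if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s))
            s++;
        t.kind = TOK_NUMBER;
    } else {
        t.kind = TOK_ERROR;
        for (i = 0; i < sizeof ops / sizeof ops[0]; i++) {
            size_t n = strlen(ops[i]);
            if (strncmp(s, ops[i], n) == 0) {
                s += n;
                t.kind = TOK_OP;
                break;
            }
        }
        if (t.kind == TOK_ERROR)
            s++;                            /* consume the unknown byte */
    }
    t.len = (size_t)(s - t.start);
    *p = s;
    return t;
}

/* Fill out[] with up to max tokens; returns how many were stored. */
static size_t tokenize(const char *src, struct token *out, size_t max)
{
    size_t n = 0;
    while (n < max) {
        struct token t = next_token(&src);
        if (t.kind == TOK_END)
            break;
        out[n++] = t;
    }
    return n;
}
```

Calling tokenize("x+=5", toks, 8) on an array of 8 tokens should store three of them: an identifier, an operator of length 2, and a number. To recognize keywords, one option is to compare each TOK_IDENT against a small table of reserved words after it has been scanned.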

Parsing is also a well known domain with standard techniques (at least for context free languages, if you want some efficiency). You could use recursive descent parsing, you could generate a parser using bison (which has examples very close to your homework) or ANTLR. Read more about LL parsing & LR parsing. BTW, parsing techniques can be used for lexing.
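
To give a flavor of recursive descent, here is a small sketch that parses and evaluates integer arithmetic directly from a string. The grammar and the function names (parse_expr, parse_term, parse_factor) are assumptions chosen for this example; each grammar rule becomes one function. Error handling is deliberately minimal.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed grammar for this sketch:
 *   expr   := term   { ('+' | '-') term }
 *   term   := factor { ('*' | '/') factor }
 *   factor := NUMBER | '(' expr ')'
 */
static long parse_expr(const char **p);

static void skip_ws(const char **p)
{
    while (isspace((unsigned char)**p))
        (*p)++;
}

static long parse_factor(const char **p)
{
    long v = 0;

    skip_ws(p);
    if (**p == '(') {
        (*p)++;                     /* consume '(' */
        v = parse_expr(p);
        skip_ws(p);
        if (**p == ')')
            (*p)++;                 /* a real parser would report a missing ')' */
        return v;
    }
    while (isdigit((unsigned char)**p)) {
        v = v * 10 + (**p - '0');   /* no overflow check in this sketch */
        (*p)++;
    }
    return v;
}

static long parse_term(const char **p)
{
    long v = parse_factor(p);

    for (;;) {
        skip_ws(p);
        if (**p == '*') {
            (*p)++;
            v *= parse_factor(p);
        } else if (**p == '/') {
            long d;
            (*p)++;
            d = parse_factor(p);
            v = (d != 0) ? v / d : 0;   /* sketch: division by zero yields 0 */
        } else {
            return v;
        }
    }
}

static long parse_expr(const char **p)
{
    long v;

    assert(p != NULL && *p != NULL);
    v = parse_term(p);
    for (;;) {
        skip_ws(p);
        if (**p == '+') {
            (*p)++;
            v += parse_term(p);
        } else if (**p == '-') {
            (*p)++;
            v -= parse_term(p);
        } else {
            return v;
        }
    }
}
```

The caller passes the address of a pointer into the input; after parse_expr returns, that pointer is left at the first character that was not consumed, which is a simple way to detect trailing garbage.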

BTW, there is plenty of free software illustrating these techniques: interpreters of scripting languages (Guile, Lua, Python, etc.), parsers for JSON, YAML, and XML, and several compilers (e.g. tinycc). You'll learn a lot by studying their source code.

It could be easier for you to sometimes have a lookahead of one or two characters, e.g. by first reading the entire line (with getline(3) or else fgets(3), or perhaps even readline, which gives you a line editor). If you cannot read a whole line, consider using fgetc(3) and ungetc(3) when needed. The character-classification functions from <ctype.h>, like isalpha, might be helpful.
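
When reading from a stream rather than a buffered line, one character of lookahead can be had with fgetc and ungetc. The sketch below uses a hypothetical helper, read_op, invented for this example: it reads an operator and peeks at the next character to decide between "+" and "+=". The set of operator characters is also an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Read one character from fp; if it is an operator character followed by '=',
 * consume both. Stores the result as a NUL-terminated string in buf and
 * returns its length, or 0 at end of input. */
static int read_op(FILE *fp, char buf[3])
{
    int c, next;

    assert(fp != NULL && buf != NULL);
    c = fgetc(fp);
    if (c == EOF)
        return 0;
    buf[0] = (char)c;
    buf[1] = '\0';

    if (c == '+' || c == '-' || c == '*' || c == '/' || c == '=') {
        next = fgetc(fp);               /* one character of lookahead */
        if (next == '=') {
            buf[1] = '=';
            buf[2] = '\0';
            return 2;
        }
        if (next != EOF)
            ungetc(next, fp);           /* not ours: push it back */
    }
    return 1;
}
```

Only one pushed-back character is guaranteed by the C standard, so a lexer that needs two characters of lookahead is usually better off reading the whole line into a buffer first.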

If you care about UTF-8 (and in principle you should) things become slightly more complex since some Unicode characters (like €, é, ...) are represented in UTF-8 by several bytes. A library like libunistring should be very helpful.
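
To show why multi-byte characters matter to a lexer, here is a small sketch that derives the length of a UTF-8 sequence from its first byte and uses it to count characters. The helper names utf8_seq_len and utf8_count are made up for this example, and the code does not validate continuation bytes or reject overlong encodings; a library such as libunistring handles those cases properly.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes in the UTF-8 sequence that starts with `lead`,
 * or 0 if `lead` cannot start a sequence (e.g. a continuation byte). */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                   /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0)
        return 2;                   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0)
        return 3;                   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0)
        return 4;                   /* 11110xxx */
    return 0;
}

/* Count characters (code points) in a NUL-terminated UTF-8 string.
 * A stray or invalid lead byte is counted as one character. */
static size_t utf8_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s != '\0') {
        int len = utf8_seq_len((unsigned char)*s);
        int i;
        if (len == 0)
            len = 1;
        for (i = 0; i < len && *s != '\0'; i++)
            s++;                    /* never step past the terminating NUL */
        n++;
    }
    return n;
}
```

With this, the euro sign (bytes E2 82 AC) counts as one character of three bytes, so a lexer that advances byte by byte must take care not to split it.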

To code this by hand, you would need to write a 'state machine' that has a state for every possible case (actually, this is not that many states). The hardest part would be coding the state transitions. — user3629249, Sep 08 '15 at 19:42
