
I'm not sure how I'm going to tokenize source code for my lexer. For now, all I can think of is using regexes to split the string into an array according to a set of rules (identifiers, symbols such as + and -, etc.).

For instance,

begin x:=1;y:=2;

then I want to tokenize each word, each variable (x and y in this case), and each symbol (:, =, ;).
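For example, here is a rough sketch of what I have in mind, in Python (the token names and rule set are just placeholders I made up, treating := as one symbol):

import re

# One named group per token rule; order matters, so the "begin"
# keyword is tried before the general identifier rule.
TOKEN_RE = re.compile(r"""
      (?P<KEYWORD>\bbegin\b)
    | (?P<NUMBER>[0-9]+)
    | (?P<IDENT>[a-zA-Z][a-zA-Z0-9]*)
    | (?P<ASSIGN>:=)
    | (?P<SEMI>;)
    | (?P<SKIP>\s+)
    | (?P<MISMATCH>.)
""", re.VERBOSE)

def tokenize(source):
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup == "SKIP":
            continue                  # discard whitespace
        if m.lastgroup == "MISMATCH":
            raise SyntaxError(f"unexpected character {m.group()!r}")
        tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("begin x:=1;y:=2;"))
# [('KEYWORD', 'begin'), ('IDENT', 'x'), ('ASSIGN', ':='), ('NUMBER', '1'),
#  ('SEMI', ';'), ('IDENT', 'y'), ('ASSIGN', ':='), ('NUMBER', '2'), ('SEMI', ';')]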

REALFREE

2 Answers


Using regexes is a common way of implementing a lexer. If you don't want to use them, you'll more or less end up implementing parts of a regex engine yourself anyway.

Although doing it yourself can be more efficient performance-wise, it isn't a must.
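For instance, here is a sketch of the hand-rolled alternative (token names invented for illustration); the character tests below re-implement by hand roughly what a regex like [a-zA-Z][a-zA-Z0-9]* expresses in one line:

def tokenize(source):
    tokens, i = [], 0
    while i < len(source):
        c = source[i]
        if c.isspace():                      # skip whitespace
            i += 1
        elif source.startswith(":=", i):     # multi-character operator first
            tokens.append(("ASSIGN", ":="))
            i += 2
        elif c == ";":
            tokens.append(("SEMI", ";"))
            i += 1
        elif c.isdigit():                    # digit run, like [0-9]+
            j = i
            while j < len(source) and source[j].isdigit():
                j += 1
            tokens.append(("NUMBER", source[i:j]))
            i = j
        elif c.isalpha():                    # letter run, roughly [a-zA-Z][a-zA-Z0-9]*
            j = i
            while j < len(source) and source[j].isalnum():
                j += 1
            word = source[i:j]
            tokens.append(("KEYWORD" if word == "begin" else "IDENT", word))
            i = j
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    return tokens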

Oak

Using regular expressions is THE traditional way to generate your tokens. lex and yacc (or flex and bison) are the traditional compiler-construction pair, where lex does nothing except tokenize the input and pass the tokens to yacc.

http://en.wikipedia.org/wiki/Lex_%28software%29

YACC generates a stack-based state machine (a pushdown automaton) that processes those tokens.

I think regex processing is the way to go for tokenizing symbols at any level of complexity. As Oak mentions, otherwise you'll end up writing your own (probably inferior) regex parser. The only exception would be if the language is dead simple, and even your posted example starts to exceed "dead simple".

In lex syntax:

":="                 return ASSIGN_TOKEN_OR_WHATEVER;
"begin"              return BEGIN_TOKEN;
[0-9]+               return NUMBER;
[a-zA-Z][a-zA-Z0-9]* return WORD;

Character sequences are optionally passed along with the token.

Individual characters that are tokens in their own right (e.g. ";") get passed along unmodified. It's not the only way, but I have found it to work very well.

Have a look: http://www.faqs.org/docs/Linux-HOWTO/Lex-YACC-HOWTO.html
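To make that division of labor concrete, here is a rough Python analogue of the lex/yacc pipeline for your example. A hand-written recursive-descent parser stands in for the pushdown automaton yacc would actually generate, and every name here is illustrative:

import re

TOKEN_RE = re.compile(
    r"(?P<KEYWORD>\bbegin\b)|(?P<NUMBER>[0-9]+)"
    r"|(?P<IDENT>[a-zA-Z][a-zA-Z0-9]*)|(?P<ASSIGN>:=)|(?P<SEMI>;)|(?P<SKIP>\s+)")

def lex(source):
    # Yield (kind, text) pairs one at a time, the way yacc pulls
    # successive tokens from yylex().
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()
    yield "EOF", ""

def parse(source):
    # Toy grammar: program ::= 'begin' (IDENT ':=' NUMBER ';')*
    tokens = lex(source)
    tok = next(tokens)

    def expect(kind):
        nonlocal tok
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind}, got {tok}")
        text = tok[1]
        tok = next(tokens, ("EOF", ""))
        return text

    expect("KEYWORD")                 # consume 'begin'
    assignments = []
    while tok[0] == "IDENT":
        name = expect("IDENT")
        expect("ASSIGN")
        value = expect("NUMBER")
        expect("SEMI")
        assignments.append((name, int(value)))
    expect("EOF")
    return assignments

print(parse("begin x:=1;y:=2;"))      # [('x', 1), ('y', 2)]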

Joshua Clayton