Regular expression of a string that starts with a letter followed by only letters and digits except two specific strings in lex

Question

I am trying to write a regular expression for a string with the characteristics described in the title. Here, the two specific strings are "true" and "false". For example, "d6" should be accepted but "6d" should not. Also, "true" and "false" should not be accepted. I have googled a lot and found various examples, but I still can't get it to work. Please help.

I'm a little confused about what you're trying to match, but are you familiar with character classes? For example, if you wanted to match a lowercase character you could do `[a-z]`, or if you wanted both lower and upper case, `[a-zA-Z]`. Using that notation, it seems something like `[a-zA-Z][a-zA-Z0-9]*` may be what you're looking for. Am I misunderstanding what you're after? — user2027202827, Feb 22 '17 at 05:24
I just want to match any string starts with a letter followed by letters and digits. But if the string is "true" or "false", that won't be accepted. — Farhan Kanak, Feb 22 '17 at 09:11

score 1 · Answer 1 · edited May 23 '17 at 12:00

If you are really planning on using lex (or flex), then you need to match the keywords with separate rules.

Remember that (f)lex applies the maximal munch rule: at each point in the input, the token matched is the longest possible match of all patterns, so the rule attached to the pattern with the longest match is performed. Furthermore, if two or more different patterns both produce the same longest match, the first one in the file wins.

So this is how you match identifiers other than true or false:

false   { /* Action for the keyword false */ }
true    { /* Action for the keyword true */ }
[[:alpha:]][[:alnum:]]* {
          /* Action for all other identifiers */
        }

If you also wanted to match integers, you could add:

[[:digit:]]+ {
          /* Action for integers */
        }

That will not produce an error for a token like 6d. There are good reasons to treat that as two tokens rather than produce an error (see, for example, the discussion in this answer), but if you really want to treat it as an error, you can add an error pattern:

[[:digit:]]+[[:alpha:]][[:alnum:]]* {
          /* Action for tokens which start with a digit and contain a letter */
        }
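
The matching behaviour described above (longest match, earliest rule on ties) can be tried out without flex. The following is only a rough Python simulation, not how flex is implemented: the rule names and the `tokenize` helper are made up for illustration, the POSIX classes are spelled out as ASCII ranges because Python's `re` does not support `[[:alpha:]]`, and it relies on each of these particular patterns being greedy so that `match` returns its longest match.

```python
import re

# Rules in file order; an earlier rule wins a tie, as in (f)lex.
# Names are hypothetical labels, not part of any flex API.
RULES = [
    ("FALSE", re.compile(r"false")),
    ("TRUE", re.compile(r"true")),
    ("IDENT", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("INT", re.compile(r"[0-9]+")),
    ("BAD_NUMBER", re.compile(r"[0-9]+[A-Za-z][A-Za-z0-9]*")),
]


def tokenize(text, rules=RULES):
    """Return (rule name, lexeme) pairs: longest match, first rule on ties."""
    pos = 0
    tokens = []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # skip whitespace between tokens
            continue
        best_name, best_len = None, 0
        for name, pattern in rules:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the earliest rule keeps a tie.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens


print(tokenize("true falsetto d6 42 6d"))
print(tokenize("6d", RULES[:4]))  # same input without the error rule
```

Here `true` goes to the keyword rule because it ties with the identifier rule and comes first, while `falsetto` goes to the identifier rule because that match is longer. With the error rule dropped (`RULES[:4]`), `6d` comes out as an integer `6` followed by an identifier `d`.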

The POSIX character classes used above ([[:digit:]] and so on) are documented in the flex manual chapter on patterns.

Got it working now! Thanks for your suggestion. The rules that work are: `true {printf("true\n");}` `false {printf("false\n");}` `^[a-z|A-Z]+[0-9]* {printf("id\n");}` `. ;` — Farhan Kanak, Feb 23 '17 at 02:01
@Farhan: I don't know your precise use case but I doubt whether `^[a-z|A-Z]+[0-9]*` does what you want. For example, the `|` is not special in a character class, so that pattern will match `|93`. On the other hand, it won't match `ident4c`, which looks like an identifier to me. And finally the `^` restricts the match to the very beginning of a line. I strongly recommend that you read the documentation links in my answer. — rici, Feb 23 '17 at 04:03
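
The three points in that last comment can be checked quickly. This is a small sketch using Python's `re` rather than flex, on the assumption that character classes and a line-start `^` behave the same way for these simple patterns; the variable names are illustrative.

```python
import re

# The posted pattern without its leading ^, and the conventional identifier
# pattern ([[:alpha:]][[:alnum:]]* written with ASCII ranges).
posted = re.compile(r"[a-z|A-Z]+[0-9]*")
conventional = re.compile(r"[A-Za-z][A-Za-z0-9]*")

for text in ["|93", "ident4c", "d6"]:
    print(text, bool(posted.fullmatch(text)), bool(conventional.fullmatch(text)))

# With ^, a match can only begin at the start of a line.
anchored = re.compile(r"^[a-z|A-Z]+[0-9]*", re.MULTILINE)
print(anchored.match("ab cd", 3))   # position 3 is mid-line
print(anchored.match("ab\ncd", 3))  # position 3 starts the second line
```

The posted pattern accepts `|93` and rejects `ident4c`, while the conventional one does the opposite; the anchored version finds nothing for `cd` in the middle of a line but matches it at the start of the second line.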

score 0 · Answer 2 · edited Feb 22 '17 at 06:51

I think the regex below would meet your requirements:

/(?!.*(true|false))^[a-zA-Z]+[a-zA-Z0-9]*$/

The details on the above Regex are:

Firstly, the strings 'true' and 'false' must not appear anywhere in the string. This is enforced by (?!.*(true|false)), a negative lookahead that fails if true or false occurs at any position ahead. You can read more about lookahead here.
Secondly, it should start with a letter, so we have ^[a-zA-Z]+
Lastly, the rest of the string can contain any number of letters and digits, and the end is matched by [a-zA-Z0-9]*$.
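
To see what this pattern accepts in practice, here is a quick check. It assumes Python's `re` treats this pattern the same way Perl does (the lookahead and anchors used here are common to both), and the `accepted` helper is just for illustration.

```python
import re

# The pattern from this answer, without the surrounding Perl-style slashes.
pattern = re.compile(r"(?!.*(true|false))^[a-zA-Z]+[a-zA-Z0-9]*$")


def accepted(text):
    """True if the pattern matches somewhere in text (like Perl's =~)."""
    return pattern.search(text) is not None


for text in ["d6", "6d", "true", "false", "falsetto", "untrue1"]:
    print(f"{text!r}: {accepted(text)}")
```

Only `d6` is accepted. Because the lookahead scans the whole string, names that merely contain `true` or `false`, such as `falsetto` or `untrue1`, are rejected along with the two keywords themselves.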

Hope this solves your problem.

edited Feb 22 '17 at 06:51

answered Feb 22 '17 at 06:09

Mohit


This regex matches `true` and `false` because the first list match `[a-zA-Z]` consumes their first letter and the negative lookahead won't be able to check that. – Phu Ngo Feb 22 '17 at 06:18
@PhuNgo I have edited the regex. Can you please check whether anything unwanted still gets through? Thanks for pointing out the error. – Mohit Feb 22 '17 at 06:52
That regex will reject "falsetto" and other words starting with `true` or `false`. Also, negative lookahead assertions are not available in many regex implementations, including the (f)lex compiler-constructor, which the [tag:lex] tag suggests is relevant. – rici Feb 22 '17 at 07:05
Yeah, I tried this out in Perl and it works fine; I'm not sure about Lex as I have never used it. Also, the regex I wrote intentionally rejects true or false anywhere in the string, not only as a separate word. – Mohit Feb 22 '17 at 07:26
@Mohit since it is evident from context that OP is trying to build a lexer, it is extremely unlikely that they wish to reject all identifiers starting with the letters "false". Would you expect that from a programming language you were using? – rici Feb 22 '17 at 07:54

score 0 · Answer 3 · answered Feb 22 '17 at 10:56

Your conditions are

I just want to match any string starts with a letter followed by letters and digits.

This can be achieved with ^[a-zA-Z][a-zA-Z0-9]*$

But if the string is "true" or "false", that won't be accepted.

This is the trickier part. It seems lex does not support negative lookahead or positive/negative lookbehind, so we're stuck with positive lookahead only.

This regex solves the problem with positive lookahead

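^(?=[^tf]|t($|[^r])|tr($|[^u])|tru($|[^e])|true[a-zA-Z0-9]|f($|[^a])|fa($|[^l])|fal($|[^s])|fals($|[^e])|false[a-zA-Z0-9])[a-zA-Z][a-zA-Z0-9]*$

The lookahead part is ^(?=[^tf]|t($|[^r])|tr($|[^u])|tru($|[^e])|true[a-zA-Z0-9]|f($|[^a])|fa($|[^l])|fal($|[^s])|fals($|[^e])|false[a-zA-Z0-9])

It requires that the string is not exactly true or false. Each branch accepts a string that stops or diverges partway through one of the two words, or that continues past its end. So it allows t, trux, truer, falsetto etc., preventing only true and false.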
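
One way to gain some confidence in a branch-by-branch lookahead like this is to compare it with the plain rule "an identifier that is not one of the two keywords". The sketch below does that in Python's `re` (an assumption, since the answer does not name a regex engine). It writes the end-of-string alternatives out explicitly, such as `t(?:$|[^r])`, so that short names like `t` and near-misses like `trux` are accepted; the names `LOOKAHEAD`, `IDENT`, `accepted` and `reference` are illustrative.

```python
import re

# Positive lookahead: the text stops or diverges partway through
# "true"/"false", or continues past the end of the keyword.
LOOKAHEAD = (
    r"(?=[^tf]"
    r"|t(?:$|[^r])|tr(?:$|[^u])|tru(?:$|[^e])|true[a-zA-Z0-9]"
    r"|f(?:$|[^a])|fa(?:$|[^l])|fal(?:$|[^s])|fals(?:$|[^e])|false[a-zA-Z0-9])"
)
IDENT = re.compile(r"^" + LOOKAHEAD + r"[a-zA-Z][a-zA-Z0-9]*$")
PLAIN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def accepted(text):
    """Decision made by the lookahead-based pattern."""
    return IDENT.fullmatch(text) is not None


def reference(text):
    """The intended rule, stated directly."""
    return text not in ("true", "false") and PLAIN.fullmatch(text) is not None


for text in ["d6", "6d", "t", "trux", "true", "truer", "false", "falsetto"]:
    print(f"{text!r}: {accepted(text)} (expected {reference(text)})")
```

On these samples the two functions agree; only `6d`, `true` and `false` are rejected.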

The problem is I can't find a way to write this in lex format, as lex doesn't use the (?=) syntax. It uses r1/r2 syntax, where r1 is matched if it is followed by r2. In our case we want to capture r2 if r1 is true.

I hope someone else can continue by converting this to proper lex format.

The demo for the given regex can be found here.

(f)lex doesn't implement *any* form of lookaround, except for the trailing context operator. Negating regexes is possible but not very readable. Fortunately, you can use the (f)lex matching rules to easily solve token recognition problems like this; see [my answer](http://stackoverflow.com/a/42396701/1566221). — rici, Feb 22 '17 at 16:14
Thank you for the explanation. Learned something about lex ☺ — Abdul Hameed, Feb 22 '17 at 20:27
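
To give an idea of what "possible but not very readable" looks like, here is one hand-written, lookaround-free pattern for "identifier other than true or false". It is a sketch checked with Python's `re` rather than with flex, the names `NOT_KEYWORD` and `is_identifier` are made up, and it assumes ASCII identifiers only.

```python
import re

# Each alternative covers identifiers that leave the spelling of
# "true" or "false" at a different point. No lookahead is used.
NOT_KEYWORD = re.compile(
    r"""
      [a-eg-su-zA-Z][a-zA-Z0-9]*               # first letter is not t or f
    | t    (?: [a-qs-zA-Z0-9][a-zA-Z0-9]* )?   # t, or t + non-r ...
    | tr   (?: [a-tv-zA-Z0-9][a-zA-Z0-9]* )?   # tr, or tr + non-u ...
    | tru  (?: [a-df-zA-Z0-9][a-zA-Z0-9]* )?   # tru, or tru + non-e ...
    | true [a-zA-Z0-9]+                        # true + at least one more
    | f    (?: [b-zA-Z0-9][a-zA-Z0-9]* )?      # f, or f + non-a ...
    | fa   (?: [a-km-zA-Z0-9][a-zA-Z0-9]* )?   # fa, or fa + non-l ...
    | fal  (?: [a-rt-zA-Z0-9][a-zA-Z0-9]* )?   # fal, or fal + non-s ...
    | fals (?: [a-df-zA-Z0-9][a-zA-Z0-9]* )?   # fals, or fals + non-e ...
    | false [a-zA-Z0-9]+                       # false + at least one more
    """,
    re.VERBOSE,
)


def is_identifier(text):
    """True for identifiers other than the keywords true and false."""
    return NOT_KEYWORD.fullmatch(text) is not None


for text in ["d6", "6d", "true", "truer", "trux", "false", "falsetto"]:
    print(f"{text!r}: {is_identifier(text)}")
```

Every additional keyword adds another block of alternatives, which is a good argument for letting rule order in the scanner do the work instead.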
