
I am working on a lexer. I need to identify the patterns in a BNF grammar and specify them as regular expressions. I know there are tokens, e.g. keywords, identifiers, operators, etc. So far I have defined these regular expressions:

    digit=[0-9]
    integer={digit}+
    letter=[a-zA-Z]

But the given BNF rule for Identifier is:

   < id > ::= < letter >
                | "_"
                | < id > < digit >
                | < id > < letter >

Since <id> is defined by a recursive pattern, how can I express it as a regular expression? I tried this regular expression: id={letter}+|"_"|{id}{digit}+ for the above BNF rule, but I get an error saying that the regular expression contains a cycle.

James Z

1 Answer


Looking at the BNF, we can see that an <id> can begin with a letter or an underscore. We can also see that an <id> followed by a digit or a letter is still a valid <id>. This implies that an <id> begins with either a letter or an underscore and can be followed by 0 or more digits or letters, which suggests the following regular expression:

id = [a-zA-Z_][0-9a-zA-Z]*
  1. [a-zA-Z_] Matches a letter or '_'
  2. [0-9a-zA-Z]* Matches a digit or letter 0 or more times.
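
As a quick sanity check, the regular expression can be compared against a direct reading of the BNF. This is only a sketch using Python's `re` module rather than a lexer generator; the names `ID_RE`, `is_id_by_bnf` and `is_id_by_regex` are made up for this illustration.

```python
import re
import string

# Regex from the answer: a letter or underscore, then any number of digits or letters.
ID_RE = re.compile(r"[a-zA-Z_][0-9a-zA-Z]*")

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)


def is_id_by_bnf(s):
    """Recognize <id> by following the BNF rules directly (hypothetical helper).

    Base cases: <letter> or "_".
    Recursive cases: <id><digit> or <id><letter>, i.e. the last character is
    a digit or letter and everything before it is itself an <id>.
    """
    if len(s) == 0:
        return False
    if len(s) == 1:
        return s in LETTERS or s == "_"
    return (s[-1] in LETTERS or s[-1] in DIGITS) and is_id_by_bnf(s[:-1])


def is_id_by_regex(s):
    """Check the whole string against the regular expression."""
    return ID_RE.fullmatch(s) is not None


for candidate in ["x", "_", "_tmp1", "count2", "2fast", "a_b", ""]:
    print(repr(candidate), is_id_by_bnf(candidate), is_id_by_regex(candidate))
```

Both checks agree on every candidate. Note that `a_b` is rejected by both, because this BNF only allows `_` as the first character.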

But since you already have {digit} and {letter} defined as individual character classes, in JFlex this would be (I am not that familiar with JFlex, so I may not have the syntax exactly right):

id = ({letter}|_)({digit}|{letter})*

This would be equivalent to the regex:

([a-zA-Z]|_)([0-9]|[a-zA-Z])*
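
To see why the two forms are equivalent, and why a definition that refers to itself is reported as a cycle, macro expansion can be modelled as plain text substitution. This is a simplified sketch in Python, not how JFlex is actually implemented; `expand`, `MACROS`, `CYCLIC` and the depth limit used to detect the cycle are all assumptions made for the illustration.

```python
import re

# The macros from the question.
MACROS = {"digit": "[0-9]", "letter": "[a-zA-Z]"}

# The question's attempt: "id" refers to itself, so expansion can never finish.
CYCLIC = dict(MACROS, id='{letter}+|"_"|{id}{digit}+')


def expand(pattern, macros, depth=0, max_depth=20):
    """Textually replace every {name} with its (recursively expanded) definition.

    A crude depth limit stands in for real cycle detection (an assumption
    for this sketch; a real tool would inspect the dependency graph).
    """
    if depth > max_depth:
        raise ValueError("macro definitions contain a cycle")

    def substitute(match):
        return expand(macros[match.group(1)], macros, depth + 1, max_depth)

    return re.sub(r"\{([A-Za-z_]\w*)\}", substitute, pattern)


print(expand("({letter}|_)({digit}|{letter})*", MACROS))

try:
    expand("{id}", CYCLIC)
except ValueError as err:
    print("error:", err)
```

The first call yields the fully expanded regex shown above, while expanding `{id}` from the cyclic definition never reaches a macro-free pattern, which is the situation the error message describes.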
Booboo
  • JFlex doesn't insist that you use macros. And it also provides predefined character classes `[:jletter:]` and `[:jletterdigit:]`, although those might not be what is being looked for since they include other Unicode letters. – rici Nov 27 '21 at 20:41
  • @rici I know and wouldn't underscore be included in `[:jletter:]` also? – Booboo Nov 27 '21 at 20:50
  • Thanks, that helped. I am confused by two other rules: < character > is any character, with the normal escapes \n, \t, \', \"; and comments begin with "#" and extend to the end of the line. Can you help with these as well? – komal khan Nov 27 '21 at 21:11
  • I considered this related to the original question. – komal khan Nov 27 '21 at 21:29
  • I said what I said because, as I mentioned, I am not that familiar with the specifics of JFlex and I don't want to give you misinformation. But if you want to skip a comment that begins with `#`, the regex should be `#.*\n`: `.*` matches any character except newline, 0 or more times, and `\n` matches the newline character. If you don't want to include the newline, use just `#.*`. As for the escaped character, my best guess is `\\[a-zA-Z'\"]`: `\\` matches a single `\`, and `[a-zA-Z'\"]` matches a letter or a single or double quote (which must be escaped in JFlex). – Booboo Nov 27 '21 at 22:14
  • Yes, it's *related* to the original question in that it is a JFlex question. But it's a new JFlex question. – Booboo Nov 27 '21 at 22:19
  • You can usually also backslash-escape backslashes. (Otherwise, there'd be no way to write one.) As for `[:jletter:]`, I don't think it includes underscore, but it should include things like `ñ`. – rici Nov 28 '21 at 03:46
  • Can you share any resources for learning regular expressions thoroughly? Thank you. – komal khan Nov 28 '21 at 06:16
  • A basic online tutorial [here](https://regexone.com/). And Google **best resources to learn regular expressions**. – Booboo Nov 28 '21 at 11:03
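
The comment and escape patterns suggested in the comments above can be tried out in Python's `re` module before moving them into a lexer specification. This is only a sketch: Python's regex syntax is close to, but not identical with, that of a lexer generator, and the names `COMMENT_RE`, `ESCAPE_RE` and `strip_comments` are hypothetical.

```python
import re

# "#" followed by any characters up to, but not including, the end of the line.
COMMENT_RE = re.compile(r"#.*")

# A backslash followed by a letter or a single or double quote.
ESCAPE_RE = re.compile(r"""\\[a-zA-Z'"]""")


def strip_comments(text):
    """Remove '#' comments, keeping the newline that ends each line.

    Naive by design: a '#' inside a string literal would also be treated
    as the start of a comment.
    """
    return COMMENT_RE.sub("", text)


source = "x = 1  # set x\ny = 2\n"
print(repr(strip_comments(source)))

for text in [r"\n", r"\t", "\\'", '\\"', r"\1", "n"]:
    print(repr(text), ESCAPE_RE.fullmatch(text) is not None)
```

Because `.` does not match a newline by default, `#.*` stops at the end of the line, which matches the "extend to the end of a line" rule.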