I'm writing a JavaScript parser with Happy and I need to match a regular expression. I don't want to fully parse the regex, just store it as a string.

The relevant part of my AST looks like this:

data PrimaryExpr
    -- | Literal integer
    = ExpLitInt     Integer
    -- | Literal strings
    | ExpLitStr     String
    -- | Identifier
    | ExpId         String
    -- | Bracketed expression
    | ExpBrackExp   Expression
    -- | This (current object)
    | ExpThis
    -- | Regular Expression
    | ExpRegex      String
    -- | Arrays
    | ExpArray      ArrayLit
    -- | Objects
    | ExpObject     [(PropName, Assignment)]
    deriving Show

This is the relevant Happy code:

primaryExpr :: { PrimaryExpr }
    : LITINT          { ExpLitInt $1 }
    | LITSTR          { ExpLitStr $1 }
    | ID              { ExpId $1 }
    | THIS            { ExpThis }
    | regex           { ExpRegex $1 }
    | arrayLit        { ExpArray $1 }
    | objectLit       { ExpObject $1 }
    | '(' expression ')' { ExpBrackExp $2 }

My question is, how should I define my regex non-terminal? Is this kind of structure right?

regex :: { String }
    : '/' whatHere? '/' { $2 }
Nick Brunt

2 Answers

You should define the regex as a terminal (a token such as LITREGEX) that the lexer recognizes, rather than as a non-terminal in the grammar.

primaryExpr :: { PrimaryExpr }
    : LITINT          { ExpLitInt $1 }
    | LITSTR          { ExpLitStr $1 }
    | LITREGEX        { ExpRegex $1 }
    | ID              { ExpId $1 }
    | THIS            { ExpThis }
    | arrayLit        { ExpArray $1 }
    | objectLit       { ExpObject $1 }
    | '(' expression ')' { ExpBrackExp $2 }
pat
  • Ok, good idea. That leads on to the next question - how do I get an Alex lexer to match a regular expression? (I could ask this as a separate question if you think that's a better idea?) – Nick Brunt Feb 14 '12 at 02:27
  • I'm no Alex expert, but something like `\/[^\/]*\/ { \s -> LITREGEX . init . tail $ s }`. This doesn't allow for escaped `/`'s in the regex. YMMV – pat Feb 14 '12 at 02:33
  • To do this properly, you'll need to deal with backslash-escaped `/`'s, and `/`'s inside character classes. – pat Feb 14 '12 at 02:36
  • Yes, that's the problem I'm coming up with. It's perfect other than that though. It matches simple regexes, just not ones with escaped forward slashes. I'll work on it, thank you! – Nick Brunt Feb 14 '12 at 02:43
  • Final solution: `\/([^\/]|\\\/)*\/[gim]* { \s -> Regex s }` – Nick Brunt Feb 14 '12 at 03:01
  • This still allows the regex to terminate if there's an unescaped forward slash inside a character class... – pat Feb 14 '12 at 03:07
  • You can use a start code to change the rules when you're inside a regex, and another start code to change the rules again when you're inside a character class inside a regex. – pat Feb 14 '12 at 03:36
  • How does that work? Would this work? `\/([^\/]|\\\/|\[[^\]]*\/)*\/[gim]* { \s -> Regex s }` – Nick Brunt Feb 14 '12 at 03:37
To answer the question in the comment, I need a bit more room.

Something like (spaced out and commented):

/                     forward slash
(  \\.                either: an escaped character
|  [^\[/\\]           or: anything which isn't / or [ or \
|  \[                 or: a character class containing:
     [^\]]*               anything which isn't ] any number of times
   \]
)*                    any number of times
/                     forward slash

Condensed:

/(\\.|[^\[/\\]|\[[^\]]*\])*/
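
For anyone who prefers to see the same logic outside of a lexer generator, here is a minimal sketch in plain Haskell of a hand-written scanner that follows the pattern above. The name `scanRegex` is hypothetical and not part of Alex or Happy. It also makes a few assumptions beyond the pattern: it accepts trailing `g`, `i`, `m` flags, it rejects literals that contain a line break, it refuses bodies starting with `/` or `*` (so `//` and `/*` stay comments), and it honours backslash escapes inside character classes.

```haskell
-- | Try to read a JavaScript regex literal from the start of the input.
-- Returns the literal text (slashes and flags included) together with the
-- remaining input, or Nothing if the input does not start with a literal.
scanRegex :: String -> Maybe (String, String)
scanRegex ('/' : rest@(c : _))
  | c `notElem` "/*" = body rest "/"   -- "//" and "/*" start comments
  where
    -- 'acc' holds the characters consumed so far, in reverse order.

    -- Outside a character class.
    body ('\\' : x : xs) acc
      | x `notElem` "\r\n" = body xs (x : '\\' : acc)      -- escaped character
    body ('[' : xs) acc = charClass xs ('[' : acc)         -- enter a class
    body ('/' : xs) acc =                                  -- closing slash
      let (flags, remaining) = span (`elem` "gim") xs
      in Just (reverse acc ++ "/" ++ flags, remaining)
    body (x : xs) acc
      | x `notElem` "\r\n\\" = body xs (x : acc)           -- ordinary character
    body _ _ = Nothing                                     -- unterminated

    -- Inside [...]: an unescaped '/' does not end the literal here.
    charClass ('\\' : x : xs) acc
      | x `notElem` "\r\n" = charClass xs (x : '\\' : acc)
    charClass (']' : xs) acc = body xs (']' : acc)         -- leave the class
    charClass (x : xs) acc
      | x `notElem` "\r\n\\" = charClass xs (x : acc)
    charClass _ _ = Nothing
scanRegex _ = Nothing
```

With this sketch, `scanRegex "/[/]/ x"` yields `Just ("/[/]/", " x")`, because the slash inside the character class is not treated as the terminator. Note that it only answers "where does the literal end?"; it does not decide whether a `/` at a given position should be read as a regex or as division.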
porges
  • This is great, cheers. I think it's nicer than mine anyway. I had to escape the forward slashes and add modifiers so here is the end product: `\/(\\.|[^\[\/]|\[[^\]]*\])*\/[gim]* { \s -> Regex s }` – Nick Brunt Feb 14 '12 at 04:10
  • I was going to say I wasn't sure what you'd need to escape for Happy but you seem to have figured it out :) – porges Feb 14 '12 at 04:41
  • Character classes allow the first character to be a ] without it closing the class – pat Feb 14 '12 at 05:07
  • @pat: sometimes. Not in Javascript. I got bitten by this recently :) Here it is in Chrome: http://i.imgur.com/UFtKt.png – porges Feb 14 '12 at 05:12
  • @pat: PCRE and PCRE-compatible stuff allows it. They also allow `[^]]` :) – porges Feb 14 '12 at 05:15
  • @Porges This was working great for me, but I've come up against some cases where it matches too much. If the input is `a.replace(/\\/g, '/');`, it matches `/\\/g, '/` rather than stopping after the g. Can you shed any light on why it's doing this? – Nick Brunt Feb 28 '12 at 08:58
  • @NickBrunt: I can't replicate that in my test (which isn't Happy, admittedly)... it will match that much if you forget to escape the "a.replace(/\\/g, '/');" correctly. Is this being read in from a file or are you typing it in as a test? – porges Feb 28 '12 at 10:22
  • For example, if that's going into a string in Haskell it would look like `"a.replace(/\\\\/g, '/');"` – porges Feb 28 '12 at 10:26
  • @Porges First of all, thank you very much for replying at all on an old question! It's being read in from a file so the escapes will be put in automatically. It's actually part of a very large JavaScript library but I factored it down to this piece of code that is giving me the problem. I now have that on its own in a file and I get the same error. It matches the regex as `/\\/g, '/`. Another example is this: `a = 0.5 / 2; // Comment` where it matches the regex as `/ 2; /`. I'm baffled as to why it's doing it. I can put my program in pastebin or something for you if you want? – Nick Brunt Feb 28 '12 at 12:26
  • @Porges I've put a zip file on my server with my lexer in it. Feel free to have a look if you want. I'll probably take this down once I solve the problem, sorry future people. http://nickbrunt.com/uni/Lexer.zip – Nick Brunt Feb 28 '12 at 13:01
  • @NickBrunt: I see where the first problem is. The first character class in my regex is `[^\[/\\]`, but when you've copied/escaped it, it is `[^\[\/]`. This should be `[^\[\/\\]`. The second problem is a bit harder... the ECMAScript spec http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-262.pdf defines some grammars for RegExp, but then *never uses them*! There's some more discussion on this on the Mozilla mailing list: http://old.nabble.com/Exactly-where-is-a-RegularExpressionLiteral-allowed--to22666418.html#a22666418 To handle this correctly, you're going to need to smarten the lexer. – porges Feb 28 '12 at 20:29
  • @NickBrunt: If you look at the edit history, I actually changed my answer only 3 minutes before you posted that comment... so you must have picked it up before I did that. Sorry, I should have checked the example code you posted. – porges Feb 28 '12 at 20:30
  • @Porges Ah yes, of course, my bad. I did look over your answer again but clearly not closely enough! I have solved the second problem as well with a pretty nasty hack, but at least it works now. The problem cropped up a few times and I noticed that it's always when there is a comment at the end of the line. It always matches the first `/` of the comment and the last `/` of the regex, so I added `[^\/]` at the end of my regex so that it makes sure that the next character is not a `/`, thus eliminating the possibility of matching the first `/`. I am now obviously matching one too many chars – Nick Brunt Feb 28 '12 at 21:24
  • so I had to add this to my lexer function to right that wrong: `Regex s -> Left (Regex (init s), (last s):xs)` where the tuple is the token followed by the rest of the code. This hack promptly brought up another bug. It now won't match regexes followed by a newline, so I changed the regex AGAIN to this which is the current working (fingers crossed) version: `\/(\\.|[^\[\/\\]|\[[^\]]*\])*\/[gim]*([^\/]|[\n\r]) { \s -> Regex s }`. Phew, well it works for now. Thanks so much for your help! P.S. I've added a bounty of 100 reputation which I'm going to give to you for your help. – Nick Brunt Feb 28 '12 at 21:24
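
One common way to "smarten the lexer" for the `a = 0.5 / 2; // Comment` case is to look at the previous significant token before deciding whether a `/` may start a regex literal. The following is only a sketch of that heuristic in Haskell, not what was used in the thread: the `Token` type, its constructors, and `regexAllowed` are hypothetical names, and the rule is an approximation. In particular, `)` and `}` are genuinely ambiguous in JavaScript (for example `if (x) /re/.test(y)`), so a fully correct solution needs feedback from the parser.

```haskell
-- | A hypothetical, cut-down token type; a real lexer would have more cases.
data Token
  = TokNum Double
  | TokStr String
  | TokId String
  | TokRegex String
  | TokKeyword String   -- e.g. "this", "return", "typeof"
  | TokPunct String     -- e.g. "(", ")", "]", "=", ","
  deriving (Eq, Show)

-- | May a '/' at the current position start a regex literal?
-- The argument is the previous significant token (whitespace and comments
-- skipped), or Nothing at the very start of the input.
regexAllowed :: Maybe Token -> Bool
regexAllowed Nothing = True
regexAllowed (Just tok) = case tok of
  -- After a complete operand, '/' is read as division.
  TokNum _     -> False
  TokStr _     -> False
  TokId _      -> False
  TokRegex _   -> False
  -- Keywords that behave like values are followed by division;
  -- others (return, typeof, ...) can be followed by a regex.
  TokKeyword k -> k `notElem` ["this", "null", "true", "false"]
  -- A closing ')' or ']' usually ends an operand (heuristic, see above);
  -- after any other punctuation an operand, hence a regex, is expected.
  TokPunct p   -> p `notElem` [")", "]"]
```

Under this rule the `/` after `0.5` is division because the previous token is a number, while the `/` after `a.replace(` may start a regex because the previous token is `(`. The lexer would only attempt the regex rule when `regexAllowed` returns `True`, which removes the need to peek at the character after the closing slash.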