4

I'm trying to write a parser for the Tiger language from modern compiler implementation in ML, and am stuck on one of the recursive types.

I have the following type

data LValue =                                                                                                                       
    Id Atom                                                                                                                         
    | RecordAccess LValue Atom                                                                                                      
    | ArraySubscript LValue Expression  

with the following grammar:

lvalue -> id
       -> lvalue.id
       -> lvalue[exp]
id -> atom
exp -> (the big, overarching, everything-is-an-expression type)

And I'm trying to parse it with Parsec, but I'm getting stuck in an infinitely recursive loop. Here's my current base parser:

lvalueParser :: Parsec String () LValue                                                                                             
lvalueParser =                                                                                                                      
    try (Id <$> (atomParser <* (notFollowedBy (char '.'))))                                                                         
    <|> try recordAccessParser                                                                                                      
    where recordAccessParser = (uncurry RecordAccess) <$> do {                                                                      
      record <- lvalueParser;                                                                                                         
      char '.';                                                                                                                     
      atom <- atomParser;                                                                                                           
      return (record, atom)                                                                                                      
      }

(Note: I haven't yet attempted to implement anything to handle the ArrayAccess portion)

Clearly, the infinite-loopiness happens when the recordAccessParser calls back to the lvalueParser.

I can change the recordAccessParser thusly:

recordAccessParser = (uncurry RecordAccess) <$> do {                                                                      
          record <- atomParser;                                                                                                         
          char '.';                                                                                                                     
          atom <- atomParser;                                                                                                           
          return (Id record, atom)                                                                                                      
          }

and it terminates. However, it will not parse record access more than a single level deep:

Parsec.parse lvalueParser "" "record_1.field_1.field_2"
#=> RecordAccess (Id record_1) (Id field_1)

And I expect

#=> RecordAccess (RecordAccess (Id record_1) (Id field_1)) (Id field_2)

I looked at chainl1, but the type of the chaining parser is a -> a -> a, and that doesn't match the type of LValue that reflects the grammar. I also looked at many; however I don't have a constant prefix for each term - the left recursion term is what I'm trying to parse into part of the result type.

I imagine I'm missing a specific concept of Parsec/parsing and would love to be pointed in the right direction. There are more types in the language that I'm writing a parser for that will have similar constructions.

Cœur
  • 37,241
  • 25
  • 195
  • 267
Stuart
  • 644
  • 1
  • 6
  • 13

2 Answers2

5

The usual way to parse left-recursive grammars with tools that don't support left recursion, is indeed to replace the left recursion with repetition (i.e. many). For record access that'd mean replacing a rule like

lvalue ::= lvalue '.' ID
         | primaryLValue

with

lvalue ::= primaryLValue ('.' ID)*

In terms of Parsec that would mean:

record <- atomParser                                                                                                       
fields <- many (char '.' >> idParser)

Now you have an LValue and a list of 0 or more field names, which doesn't fit your AST type, but you can solve that by folding the RecordAccess constructor over the list.

sepp2k
  • 363,768
  • 54
  • 674
  • 675
  • 1
    In effect: parse first using a different grammar, and then fix the parse tree to reflect the desired grammar. – rici Nov 24 '17 at 21:11
2

Though you cannnot use chainl1 here, You can define chainl1-like combinator like this:

leftRec :: (Stream s m t)
        => ParsecT s u m a -> ParsecT s u m (a -> a) -> ParsecT s u m a
leftRec p op = rest =<< p
  where
    rest x = do f <- op
                rest (f x)
          <|> return x

I consulted here to implement this. By using this combinator, lvalueParser can be defined as follows:

lvalueParser :: Parser LValue
lvalueParser = leftRec idParser (recordAccessModifier <|> arraySubscriptModifier)
  where
    idParser = Id <$> atomParser
    recordAccessModifier = do
      a <- char '.' *> atomParser
      return (\l -> RecordAccess l a)
    arraySubscriptModifier = do
      e <- between (char '[') (char ']') expParser
      return (\l -> ArraySubscript l e)

Example:

main = parseTest lvalueParser "x.y[2].z"
-- => RecordAccess (ArraySubscript (RecordAccess (Id 'x') 'y') (ENat 2)) 'z'

where Atom, Expression, and their parsers are defined as follows:

type Atom = Char
atomParser :: Parser Atom
atomParser = letter <?> "atom"

data Expression = ENat Int
  deriving Show
expParser :: Parser Expression
expParser = (ENat . read) <$> many digit
Hogeyama
  • 748
  • 4
  • 10