How to properly parse indented block with megaparsec?

Question

I'm trying to make an indentation based programming language, and I'm trying to parse something like:

expr1 :
  expr2
  expr3

Here, essentially : indicates the start of a new indentation block, so expr1 is completely irrelevant, the idea is that : can appear anywhere in the line, and must be the last token of the line.

I got this code that more or less works:

block :: Parser Value
block = dbg "block" $ do
  void $ symbol ":"
  void $ eol
  space1
  (L.indentBlock spaceConsumer indentedBlock)
  where
    indentedBlock = do
      e <- expr
      pure (L.IndentMany Nothing (\exprs -> pure $ Block () (e : exprs)) expr)

But the issue is that in the example, only the first expression of the block is parsed with the proper indentation, the others must be more indented, like this

expr1 :
  expr2
   expr3
   expr4
   expr5

score 2 · Accepted Answer · answered Oct 26 '20 at 10:11

I ended up parsing expr1 right in the same place as the :

Apparently indentBlock starts counting from the column where the parser passed as the last parameter begins, so the idea is to begin parsing from the beginning of the line (relative to current indentation level), it ended up being like this:

block :: Parser Value
block =
  L.indentBlock spaceConsumer indentedBlock
  where
    indentedBlock = do
      caller <- callerExpression
      args <- parseApplicationArgs
      pure (L.IndentSome Nothing (exprsToAppBlock caller args) parse)
    exprsToAppBlock caller args exprs =
      pure (Application () caller (args <> [Block () exprs]))

Liarokapis Alexandros · Answer 2 · 2020-10-26T09:55:29.947

I usually add the following combinators:

import qualified Text.Megaparsec.Char.Lexer as L

indented :: Pos -> Parser a -> Parser (Pos, a)
indented ref p = do pos <- L.indentGuard space GT ref 
                    v <- p
                    pure (pos, v)
        

aligned :: Pos -> Parser a -> Parser a
aligned ref p = L.indentGuard space EQ ref *> p

Then you can use L.indentLevel to get the reference indentation.

Here is an example of parsing a block of statements including error handling:

blocked1 :: Pos -> Parser a -> Parser [a]
blocked1 ref p = do (pos, a) <- indented ref p
                    rest <- many (try $ helper pos)
                    fpos <- getPosition
                    rest' <- traverse (reportErrors pos) rest
                    setPosition fpos
                    pure (a : rest')
    where helper pos' = do pos <- getPosition
                           a <- p
                           when (sourceColumn pos <= ref) $ L.incorrectIndent EQ pos' (sourceColumn pos)
                           pure (pos, a)
          reportErrors ref (pos, v) = setPosition pos *>
            if ref /= sourceColumn pos
               then L.incorrectIndent EQ ref (sourceColumn pos)
               else pure v
                
blocked :: Pos -> Parser a -> Parser [a]
blocked ref p = blocked1 ref p <|> pure []

block :: Pos -> Parser (Block ParserAst)
block ref = do
       s <- blocked1 ref stmt
       pure $ Block s


funcDef :: Parser (FuncDef ParserAst)
funcDef = annotate $
    do pos <- L.indentLevel 
       symbol "def"
       h <- header
       l <- localDefs 
       b <- block pos
       pure $ FuncDef h l b

score 0 · Answer 3 · answered Oct 26 '20 at 01:32

I cannot offer megaparsec specific advice as I don't know that particular library, however I can gift you my wisdom from writing a few indent sensitive language parsers: Your life will be far easier if you lex and parse in separate steps and add indent_begin and indent_end during the lexicographic analysis.

How to properly parse indented block with megaparsec?

3 Answers3