For this simplified problem, I am trying to parse an input that looks like

foo bar
 baz quux 
 woo
hoo xyzzy 
  glulx

into

[["foo", "bar", "baz", "quux", "woo"], ["hoo", "xyzzy", "glulx"]]

The code I've tried is as follows:

import qualified Text.Megaparsec.Lexer as L
import Text.Megaparsec hiding (space)
import Text.Megaparsec.Char hiding (space)
import Text.Megaparsec.String
import Control.Monad (void)
import Control.Applicative

space :: Parser ()
space = L.space (void spaceChar) empty empty

item :: Parser () -> Parser String
item sp = L.lexeme sp $ some letterChar

items :: Parser () -> Parser [String]
items sp = L.lineFold sp $ \sp' -> some (item sp')

items_ :: Parser [String]
items_ = items space

This works for one block of items:

λ» parseTest items_ "foo bar\n baz quux\n woo"
["foo","bar","baz","quux","woo"]

But as soon as I try to parse many items, it fails on the first unindented line:

λ» parseTest (many items_) "foo bar\n baz quux\n woo\nhoo xyzzy\n  glulx"
4:1:
incorrect indentation (got 1, should be greater than 1)

or, with an even simpler input:

λ» parseTest (many items_) "a\nb"
2:1:
incorrect indentation (got 1, should be greater than 1)
Cactus
    @ron: `parsec` does not backtrack by default, you have to use `try`, otherwise already consumed characters/tokens are "gone". You're thinking of `attoparsec`. – Zeta May 16 '16 at 15:05

1 Answer

Megaparsec's author here :-) One thing to remember when working with Megaparsec is that its lexer module is intentionally “low-level”. It does not do anything you cannot build yourself, and it doesn't lock you into any particular “framework”. So in your case you have the space consumer sp' provided for you, but you should use it carefully: it is guaranteed to fail when it reaches a line whose indentation level is less than or equal to the indentation level of the start of the whole fold. That is, by the way, how your fold ends.

To quote the docs:

Create a parser that supports line-folding. The first argument is used to consume white space between components of line fold, thus it must consume newlines in order to work properly. The second argument is a callback that receives custom space-consuming parser as argument. This parser should be used after separate components of line fold that can be put on different lines.

sc = L.space (void spaceChar) empty empty

myFold = L.lineFold sc $ \sc' -> do
  L.symbol sc' "foo"
  L.symbol sc' "bar"
  L.symbol sc  "baz" -- for the last symbol we use normal space consumer

A line fold cannot run indefinitely, so you should expect it to fail with an error message similar to the one you have right now. For the parse to succeed, you need to give the fold a way to finish. This is usually done by using the “normal” space consumer at the end of the line fold:

space :: Parser ()
space = L.space (void spaceChar) empty empty

item :: Parser String
item = some letterChar

items :: Parser () -> Parser [String]
items sp = L.lineFold sp $ \sp' ->
  item `sepBy1` try sp' <* sp

items_ :: Parser [String]
items_ = items space

item `sepBy1` try sp' runs until try sp' fails, and then sp consumes the remaining white space, so the next fold can be parsed.

λ> parseTest items_ "foo bar\n baz quux\n woo"
["foo","bar","baz","quux","woo"]
λ> parseTest (many items_) "foo bar\n baz quux\n woo\nhoo xyzzy\n  glulx"
[["foo","bar","baz","quux","woo"],["hoo","xyzzy","glulx"]]
λ> parseTest (many items_) "foo bar\n baz quux\n woo\nhoo\nxyzzy\n  glulx"
[["foo","bar","baz","quux","woo"],["hoo"],["xyzzy","glulx"]]
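The fold boundaries in these results follow one rule: a line continues the current fold only if it is indented strictly deeper than the fold's first line. As a rough illustration of that rule alone, here is a minimal sketch that uses only base and no Megaparsec. The names indentOf, groupFolds and foldWords are made up for this example, and it assumes indentation is done with spaces only.

```haskell
import Data.Char (isSpace)

-- | Indentation of a line: the number of leading spaces.
-- (Assumption for this sketch: tabs are not used for indentation.)
indentOf :: String -> Int
indentOf = length . takeWhile (== ' ')

-- | Split lines into folds. A fold starts at some line and absorbs every
-- following line that is indented strictly deeper than that first line.
groupFolds :: [String] -> [[String]]
groupFolds [] = []
groupFolds (l : ls) = (l : continuation) : groupFolds rest
  where
    (continuation, rest) = span (\x -> indentOf x > indentOf l) ls

-- | Drop blank lines, group the rest into folds, and collect the
-- whitespace-separated words of each fold.
foldWords :: String -> [[String]]
foldWords =
  map (concatMap words) . groupFolds . filter (not . all isSpace) . lines
```

Applied to the same input strings as above, foldWords should produce the same grouping as many items_. It does no error reporting and is not a replacement for the parser; it only makes the "greater than the starting indentation" condition explicit.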
Mark Karpov
  • In my real parser, `item` is much more complicated (and can contain further `lineFold` blocks). How would I externalize the consumption of whitespace in that case? – Cactus May 17 '16 at 07:21
  • To put it in more concrete terms, do you have examples of parsing e.g. Haskell 98 with Megaparsec? – Cactus May 17 '16 at 07:21
  • @Cactus, I don't have examples that parse entire languages like Haskell98. I don't know if there are examples of that written with Parsec, which has been around for a long time. `Text.Parsec.Token` certainly cannot parse Haskell. Megaparsec 5 is two days old; we have a tutorial about indentation-sensitive parsing on our site, and Haddocks, but that's about it. Without seeing your code it's hard to advise, but if your problem is solvable with monadic parser combinators, Megaparsec won't get in your way. You just need to find a way to express it. – Mark Karpov May 17 '16 at 07:33
  • @Cactus, remember `Text.Megaparsec.Lexer` is just a collection of useful shortcuts that seem to make sense. – Mark Karpov May 17 '16 at 07:35
  • Fair enough. I'm marking this answer as accepted, then, and will post a more detailed example with nested `lineFold`s later. – Cactus May 17 '16 at 07:35
  • Isn't `items sp` equivalent to ```L.lexeme sp $ L.lineFold sp $ \sp' -> item `sepBy1` try sp'```? – Cactus May 17 '16 at 11:07
  • @Cactus, looks like it is. This is nice, isn't it? – Mark Karpov May 17 '16 at 12:14