5

I'm trying to separate a string by either ",", ", and" and "and", and then return whatever was in between. An example of what I have so far is as follows:

import Data.Attoparsec.Text

sepTestParser = nameSep ((takeWhile1 $ inClass "-'a-zA-Z") <* space)
nameSep p = p `sepBy` (string " and " <|> string ", and" <|> ", ")

main = do
  print $ parseOnly sepTestParser "This test and that test, this test particularly."

I would like the output to be ["This test", "that test", "this test particularly."]. I have a vague sense that what I'm doing is wrong, but I can't quite work out why.

duplode
  • 33,731
  • 7
  • 79
  • 150
oneway
  • 753
  • 1
  • 5
  • 9
  • Why not `nameSep . takeWhile1 $ inClass " \t-'a-zA-Z"`? - your output clearly doesn't treat spaces differently, why not include them in the character class? If you like `space` instead of explicit characters like `" \t"` etc, you could use `nameSep . takeWhile1 $ inClass "-'a-zA-Z" <|> space` – AndrewC Jun 10 '14 at 16:41

1 Answers1

4

Note: This answer is written in literate Haskell. Save it as Example.lhs and load it in GHCi or similar.

The thing is, sepBy is implemented as:

sepBy p s = liftA2 (:) p ((s *> sepBy1 p s) <|> pure []) <|> pure []

This means that the second parser s will be called after the first parser has succeeded. This also means, that if you were to add whitespace to the class of characters, that, you would end up with

["This test and that test","this test particularly"]

since and is now parseable by p. This isn't easy to fix: you would need to look ahead as soon as you hit a space, and check if after an arbitrarily number of spaces an "and" follows, and if it does, stop parsing. Only then a parser written with sepBy will work.

So lets write a parser that takes words instead (the rest of this answer is literate Haskell):

> {-# LANGUAGE OverloadedStrings #-}
> import Control.Applicative
> import Data.Attoparsec.Text
> import qualified Data.Text as T
> import Control.Monad (mzero)

> word = takeWhile1 . inClass $ "-'a-zA-Z"
> 
> wordsP = fmap (T.intercalate " ") $ k `sepBy` many space
>   where k = do
>           a <- word
>           if (a == "and") then mzero
>                           else return a

wordsP now takes multiple words until it either hits something, that's not a word, or a word that equals "and". The returned mzero will indicate a parsing failure, at which another parser can take over:

> andP = many space *> "and" *> many1 space *> pure()
> 
> limiter = choice [
>     "," *> andP,
>     "," *> many1 space *> pure (),
>     andP
>   ]

limiter is mostly the same parser you've already written, it's the same as the regex /,\s+and|,\s+|\s*and\s+/.

Now we can actually use sepBy, since our first parser doesn't overlap with the second anymore:

> test = "This test and that test, this test particular, and even that test"
>
> main = print $ parseOnly (wordsP `sepBy` limiter) test

The result is ["This test","that test","this test particular","even that test"], just as we wanted. Note that this particular parser doesn't preserve whitespace.

So whenever you want to create a parser with sepBy, make sure that both parsers don't overlap.

Zeta
  • 103,620
  • 13
  • 194
  • 236
  • @atc: This is literal Haskell. Therefore, the `>` at the beginning of the line are significant. – Zeta Apr 05 '16 at 13:56
  • Oh I missed something? How is it significant? Looked like straight sourcecode to me, and thus copy/paste-able into a repl or source file. The > made that cumbersome. What does `>` do here? – Alex Apr 05 '16 at 14:49
  • 1
    It's not copy/paste-able into a REPL either way, since most of the functions are defined on multiple lines. [I've linked an explanation](https://wiki.haskell.org/Literate_programming), but here's a short summary: you can simply copy and paste __all__ of the answer (including the explanation/text) into a `.lhs` file and load that in GHCi (or compile it with GHC). Note that some lines aren't meant to be compiled, e.g. the first `sepBy p …` explanation or the top-level list; they wouldn't work either way. – Zeta Apr 05 '16 at 14:55
  • 1
    (Literate programming basically swaps comments and code; everything is a comment, unless you start a line with `>`) – Zeta Apr 05 '16 at 14:56