I'm writing a parser for large chunks of English text using attoparsec. Everything has worked well so far, except for parsing this char: "――". As far as I know it is just two dashes together ("--"). The weird thing is that the parser catches it in this code:

wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass "――?!,:")) >> pure () 

but not in this case:

specialChars = ['――', '?', '!', ',', ':']
wordSeparator :: Parser ()
wordSeparator = many1 (space <|> satisfy (inClass specialChars)) >> pure ()

The reason I'm using the list specialChars is that I have a lot of characters to consider and I apply it in multiple cases. As an example, consider the input "I am ――Walt Whitman._", for which the output is supposed to be {"I", "am", "Walt", "Whitman."}. I believe the problem is mostly that "――" is not a Char? How do I fix this?

centrinok

1 Answer

A Char is one character, full stop. ―― is two characters, so it is two Chars. You can fit as many Chars as you want into a String, but you certainly cannot fit two Chars into one Char.

Since satisfy considers one character at a time, it probably isn’t what you want if you need to parse a sequence of two characters as a single unit. The inClass function just produces a predicate on characters (inClass partially applied to one argument produces a function of type Char -> Bool), so inClass "――" is the same as inClass ['―', '―'], which is just the same as inClass ['―'] since duplicates are irrelevant. That won’t help you much.
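To make that concrete, here is a small sketch of how this could be checked. It assumes the attoparsec and text packages are available; the `main` wrapper exists only for illustration.

```haskell
import Data.Attoparsec.Text (inClass, many1, parseOnly, satisfy)
import qualified Data.Text as T

main :: IO ()
main = do
  -- The literal "――" is a String holding two Chars, not one.
  print (length "――")                              -- 2
  -- Both classes accept a single '―'; the duplicate adds nothing.
  print (inClass "――" '―' && inClass "―" '―')      -- True
  -- satisfy consumes one Char per step, so a lone '―' is already a match.
  print (parseOnly (many1 (satisfy (inClass "――"))) (T.pack "―") == Right "―")  -- True
```

So a class-based predicate can only ever say yes or no to one character; it has no way to require that two of them appear together.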

Consider using string instead of or in combination with inClass, since it is designed to handle sequences of characters. For example, something like this might better suit your needs:

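-- space and satisfy yield a Char while string yields a Text, so each branch is
-- wrapped in void (from Data.Functor) to give every alternative the type Parser ().
wordSeparator :: Parser ()
wordSeparator = many1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:"))) >> pure ()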
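For reference, here is a self-contained sketch of how a string-based separator might be wired into a complete word parser. The names `word` and `wordsOf` are hypothetical helpers, not part of attoparsec or of the question. The input drops the trailing underscore from the question's example, and `skipMany1` is swapped in for `many1 ... >> pure ()`.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative ((<|>))
import Data.Attoparsec.Text
  (Parser, endOfInput, inClass, option, parseOnly, satisfy, sepBy,
   skipMany1, space, string, takeWhile1)
import Data.Char (isSpace)
import Data.Functor (void)
import Data.Text (Text)

-- One or more separators: whitespace, the two-character "――", or punctuation.
-- void discards each result so that all three branches have type Parser ().
wordSeparator :: Parser ()
wordSeparator =
  skipMany1 (void space <|> void (string "――") <|> void (satisfy (inClass "?!,:")))

-- Hypothetical helper: a word is a run of characters that cannot begin a separator.
word :: Parser Text
word = takeWhile1 (\c -> not (isSpace c || inClass "―?!,:" c))

-- Hypothetical helper: words separated (and optionally surrounded) by separators.
wordsOf :: Parser [Text]
wordsOf = option () wordSeparator *> (word `sepBy` wordSeparator) <* option () wordSeparator

main :: IO ()
main = print (parseOnly (wordsOf <* endOfInput) "I am ――Walt Whitman.")
-- prints: Right ["I","am","Walt","Whitman."]
```

With this sketch a lone `―` is not accepted as a separator, so an input such as "a―b" fails to parse instead of being split. Whether that is the behaviour you want depends on your text.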
Alexis King
  • Sorry, I should have mentioned that I have been using Data.Text, and using string "--" would cause an error. However, I fixed it by using another inClass. You can see my updated post. – centrinok May 06 '18 at 04:21
  • @ceeks [`string` from `Data.Attoparsec.Text`](https://hackage.haskell.org/package/attoparsec-0.13.2.2/docs/Data-Attoparsec-Text.html#v:string) will work just fine with `Data.Text`. You may need to use `OverloadedStrings` or `Data.Text.pack` to produce a `Text` value for the argument to `Data.Attoparsec.Text.string`, but believe me, it really is the function you want. – Alexis King May 06 '18 at 04:23
  • Thanks for the suggestion, I will do so. But just out of sheer curiosity, what implications or consequences would there be if I used another satisfy (inClass "--")? – centrinok May 06 '18 at 04:28
  • @ceeks As I mentioned in my answer, `inClass "――"` would be precisely equivalent to `inClass ['―']`, so your parser would treat the string `――` as two distinct separators, and it would parse a single `―` as a separator. I assume this isn’t what you want, since if it was, you would just write `inClass "―"` and be done with it. – Alexis King May 06 '18 at 04:31
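
As a follow-up to the comments above, this is a minimal sketch of calling `string` on `Text` input without `OverloadedStrings`, packing the literal by hand. The name `doubleBar` is hypothetical and only serves the illustration.

```haskell
import Data.Attoparsec.Text (Parser, parseOnly, string)
import Data.Either (isRight)
import qualified Data.Text as T

-- Without OverloadedStrings the literal is a String, so it is packed into a Text.
doubleBar :: Parser T.Text
doubleBar = string (T.pack "――")

main :: IO ()
main = do
  -- Both bars must be present for the parser to succeed.
  print (isRight (parseOnly doubleBar (T.pack "――Walt")))  -- True
  -- A single bar is rejected, unlike with satisfy (inClass "――").
  print (isRight (parseOnly doubleBar (T.pack "―Walt")))   -- False
```

Enabling `OverloadedStrings` instead would let you write `string "――"` directly, at the cost of a language pragma at the top of the module.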