I've been trying to parse a .txt file containing some English text. My code is supposed to return the number of paragraphs in that file. For some reason, attoparsec can't seem to recognize a newline or other characters such as \n, \r, or \t. My code is below. I've also tried using many1 (satisfy (inClass "\n\r\t")), but still no luck. What do you think the underlying problem is? Also, here's the link to the sample text file I've been testing it on.
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt

newtype Prose = Prose {
    word :: [Char]
}

instance Show Prose where
    show a = word a

-- Run a parser if it matches and discard its result.
optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',']

-- One paragraph: a non-empty run of letters, digits, whitespace and punctuation.
inputPara :: Parser Prose
inputPara = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

-- Separator between paragraphs: one or more line breaks or horizontal spaces.
paraSeparator :: Parser ()
paraSeparator = many1 (satisfy (isEndOfLine) <|> satisfy (isHorizontalSpace)) >> pure ()

paraParser :: String -> [Prose]
paraParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x  -> x
  where
    wp = optional paraSeparator *> inputPara `sepBy1` paraSeparator

main :: IO ()
main = do
    input <- readFile "test.txt"
    let para = paraParser input
    print para
    print $ length para
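
One possible cause, offered as a sketch rather than a confirmed diagnosis: attoparsec's space matches any character for which isSpace is true, and that includes \n, \r and \t. If so, inputPara can consume the line breaks itself, leaving nothing for paraSeparator to match, and the whole file would come back as a single paragraph. The variant below assumes a paragraph is simply any non-empty run of characters up to the next line break; the names paragraph, separator and paragraphs are hypothetical and not part of the code above.

```haskell
import Data.Attoparsec.Text
import qualified Data.Text as T

-- Assumption: a paragraph is a non-empty run of characters that are not
-- '\n' or '\r', so it can never swallow a line break.
paragraph :: Parser T.Text
paragraph = takeWhile1 (not . isEndOfLine)

-- One or more line breaks, possibly mixed with spaces or tabs.
separator :: Parser ()
separator = skipMany1 (satisfy (\c -> isEndOfLine c || isHorizontalSpace c))

-- Paragraphs separated by line breaks; leading and trailing whitespace is skipped.
paragraphs :: Parser [T.Text]
paragraphs = skipSpace *> paragraph `sepBy1` separator <* skipSpace <* endOfInput

main :: IO ()
main = do
    let sample = T.pack "First para.\n\nSecond para.\r\n\r\nThird.\n"
    -- Print the paragraph count, or the parse error if parsing fails.
    print (length <$> parseOnly paragraphs sample)
```

With this shape, a single line break already ends a paragraph. If paragraphs in the real file span several lines, the separator would need to require a blank line instead.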