I'm trying to parse an ebook in .txt form, to learn more about attoparsec and Haskell (I'm a newbie). In this case, I'm trying to count the number of sentences in the given text file. Here's my code:
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Data.List
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)
data Prose = Prose {
    word :: [Char]
} deriving Show
optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())
specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',']
inputSentence :: Parser Prose
inputSentence = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars))
sentenceSeparator :: Parser ()
sentenceSeparator = many1 (space <|> satisfy (inClass ".?!")) >> pure ()
sentenceParser :: String -> [Prose]
sentenceParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
  where
    wp = optional sentenceSeparator *> inputSentence `sepBy1` sentenceSeparator
main :: IO ()
main = do
    input <- readFile "test.txt"
    let sentences = sentenceParser input
    print sentences
    print $ length sentences
The complete code is in the GitHub repo, if you want to take a full look at what I'm doing.
My problem is that when I try to parse a text file with this input:
I get the following output:
So my question is, how can I:
- Make the parser realize that anything after "\n\n.." is a different sentence.
- Make it treat input like "Daniel G. Brinton" as just one sentence.
I've tried using isHorizontalSpace, but to no avail.
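To make the two requirements concrete, here is a minimal sketch of one possible direction, not a definitive solution. It reads the text as tokens (a word plus the whitespace after it) and ends a sentence either when a word ends in ".", "?" or "!", or when the whitespace after it contains a blank line. The names token, isInitial, endsSentence, isParagraphBreak, groupTokens and splitSentences are hypothetical, and the rule "one upper-case letter followed by a period is an initial" is an assumed heuristic that will misjudge sentences ending in a single capital letter (for example "So am I.") and does not handle closing quotes after the punctuation.

```haskell
import Control.Applicative (many)
import qualified Data.Attoparsec.Text as A
import Data.Char (isSpace, isUpper)
import qualified Data.Text as T

-- One token: a run of non-space characters plus the whitespace after it.
token :: A.Parser (T.Text, T.Text)
token = (,) <$> A.takeWhile1 (not . isSpace) <*> A.takeWhile isSpace

-- Assumed heuristic: a single upper-case letter followed by a period
-- (such as "G.") is an initial and does not end a sentence.
isInitial :: T.Text -> Bool
isInitial w = T.length w == 2 && isUpper (T.head w) && T.last w == '.'

-- A word ends a sentence if it ends in terminal punctuation and is not an initial.
-- Words are never empty here because they come from takeWhile1.
endsSentence :: T.Text -> Bool
endsSentence w = T.last w `elem` ".?!" && not (isInitial w)

-- Whitespace containing two or more newlines is treated as a blank line.
isParagraphBreak :: T.Text -> Bool
isParagraphBreak gap = T.length (T.filter (== '\n') gap) >= 2

-- Collect words into sentences, closing one at each sentence end or blank line.
groupTokens :: [(T.Text, T.Text)] -> [T.Text]
groupTokens = go []
  where
    go acc [] = flush acc
    go acc ((w, gap) : rest)
      | endsSentence w || isParagraphBreak gap = flush (w : acc) ++ go [] rest
      | otherwise = go (w : acc) rest
    flush [] = []
    flush ws = [T.unwords (reverse ws)]

splitSentences :: T.Text -> Either String [T.Text]
splitSentences =
  A.parseOnly (groupTokens <$> (A.skipSpace *> many token <* A.endOfInput))

main :: IO ()
main = do
  let sample = T.pack "Daniel G. Brinton\n\nThis is one. Is this two? Yes!"
  print (splitSentences sample)
  -- prints: Right ["Daniel G. Brinton","This is one.","Is this two?","Yes!"]
```

Keeping the whitespace with each token is what lets the blank-line check work: isHorizontalSpace only matches spaces and tabs, so on its own it never sees the newlines that mark a paragraph break.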