
I'm trying to parse an ebook in .txt form, to learn more about attoparsec and Haskell (I'm a newbie). In this case, I'm trying to count the number of sentences in the given text file. Here's my code:

{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Data.List
import Control.Applicative ((<*>), (*>), (<$>), (<|>), pure)

data Prose = Prose {
  word :: [Char]
} deriving Show

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',']

inputSentence :: Parser Prose
inputSentence = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars))

sentenceSeparator :: Parser ()
sentenceSeparator = many1 (space <|> satisfy (inClass ".?!")) >> pure ()

sentenceParser :: String -> [Prose]
sentenceParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
        wp = optional sentenceSeparator *> inputSentence `sepBy1` sentenceSeparator

main :: IO ()
main = do
  input <- readFile "test.txt"
  let sentences = sentenceParser input
  print sentences
  print $ length sentences

The full project is in my GitHub repo if you want a complete look at what I'm doing. My problem is that when I parse a text file with the following input:

[image: input text]

I get the following output:

[image: parser output]

So my question is, how can I:

  1. Make the parser realize that anything with "\n\n.." is a different sentence.
  2. Make it treat input like "Daniel G. Brinton" as just one sentence.

I've tried using isHorizontalSpace, but to no avail.

melpomene
centrinok
  • According to your question 1, the sentence separator should rather be "at least two newline characters", e.g. `endOfLine >> many1 endOfLine` – Regis Kuckaertz May 06 '18 at 09:44
  • @ceeks can you replace the images with actual text? It would make it easier to help you out. – MCH May 10 '18 at 18:03
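
A minimal sketch of how the two requirements might be approached, building on the `endOfLine >> many1 endOfLine` idea from the comment above. It has not been run against the actual ebook. The names `blankLine`, `softBreak`, `paragraph`, `paragraphs`, `isInitial`, `endsSentence`, `sentencesIn` and `splitSentences` are made up for illustration, and the sample text in `main` is hypothetical. The rule that one capital letter followed by a period (such as "G.") is an initial rather than a sentence end is an assumption, not something the question specifies. Blank lines are handled with attoparsec; the split inside each paragraph is done on plain `Text` for brevity.

```haskell
module Main where

import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import Data.Char (isUpper)
import qualified Data.Text as T

-- Separator from the comment: a line ending followed by one or more
-- further line endings, i.e. at least one blank line.
blankLine :: Parser ()
blankLine = endOfLine *> skipMany1 endOfLine

-- A single line ending inside a paragraph, read as a space.
-- It fails (and attoparsec backtracks) when another line ending follows.
softBreak :: Parser Char
softBreak = do
  endOfLine
  next <- peekChar
  case next of
    Just c | isEndOfLine c -> fail "blank line"
    _                      -> pure ' '

-- One block of text between blank lines.
paragraph :: Parser T.Text
paragraph = T.pack <$> many1 (satisfy (not . isEndOfLine) <|> softBreak)

paragraphs :: Parser [T.Text]
paragraphs = skipMany endOfLine *> paragraph `sepBy` blankLine

-- Assumption: a word made of one capital letter plus '.' is an initial.
isInitial :: T.Text -> Bool
isInitial w = case T.unpack w of
  [c, '.'] -> isUpper c
  _        -> False

-- A word ends a sentence if it ends in '.', '?' or '!' and is not an initial.
endsSentence :: T.Text -> Bool
endsSentence w = case T.unsnoc w of
  Just (_, c) -> c `elem` ".?!" && not (isInitial w)
  Nothing     -> False

-- Split one paragraph into sentences, word by word.
sentencesIn :: T.Text -> [T.Text]
sentencesIn = map T.unwords . go . T.words
  where
    go [] = []
    go ws = case break endsSentence ws of
      (before, w : rest) -> (before ++ [w]) : go rest
      (before, [])       -> [before]

splitSentences :: T.Text -> Either String [T.Text]
splitSentences input = concatMap sentencesIn <$> parseOnly paragraphs input

main :: IO ()
main = do
  -- Hypothetical sample; only the name "Daniel G. Brinton" is from the question.
  let sample = T.pack "A BOOK TITLE\n\nBy Daniel G. Brinton\n\nFirst one. Second\none? Third!"
  case splitSentences sample of
    Left err -> putStrLn ("parse error: " ++ err)
    Right ss -> mapM_ print ss >> print (length ss)
```

On the sample above this yields five sentences: the title, the author line kept whole, and three sentences from the last paragraph. The heuristic is deliberately crude: abbreviations such as "Mr." would still end a sentence, and a terminator followed by a closing quote would not, so a real ebook would likely need more rules.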

0 Answers