I've been trying to parse a .txt file containing some English text. My code is supposed to return the number of paragraphs in that file. For some reason, attoparsec can't seem to recognize a newline or other whitespace characters such as \n, \r, and \t. Below is my code. I've also tried using many1 (satisfy (inClass "\n\r\t")), but still no luck. What do you think the underlying problem is? Also, here's the link to the sample text file I've been testing it on.

import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt

newtype Prose = Prose {
  word :: [Char]
}

instance Show Prose where
  show a = word a

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',']

inputPara :: Parser Prose
inputPara = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

paraSeparator :: Parser ()
paraSeparator = many1 (satisfy (isEndOfLine) <|> satisfy (isHorizontalSpace)) >> pure ()

paraParser :: String -> [Prose]
paraParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
      wp = optional paraSeparator *> inputPara `sepBy1` paraSeparator

main :: IO()
main = do
  input <- readFile "test.txt"
  let para = paraParser input
  print para
  print $ length para
duplode
centrinok

1 Answer

The problem is the space parser in the following line:

inputPara = Prose <$> many1' (letter <|> digit <|> space <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

space matches characters such as \n, \r, and \t (every character for which isSpace returns True).

That's why inputPara consumes the whole text, separators included, and never hands control to paraSeparator.
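
To see the effect in isolation, here is a minimal sketch comparing a body parser that allows space with one that allows only a literal blank. The names withSpace and withBlank are hypothetical and exist only for this illustration; the input string is made up as well.

```haskell
import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, parseOnly, many1, letter, space, char)
import qualified Data.Text as T

-- Hypothetical name: body parser that allows `space`,
-- i.e. any character accepted by Data.Char.isSpace.
withSpace :: Parser String
withSpace = many1 (letter <|> space)

-- Hypothetical name: body parser that allows only a literal blank.
withBlank :: Parser String
withBlank = many1 (letter <|> char ' ')

main :: IO ()
main = do
  let input = T.pack "one two\n\nthree"
  -- The line breaks are swallowed, so no separator is ever seen.
  print (parseOnly withSpace input)  -- Right "one two\n\nthree"
  -- Parsing stops at the first line break.
  print (parseOnly withBlank input)  -- Right "one two"
```

The first parser runs straight through the blank line, which mirrors what happens to inputPara in the question.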

One solution is to remove the space parser from inputPara and add the ' ' character to specialChars.

For example, the following code should work, but feel free to choose whichever option works best for you:

import Data.Attoparsec.Text
import qualified Data.Text as T
import qualified Data.Text.IO as Txt
import Control.Applicative ((<|>))

newtype Prose = Prose {
  word :: [Char]
}

instance Show Prose where
  show a = word a

optional :: Parser a -> Parser ()
optional p = option () (try p *> pure ())

specialChars = ['-', '_', '…', '“', '”', '\"', '\'', '’', '@', '#', '$',
                '%', '^', '&', '*', '(', ')', '+', '=', '~', '`', '{', '}',
                '[', ']', '/', ':', ';', ',', ' ']

inputPara :: Parser Prose
inputPara = Prose <$> many1' (letter <|> digit <|> satisfy (inClass specialChars) <|> satisfy (inClass "――.?!") )

paraSeparator :: Parser [Char]
paraSeparator = many1 space

paraParser :: String -> [Prose]
paraParser str = case parseOnly wp (T.pack str) of
    Left err -> error err
    Right x -> x
    where
      wp = optional paraSeparator *> inputPara `sepBy1` paraSeparator

main :: IO()
main = do
  input <- readFile "test.txt"
  let para = paraParser input
  print para
  print $ length para
Igor Drozdov
But the above code would separate the text by words. I want the text to be separated by paragraphs. For the pastebin link I attached, the output should have a length of 3. – centrinok May 08 '18 at 16:28
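
If the paragraphs in the sample file are separated by blank lines (an assumption, since the file itself is not shown here), one possible approach is to split only on blank lines instead of on any run of whitespace. The sketch below is not the answerer's code: blankLine, paraBreak, paragraph, and paragraphs are hypothetical names, and the sample input in main is invented for illustration.

```haskell
import Control.Applicative ((<|>))
import Data.Attoparsec.Text
import qualified Data.Text as T

-- Hypothetical helper: a line holding only spaces/tabs, ended by a line break.
blankLine :: Parser ()
blankLine = skipWhile isHorizontalSpace *> endOfLine

-- Hypothetical helper: a paragraph break is a line break
-- followed by one or more blank lines.
paraBreak :: Parser ()
paraBreak = endOfLine *> skipMany1 blankLine

-- A paragraph is every character up to, but not including,
-- the next paragraph break or the end of the input.
paragraph :: Parser T.Text
paragraph = T.pack <$> manyTill anyChar (lookAhead (paraBreak <|> endOfInput))

-- Split the text into paragraphs, trimming surrounding whitespace
-- and dropping any paragraphs that end up empty.
paragraphs :: T.Text -> Either String [T.Text]
paragraphs =
  fmap (filter (not . T.null) . map T.strip)
    . parseOnly (paragraph `sepBy1` paraBreak)

main :: IO ()
main = do
  let input = T.pack "one\ntwo\n\nthree\n \n\nfour\n"
  print (paragraphs input)               -- Right ["one\ntwo","three","four"]
  print (fmap length (paragraphs input)) -- Right 3
```

A single line break inside a paragraph is kept as part of that paragraph; only a break followed by at least one blank line starts a new one. Checking for a break before every character is not efficient, so this is meant as a starting point rather than a tuned parser.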