I have a simple attoparsec-based PDF parser. It works fine until I use it with iteratee and the size of the input exceeds the buffer size.

import qualified Data.ByteString as BS
import qualified Data.Iteratee as I
import qualified Data.Attoparsec as P
import qualified Data.Attoparsec.Iteratee as P
import System.Environment (getArgs)
import Control.Monad

import Pdf.Parser.Value

main :: IO ()
main = do
  [i] <- getArgs
  liftM (P.parseOnly parseValue) (BS.readFile i) >>= print  -- works
  I.fileDriverRandomVBuf 2048 (P.parserToIteratee parseValue) i >>= print  -- works
  I.fileDriverRandomVBuf 1024 (P.parserToIteratee parseValue) i >>= print  -- does NOT work!

Input:

<< /Annots [ 404 0 R 547 0 R ] /ArtBox [ 0.000000 0.000000 612.000000 792.000000 ] /BleedBox [ 0.000000 0.000000 612.000000 792.000000 ] /Contents [ 435 0 R 436 0 R 437 0 R 444 0 R 448 0 R 449 0 R 450 0 R 453 0 R ] /CropBox [ 0.000000 0.000000 612.000000 792.000000 ] /Group 544 0 R /MediaBox [ 0.000000 0.000000 612.000000 792.000000 ] /Parent 239 0 R /Resources << /ColorSpace << /CS0 427 0 R /CS1 427 0 R /CS2 428 0 R >> /ExtGState << /GS0 430 0 R /GS1 431 0 R /GS2 469 0 R /GS3 475 0 R /GS4 439 0 R /GS5 480 0 R /GS6 485 0 R /GS7 491 0 R /GS8 497 0 R >> /Font << /C2_0 447 0 R /T1_0 421 0 R /T1_1 422 0 R /T1_2 423 0 R /T1_3 424 0 R /T1_4 425 0 R /T1_5 426 0 R /T1_6 438 0 R >> /ProcSet [ /PDF /Text /ImageC /ImageI ] /Properties << /MC0 << /Metadata 502 0 R >> >> /XObject << /Fm0 451 0 R /Fm1 504 0 R /Fm2 513 0 R /Fm3 515 0 R /Fm4 517 0 R /Fm5 526 0 R /Fm6 528 0 R /Fm7 537 0 R /Fm8 539 0 R /Im0 540 0 R /Im1 541 0 R /Im2 452 0 R /Im3 542 0 R /Im4 543 0 R >> >> /Rotate 0 /StructParents 1 /TrimBox [ 0.000000 0.000000 612.000000 792.000000 ] /Type /Page >>

So, the parser works without iteratee, works with big enough chunks, but doesn't work with smaller chunks. Bug in iteratee? In attoparsec-iteratee? In my code? Is there any workaround? It is a really urgent issue for me.

Thanks.

Daniel Fischer
Yuras
  • No idea where the bug is, but is it possible to just use a large enough chunk size? Or to use `ByteString`s instead of `Iteratees`? – Daniel Fischer Jan 24 '12 at 23:33
  • A PDF value can be arbitrarily long, so no chunk size is large enough. Re ByteString: do you mean lazy IO? PDF requires random access, and the reference table is usually located at the end of the file. So lazy IO ~= "strict" in this particular case and will use memory inefficiently. – Yuras Jan 24 '12 at 23:45
  • Do `Iteratee`s allow random access? I've not heard of that (doesn't mean anything, I'm not a user). If you need random access, either read the entire file at once or have some scaffolding to seek and read parts of the file. If possible, the first option is **much** simpler. – Daniel Fischer Jan 24 '12 at 23:53
  • @DanielFischer: the iteratee package does allow random access (`zoom-cache` has a fairly sophisticated system built on top of this). – John L Jan 25 '12 at 00:25
  • Yes, iteratee supports `seek`. It's the most important feature for me compared with enumerators. Reading the entire file is inefficient -- megabytes of memory just to get the document author. And I'm too lazy to write a manual seek/read algorithm :) The issue in question is the only thing that doesn't work. – Yuras Jan 25 '12 at 00:30
  • @Daniel, ty for editing and fixing my question. And sorry my english :} – Yuras Jan 25 '12 at 00:31
  • @Yuras Your English is perfectly understandable, that's the important thing. And that makes it easy to fix the small errors to bring the question closer to perfection. – Daniel Fischer Jan 25 '12 at 00:41
  • What version of Attoparsec are you using? In versions prior to 0.10, parsers were not guaranteed to backtrack. Now, it [seems they are](http://hackage.haskell.org/packages/archive/attoparsec/latest/doc/html/Data-Attoparsec-ByteString.html#v:try). – Joey Adams Jan 25 '12 at 01:07
  • @JoeyAdams I'm using attoparsec-0.10.0.3. There is newer version on hackage, I'll try it tomorrow. – Yuras Jan 25 '12 at 01:16
  • Ah, you are using Attoparsec 0.10, given that you import `Data.Attoparsec.ByteString`. I thought you had some missing `try`s, but that is no longer relevant. – Joey Adams Jan 25 '12 at 01:18

1 Answer

Edit 2: I created a new parser in Pdf/Parser/Value

dictOrStream :: Parser PdfValue
dictOrStream = do
  dict <- parseDict
  P.skipSpace
  let s1 = do
            P.string $ fromString "stream"
            content <- P.manyTill P.anyWord8 $ P.endOfLine >> P.string (fromString "endstream")
            return $ PdfValStream (PdfStream dict (BS.pack content))
  s1 <|> return (PdfValDict dict)

then used this parser in parseValue. This works for all your cases. I don't know why choice fails to backtrack properly, maybe an attoparsec bug?

Edit: I notice that, if I replace your top-level parseValue with parseDict, it works. It also works if I remove parseStream from the choices in parseValue. I think attoparsec has committed to "parseStream" after the completion of the top-level dictionary, therefore it's expecting more input (a space, the "stream" token, etc.) leading to this error. At this point there's an ambiguity between these two parsing options that you'll need to resolve. I don't know why it works properly when the entire input is available; I would expect an error to be reported as when your parser is fed chunks.
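To make the ambiguity concrete, here is a minimal sketch on a toy grammar. Everything in it (`Value`, `dict`, `streamBody`, `overlapping`, `factored`) is hypothetical and is not the asker's `Pdf.Parser.Value`. It only contrasts two shapes: a `choice` whose alternatives both start by parsing a dictionary, and a left-factored parser in the style of `dictOrStream` that parses the dictionary once and then decides. It assumes a current attoparsec release, where I would expect both to agree on complete input, so treat it as an illustration of the refactoring rather than a reproduction of the failure.

```haskell
import Control.Applicative ((<|>))
import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified Data.ByteString.Char8 as BC

-- Hypothetical toy value type: a "dictionary" is just a list of integers.
data Value
  = Dict [Int]
  | Stream [Int] BC.ByteString
  deriving (Show, Eq)

-- Toy dictionary: "<< 1 2 3 >>"
dict :: P.Parser [Int]
dict =
  P.string (BC.pack "<<") *> P.skipSpace
    *> P.many' (P.decimal <* P.skipSpace)
    <* P.string (BC.pack ">>")

-- Toy stream body: the keyword "stream", then everything up to "endstream".
streamBody :: P.Parser BC.ByteString
streamBody =
  P.string (BC.pack "stream")
    *> (BC.pack <$> P.manyTill P.anyChar (P.string (BC.pack "endstream")))

-- Shape 1: both alternatives begin with `dict`, so the choice between them
-- can only be made after the whole dictionary has been parsed, and the
-- parser has to backtrack over it when no "stream" keyword follows.
overlapping :: P.Parser Value
overlapping =
  P.choice
    [ Stream <$> dict <* P.skipSpace <*> streamBody
    , Dict <$> dict
    ]

-- Shape 2: left-factored. The dictionary is parsed exactly once; only the
-- optional stream suffix is subject to backtracking.
factored :: P.Parser Value
factored = do
  d <- dict
  P.skipSpace
  (Stream d <$> streamBody) <|> pure (Dict d)

main :: IO ()
main = mapM_ check inputs
  where
    inputs = map BC.pack ["<< 1 2 >>", "<< 1 2 >> stream abc endstream"]
    check i = do
      print (P.parseOnly overlapping i)
      print (P.parseOnly factored i)
```

The left-factored form also avoids parsing the dictionary twice, which matters when dictionaries are large.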

As of now, I suspect a bug in either your code, or possibly attoparsec. I ran the following test by manually reading bytestring chunks and feeding them to your attoparsec parser:

*Main System.IO> h <- openFile "test.pdf" ReadMode
*Main System.IO Data.ByteString> let hget = hGetSome h 1024
*Main System.IO Data.ByteString> b <- hget
*Main System.IO Data.ByteString> let r = P.parse parseValue b
*Main System.IO Data.ByteString> r
Partial _
*Main System.IO Data.ByteString> b <- hget
*Main System.IO Data.ByteString> let r' = P.feed r b
*Main System.IO Data.ByteString> r'
Partial _
*Main System.IO Data.ByteString> b <- hget
*Main System.IO Data.ByteString> Data.ByteString.length b
0
*Main System.IO Data.ByteString> let r'2 = P.feed r' b
*Main System.IO Data.ByteString> r'2
Fail "<< /Annots [ 404 0 R 547 0 R ] /ArtBox [ 0.000000 0.000000 612.000000 792.000000 ] /BleedBox [ 0.000000 0.000000 612.000000 792.000000 ] /Contents [ 435 0 R 436 0 R 437 0 R 444 0 R 448 0 R 449 0 R 450 0 R 453 0 R ] /CropBox [ 0.000000 0.000000 612.000000 792.000000 ] /Group 544 0 R /MediaBox [ 0.000000 0.000000 612.000000 792.000000 ] /Parent 239 0 R /Resources << /ColorSpace << /CS0 427 0 R /CS1 427 0 R /CS2 428 0 R >> /ExtGState << /GS0 430 0 R /GS1 431 0 R /GS2 469 0 R /GS3 475 0 R /GS4 439 0 R /GS5 480 0 R /GS6 485 0 R /GS7 491 0 R /GS8 497 0 R >> /Font << /C2_0 447 0 R /T1_0 421 0 R /T1_1 422 0 R /T1_2 423 0 R /T1_3 424 0 R /T1_4 425 0 R /T1_5 426 0 R /T1_6 438 0 R >> /ProcSet [ /PDF /Text /ImageC /ImageI ] /Properties << /MC0 << /Metadata 502 0 R >> >> /XObject << /Fm0 451 0 R /Fm1 504 0 R /Fm2 513 0 R /Fm3 515 0 R /Fm4 517 0 R /Fm5 526 0 R /Fm6 528 0 R /Fm7 537 0 R /Fm8 539 0 R /Im0 540 0 R /Im1 541 0 R /Im2 452 0 R /Im3 542 0 R /Im4 543 0 R >> >> /Rotate 0 /StructParents 1 /TrimBox [ 0.000000 0.000000" [] "Failed reading: empty"

For some reason, your parser doesn't seem to like receiving data in chunks, and fails upon receiving the third (empty) chunk without consuming any input. I haven't yet figured out where your parser is going wrong, but it's definitely not iteratee or attoparsec-iteratee.
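The manual test above can also be scripted, so that the same parser is run against several chunk sizes without iteratee involved. The following is a minimal sketch, assuming a current attoparsec release; `intList`, `chunksOf` and `parseChunked` are hypothetical names introduced for illustration, and `intList` is a toy parser standing in for `parseValue`. It uses only `parse`, `feed` and `eitherResult` from attoparsec, and feeds a final empty chunk to signal end of input, as in the GHCi session.

```haskell
import qualified Data.Attoparsec.ByteString.Char8 as P
import qualified Data.ByteString.Char8 as BC
import Data.List (foldl')

-- Hypothetical toy parser (stand-in for parseValue): "[ 404 0 547 0 ]"
intList :: P.Parser [Int]
intList =
  P.char '[' *> P.skipSpace
    *> P.many' (P.decimal <* P.skipSpace)
    <* P.char ']'

-- Split a strict ByteString into pieces of at most n bytes (n must be > 0).
chunksOf :: Int -> BC.ByteString -> [BC.ByteString]
chunksOf n bs
  | BC.null bs = []
  | otherwise  = let (h, t) = BC.splitAt n bs in h : chunksOf n t

-- Run a parser incrementally: start with the first chunk, feed the rest,
-- then feed an empty chunk, which attoparsec treats as end of input.
parseChunked :: Int -> P.Parser a -> BC.ByteString -> P.Result a
parseChunked n p input =
  case chunksOf n input of
    []       -> P.feed (P.parse p BC.empty) BC.empty
    (c : cs) -> P.feed (foldl' P.feed (P.parse p c) cs) BC.empty

main :: IO ()
main = do
  let input = BC.pack "[ 404 0 547 0 ]"
  -- Whole input at once, as with parseOnly in the question.
  print (P.parseOnly intList input)
  -- Same input delivered in chunks of various sizes.
  mapM_ (\n -> print (n, P.eitherResult (parseChunked n intList input)))
        [1, 4, 1024]
```

If a parser gives a different answer for some chunk size than it does with `parseOnly`, the problem is in the parser or in attoparsec itself, not in the code that delivers the chunks.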

John L
  • You are right, looks like both the iteratee and attoparsec-iteratee have nothing to do with that. ty, John – Yuras Jan 25 '12 at 00:49
  • Could you please explain why it is ambiguous? I expect `parserDict` will fail if it doesn't find "stream", and `choice` will try the next option -- `parseDict`. – Yuras Jan 25 '12 at 01:19
  • Sorry, I meant that `parseStream` will fail if it doesn't find "stream". – Yuras Jan 25 '12 at 01:29
  • @Yuras - I meant it was ambiguous because, after the end of the first dictionary, either would still be a possible match. I also would expect `parseStream` to fail and for it to try the next option; I'm not certain why it isn't working. – John L Jan 25 '12 at 08:58
  • Thanks, the workaround works for me. I'll accept the answer since it fixes the issue for me. And I'll email the `attoparsec` maintainer and ask him to clarify the issue. – Yuras Jan 25 '12 at 11:53