I parse PDF files using Quartz.

Everything works fine except for one file, for which the callback functions are not called at all.

My operator table has been created, and I added operators to it with CGPDFOperatorTableSetCallback. Everything seems OK, except that the callbacks are never called.

Do you have any idea what could cause this behaviour?

bob
  • Did you start the scan process? What operators are you looking for? Maybe they are not present in that specific page. – iPDFdev Jun 18 '12 at 13:23
  • Sure, I start the scan process. I added a lot of operators, such as BT, ET, Tj, TJ, T*, etc. In short, the text-showing and text-positioning operators, but also other operators. At the very least BT and ET should trigger callbacks, because the file contains text. The file contains one page. – bob Jun 18 '12 at 14:09
  • You said: "Everything works fine except for one file". Does this mean that other PDF files work and only one does not work? – iPDFdev Jun 18 '12 at 15:02
  • It is possible the file does not contain the operators you look for. The text you see on the page might be just lines and curves and no actual text. Or the page content might be a large form XObject which needs to be parsed separately. If you can put the file somewhere I can take a look at it and give you more details. – iPDFdev Jun 18 '12 at 19:28
  • I checked, and you were right. I tried putting all operators in the table, and it turned out that some callbacks are called. Here is the list of operators whose callbacks are called: cs, scn, gs, re, W, n, Do. – bob Jun 19 '12 at 09:04
  • [Here is a link](http://i.minus.com/1340183800/S4AC4X10h9xzEBozyX-lLQ/dKhH81ClIRTFh/4.pdf) to a file which has the same problem. Thank you for your help. – bob Jun 19 '12 at 09:18

1 Answer

The page content is a single large form XObject. Form XObjects are self-contained graphic objects that have their own content stream, just like a page.
To reach the text, include the 'Do' operator in the list of scanned operators and handle it as follows:
1. The operand of 'Do' is the symbolic name of an XObject.
2. Get the 'Resources' dictionary from the page dictionary, and from it the 'XObject' dictionary.
3. Look up the XObject in the 'XObject' dictionary by that symbolic name.
4. Read the value of the XObject's 'Subtype' key. If it is 'Image', ignore the XObject because it is an image. If it is 'Form', you have a form XObject.
5. Get the stream from the form XObject and scan it the same way you scanned the page content stream.
You can reuse the same scanner class; you just need to keep a context so that you know which object you are scanning. Form XObjects can use other form XObjects, which are looked up in the parent form XObject's 'Resources' dictionary.
Your page dictionary looks like this:

<<
/ArtBox[0.0 0.0 768.0 7066.0]
/BleedBox[0.0 0.0 768.0 7066.0]
/Contents 29 0 R
/CropBox[0.0 0.0 768.0 7066.0]
/Group 62 0 R
/MediaBox[0.0 0.0 768.0 7066.0]
/Parent 23 0 R
/Resources
 <<
  /ExtGState<</GS0 30 0 R>>
  /XObject<</Fm0 61 0 R>>
 >>
/Rotate 0
/TrimBox[0.0 0.0 768.0 7066.0]
/Type/Page
>> 

'Fm0' is the name of the form XObject used in the page content stream, that is, the operand of the 'Do' operator. Its resources dictionary looks like this:

/Resources
 <<
  /ColorSpace<</CS0 32 0 R>>
  /ExtGState<</GS0 34 0 R/GS1 30 0 R>>
  /Font<</T1_0 38 0 R/T1_1 40 0 R>>
  /ProcSet[/PDF/Text]
  /XObject<</Fm0 45 0 R/Fm1 48 0 R/Fm2 51 0 R/Fm3 54 0 R/Fm4 57 0 R/Fm5 60 0 R>>
 >>

As you can see, it uses several other form XObjects, each of which has to be scanned in the same way.
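
The lookup described above can be modelled without any PDF library. The following is only a minimal sketch in Python, not Quartz code: `OBJECTS`, `PAGE`, `is_operand` and `scan` are hypothetical names, the content streams are invented, and the image entry `Im0` (object 99) is made up for illustration. Only the object numbers 61 and 45 follow the dictionaries quoted above. It shows why a context is needed: 'Fm0' resolves to object 61 on the page but to object 45 inside that form.

```python
# Hypothetical stand-ins for PDF objects, keyed by object number.
# Object numbers 61 and 45 mirror the dictionaries quoted above;
# the content streams and the image object 99 are invented.
OBJECTS = {
    45: {"Subtype": "Form", "Resources": {},
         "Content": "BT /T1_0 12 Tf (Hello) Tj ET"},
    99: {"Subtype": "Image"},
    61: {"Subtype": "Form",
         "Resources": {"XObject": {"Fm0": 45, "Im0": 99}},
         "Content": "q /Fm0 Do /Im0 Do Q"},
}
PAGE = {"Resources": {"XObject": {"Fm0": 61}},
        "Content": "/GS0 gs /Fm0 Do"}


def is_operand(token):
    """Toy classifier: names, strings and numbers are operands."""
    if token[0] in "/(":
        return True
    try:
        float(token)
        return True
    except ValueError:
        return False


def scan(content, resources, found, depth=0):
    """Scan a content stream and recurse into form XObjects on 'Do'.

    `resources` is the context: names are always resolved against the
    Resources dictionary of the stream currently being scanned.
    """
    operands = []
    for token in content.split():
        if is_operand(token):
            operands.append(token)
            continue
        if token == "Do" and operands:
            name = operands[-1][1:]  # strip the leading '/'
            xobject = OBJECTS[resources.get("XObject", {})[name]]
            if xobject["Subtype"] == "Form":
                # Nested streams use their own Resources dictionary.
                scan(xobject["Content"], xobject.get("Resources", {}),
                     found, depth + 1)
            # Image XObjects are ignored.
        elif token in ("BT", "ET", "Tj"):
            found.append((depth, token, list(operands)))
        operands.clear()
    return found


for depth, op, args in scan(PAGE["Content"], PAGE["Resources"], []):
    print(depth, op, args)
```

In this model the text operators only turn up at depth 2, which matches the symptom in the question: scanning the page stream alone triggers callbacks for operators such as gs and Do, but never for BT, ET or Tj.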

iPDFdev
  • Hi @iPDFdev, thanks for your comment. I've actually found that my PDF file responds to the "Do" operator. I have 2 pages, and the "Do" operator is called twice for each page. The PDF is simple text, and all I want is to extract this text as a string in Swift. Can you give me some direction on what to do? This is my PDF: http://www.filedropper.com/eula_2 – Dzior Jun 08 '16 at 10:08
  • @Dzior a library like PDFKitten might help you with text extraction. – iPDFdev Jun 09 '16 at 15:14
  • Ah, thanks for your answer. We actually got the business owner to change it to a text file. Thanks a lot! – Dzior Jun 09 '16 at 17:59