I am trying to read text from a PDF file. The callbacks below fire and print their messages, but I can't get any actual content out of the PDF.

    let pdfBundlePath = Bundle.main.path(forResource: "sample", ofType: "pdf")
    let pdfURL = URL.init(fileURLWithPath: pdfBundlePath!)
    let pdf = CGPDFDocument(pdfURL as CFURL)        

    let operatorTableRef = CGPDFOperatorTableCreate()

    CGPDFOperatorTableSetCallback(operatorTableRef!, "BT") { (scanner, info) in
        print("Begin text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "ET") { (scanner, info) in
        print("End text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "Tf") { (scanner, info) in
        print("Select font")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "Tj") { (scanner, info) in
        print("Show text")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "TJ") { (scanner, info) in
        print("Show text, allowing individual glyph positioning")
    }

    let page = pdf!.page(at: 1)
    let stream = CGPDFContentStreamCreateWithPage(page!)
    let scanner = CGPDFScannerCreate(stream, operatorTableRef, nil)
    CGPDFScannerScan(scanner)
    CGPDFScannerRelease(scanner)
    CGPDFContentStreamRelease(stream)

Output:

Begin text object
Select font
Show text, allowing individual glyph positioning
End text object

(The same output repeats ten or more times.)

But I am not sure how to get the actual string out of this. Any suggestions would be appreciated.

Hemang
  • Yes, I have the same trouble, and I would also like to search inside the PDF document. I suggest you read the Adobe specification to see what we are trying to do, and why it is not so easy :-) Creating a PDF parser is much more complicated than I had imagined. See https://www.slideshare.net/KazYoshikawa/extracting-text-from-pdf-ios to get an idea :-). Sorry, I am not able to give a better answer now; I am at the very beginning of my 'research' into how to do the same thing you would like to do. – user3441734 May 30 '17 at 12:47
  • Thanks for the tip. However, were you at least able to print something out of it? That's the top priority right now. – Hemang May 30 '17 at 15:56
  • If you can read ObjC, Apple published a guide on how to do that: https://developer.apple.com/library/content/documentation/GraphicsImaging/Conceptual/drawingwithquartz2d/dq_pdf_scan/dq_pdf_scan.html#//apple_ref/doc/uid/TP30001066-CH220-BAJHBBIE Not sure how close that is to solving your problem. – Code Different May 30 '17 at 15:57
  • @Hemang Unfortunately, not :-(. I decided to use a 'naive' approach: in TextEdit I wrote Hello, World!, exported it as PDF, and tried to open it in TextEdit again. For the last two days I have been trying to understand what is in there :-), following the Adobe specs ... Next, I did the same with an 'empty' document and with a document containing only one 'space'. The results are 'surprising'. The guide mentioned by Code Different didn't help me at all. – user3441734 May 30 '17 at 17:39
  • @user3441734: Each operator takes a number of operands which are on the scanner's "stack". Here is another project which demonstrates how to parse the contents: https://github.com/KurtCode/PDFKitten. It is also in Objective-C, but one should see the general idea. For example, the "show text" operators take a string operand which can be obtained by `CGPDFScannerPopString`. – But you'll have to read the PDF specification to see which operator takes what kind of operands, there is no way around it. – Martin R May 30 '17 at 18:44
  • @MartinR I understand. Unfortunately, it is not so easy. I have binary data there (a stream), which must first be decoded (I have just realized that the encoding is specified in the linked object dictionary as zlib/deflate, so I decoded it and now have a readable stream, as described in the Adobe specs :-), so I am on the right track). This job is done by Apple's CoreGraphics, but I would like my code to be independent of Apple. Now I can see the right operands and the (in my case Unicode, UTF-8 encoded) text entry. I am still at the very beginning, but this first step was probably the most important. – user3441734 May 30 '17 at 19:23
  • @user3441734: I don't want to discourage you, but writing your own PDF parser is a non-trivial task (the PDF spec is *huge*). The contents streams are usually flate compressed, but there are other compression methods as well. Sets of objects themselves can be compressed as "object stream". The streams can be encrypted. There are various encodings. ... – Martin R May 30 '17 at 19:39
  • @MartinR Yes, I know ... Now I am able to do what I have wanted to do for the last few days. I have a huge set of PDFs (thousands) where every page has a very well-defined 'page header'. I need to collect the text entries from this 'page header', and it works now :-) I don't have any ambition to write a PDF parser ... I need to do the job only once. – user3441734 May 30 '17 at 20:09
  • @MartinR It seems that I am finally on the right track using CoreGraphics. CGPDFScannerPopString works for the Tj operator; a solution for the TJ operator is in my answer. – user3441734 May 31 '17 at 10:47

1 Answer

I have a PDF containing the text "hello, world" (created with Export as PDF from TextEdit).

This callback function

    CGPDFOperatorTableSetCallback(operatorTableRef!, "TJ") { (scanner, info) in
        print("Show text, allowing individual glyph positioning")
        // TJ takes a single operand: an array mixing strings and positioning numbers.
        var pa: CGPDFArrayRef?
        withUnsafeMutablePointer(to: &pa, { (ppa) -> () in
            let r = CGPDFScannerPopArray(scanner, ppa)
            print("TJ", r)
            if r {
                let count = CGPDFArrayGetCount(ppa.pointee!)
                var j = 0  // counts only the string elements
                for i in 0..<count {
                    var str: CGPDFStringRef?
                    // Returns false for the numeric elements, which are skipped.
                    let r = CGPDFArrayGetString(ppa.pointee!, i, &str)
                    if r {
                        let string = String(cString: CGPDFStringGetBytePtr(str!)!)
                        print(string, i, j)
                        j += 1
                    }
                }
            }
        })
    }

prints:

Show text, allowing individual glyph positioning
TJ true
h 0 0
e 2 1
l 4 2
l 6 3
o 8 4
, 10 5
  12 6
w 14 7
o 16 8
rl 18 9
d 20 10

I think this demonstrates that getting the string is possible :-), at least for the Latin alphabet.

For the Tj operator, the callback function can be as simple as:

    CGPDFOperatorTableSetCallback(operatorTableRef!, "Tj") { (scanner, info) in
        print("Show text")
        var text: CGPDFStringRef?
        withUnsafeMutablePointer(to: &text, { (p) -> () in
            let r = CGPDFScannerPopString(scanner, p)
            if r {
                let string = String(cString: CGPDFStringGetBytePtr(p.pointee!)!)
                print(string)
            }
        })
    }

WARNING! To display all characters properly, it is necessary to use the font information, but that is a different story. For Latin characters this solution should work as is.
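A slightly more defensive conversion is sketched below. `String(cString:)` relies on a terminating NUL byte, so this version first asks Core Graphics to decode the string via `CGPDFStringCopyTextString` and otherwise falls back to the raw bytes with their explicit length from `CGPDFStringGetLength`. The helper names `latin1Text` and `decode` are hypothetical, and choosing Latin-1 for the fallback is an assumption; neither path applies the font's own encoding, so the warning above still holds.

```swift
import CoreGraphics
import Foundation

/// Hypothetical fallback: interprets raw bytes as Latin-1 (an assumption, not the font's encoding).
func latin1Text(_ data: Data) -> String {
    return String(data: data, encoding: .isoLatin1) ?? ""
}

/// Hypothetical helper: converts a PDF string object to a Swift String.
func decode(_ pdfString: CGPDFStringRef) -> String? {
    // Let Core Graphics decode the string object first.
    if let cfString = CGPDFStringCopyTextString(pdfString) {
        return cfString as String
    }
    // Otherwise read exactly `length` bytes instead of scanning for a terminator.
    guard let bytes = CGPDFStringGetBytePtr(pdfString) else { return nil }
    let data = Data(bytes: bytes, count: CGPDFStringGetLength(pdfString))
    return latin1Text(data)
}
```

Inside the callbacks, `decode(str!)` could then replace the `String(cString:)` call.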

To 'extract' all strings, all text-showing operators must be implemented (in the PDF specification these are Tj, TJ, ' and ").
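A minimal sketch of handling all four text-showing operators in one place follows. It is not a complete extractor: it ignores fonts, encodings and positioning, and the names `TextCollector`, `text(of:)`, `collectString`, `collectArray` and `extractStrings(from:)` are hypothetical. Because the callbacks are C function pointers and cannot capture Swift variables, the results are passed out through the scanner's `info` pointer.

```swift
import CoreGraphics
import Foundation

/// Hypothetical accumulator handed to the callbacks through the `info` pointer.
final class TextCollector {
    var pieces: [String] = []
}

/// Converts a PDF string to text; the font-encoding caveat still applies.
func text(of pdfString: CGPDFStringRef) -> String {
    return CGPDFStringCopyTextString(pdfString).map { $0 as String } ?? ""
}

/// Handles Tj, ' and ": each has the string as its topmost operand.
/// The two spacing numbers that " also takes are simply ignored here.
func collectString(_ scanner: CGPDFScannerRef, _ info: UnsafeMutableRawPointer?) {
    guard let info = info else { return }
    var operand: CGPDFStringRef?
    guard CGPDFScannerPopString(scanner, &operand), let string = operand else { return }
    let collector = Unmanaged<TextCollector>.fromOpaque(info).takeUnretainedValue()
    collector.pieces.append(text(of: string))
}

/// Handles TJ: joins the string elements of the array and skips the numbers.
func collectArray(_ scanner: CGPDFScannerRef, _ info: UnsafeMutableRawPointer?) {
    guard let info = info else { return }
    var operand: CGPDFArrayRef?
    guard CGPDFScannerPopArray(scanner, &operand), let items = operand else { return }
    var joined = ""
    for i in 0..<CGPDFArrayGetCount(items) {
        var element: CGPDFStringRef?
        if CGPDFArrayGetString(items, i, &element), let string = element {
            joined += text(of: string)
        }
    }
    let collector = Unmanaged<TextCollector>.fromOpaque(info).takeUnretainedValue()
    collector.pieces.append(joined)
}

/// Scans one page and returns the operands of its text-showing operators.
func extractStrings(from page: CGPDFPage) -> [String] {
    guard let table = CGPDFOperatorTableCreate() else { return [] }
    for name in ["Tj", "'", "\""] {
        CGPDFOperatorTableSetCallback(table, name, collectString)
    }
    CGPDFOperatorTableSetCallback(table, "TJ", collectArray)

    let collector = TextCollector()
    let stream = CGPDFContentStreamCreateWithPage(page)
    let scanner = CGPDFScannerCreate(stream, table, Unmanaged.passUnretained(collector).toOpaque())
    CGPDFScannerScan(scanner)
    CGPDFScannerRelease(scanner)
    CGPDFContentStreamRelease(stream)
    CGPDFOperatorTableRelease(table)
    return collector.pieces
}
```

With the document from the question, this would be called as `extractStrings(from: pdf!.page(at: 1)!)`. Joining the TJ fragments into one string per operator is a design choice made here for readability; the numbers in the array could instead be inspected to guess where word spaces belong.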

UPDATE: Because PDFKit is available on both Apple platforms (on iOS since iOS 11), I suggest using it for text extraction. The process is much more straightforward.
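As a rough illustration of the PDFKit route, the sketch below reads the text of every page through `PDFDocument` and `PDFPage.string`. The function name `pageTexts(of:)` is hypothetical, and how well the returned text matches the visual layout depends on the PDF.

```swift
import Foundation
import PDFKit

/// Hypothetical helper: returns the text of each page, or nil if the file cannot be opened as a PDF.
func pageTexts(of url: URL) -> [String]? {
    guard let document = PDFDocument(url: url) else { return nil }
    // PDFKit page indices start at 0, unlike CGPDFDocument, where the first page is 1.
    return (0..<document.pageCount).map { index in
        document.page(at: index)?.string ?? ""
    }
}
```

For the bundled file from the question, the URL could be obtained with `Bundle.main.url(forResource: "sample", withExtension: "pdf")`.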

user3441734