4

I'm writing a PDF reader application for iPhone.

I know how to display a PDF file in a view using the CGPDF* APIs in iOS.

What I want to do now is search for text in a PDF file and highlight the matches. So I need a library that can tell me which text is at which position. The library also needs to handle Unicode and Chinese characters.

I've searched for a few days but still haven't found anything suitable.

I've tried xpdf, but it is written in C++, and I don't know how to use C++ code in an iPhone app.

I've also tried http://www.codeproject.com/KB/cpp/ExtractPDFText.aspx but it does not handle Chinese characters.

I've tried to write the code myself, but text encoding in PDF is really complicated.

For example, I don't know what to refer to in order to decode text that uses the following font:

8 0 obj
<< /Type /Font /Subtype /Type0 /Encoding /Identity-H /BaseFont /RNXJTV+PMingLiU
/DescendantFonts [ 157 0 R ] >>
endobj

157 0 obj
<< /Type /Font /Subtype /CIDFontType2 /BaseFont /RNXJTV+PMingLiU /CIDSystemInfo
<< /Registry (Adobe) /Ordering (CNS1) /Supplement 0 >> /FontDescriptor 158 0 R
/W 161 0 R /DW 1000 /CIDToGIDMap 162 0 R >>
endobj

158 0 obj
<< /Type /FontDescriptor /Ascent 801 /CapHeight 711 /Descent -199 /Flags 32
/FontBBox [0 -199 999 801] /FontName /RNXJTV+PMingLiU /ItalicAngle 0 /StemV
0 /Leading 199 /MaxWidth 1000 /XHeight 533 /FontFile2 159 0 R >>
endobj
user549683
  • C++ in an iPhone app: use Objective-C++. Try using the `.mm` extension and go from there. Here is a link to get you started: http://iphonedevelopertips.com/cpp/c-on-iphone-part-1.html – Richard J. Ross III Jan 03 '11 at 12:51
  • Thanks a lot! I made some changes and finally got the C++ library working. Chinese characters decode correctly! Now I'm trying to work out where to get the position information. – user549683 Jan 20 '11 at 04:41
  • Hey! I also want to search for text in a PDF file and highlight the matches. Which approach did you choose in the end? Is it working? – János May 18 '11 at 06:12

3 Answers

4

Take a look at the CGPDFScanner type; it can be used to parse through a PDF document for strings and particular PDF operators.

Jonathan Grynspan
3

This code has some bugs, but they are easy to fix. It is well-presented Objective-C code.

https://github.com/KurtCode/PDFKitten

Naveen Thunga
0

CGPDFScanner can only scan the PDF contents; there is no way to find the location of a word in the PDF, so highlighting is not possible using the CGPDF functions. The scanner output is also still encoded text for FlateDecode and other types of PDF. It can only scan simple PDFs, i.e. linearized PDFs. (Open the PDF as a text file and you will find the word "Linearized" near the top.) A possible solution would be to use a C or C++ parsing library, if you can find one. The C++ project from CodeProject will only parse the content; it won't give any location information. Writing a PDF parser on your own is complex because PDF formats are complicated and not fixed. PDF content can be encoded in different ways, such as FlateDecode.

Snehal
  • 2
    It is definitely possible to find out the positions of words on the page using CGPDFScanner (I've developed an app that does this), it's just *a lot* of work. Your comment about it only being able to scan "simple" pdfs is incorrect, it handles pretty much every pdf. Also, linearized pdf is not some simplified form of pdf, it is a variant specifically optimized for streaming, has nothing to do with the encoding. – omz Jan 15 '11 at 04:40
  • Oh, that's great and good to hear. I tried this a lot and came to that conclusion. Thanks for adding a comment saying that it is possible. Could you please guide me on how to search for a word in a PDF? I really need your help. – Snehal Jan 17 '11 at 06:36
  • Scanning for the "Tm" operator gives you six numbers (the text matrix), which relate to the position of the text. You may also try the "cm" operator and look up the width information in the font. CGPDFStreamCopyData can decode streams compressed with FlateDecode. – user549683 Jan 20 '11 at 04:32
  • OK, thanks for your answer, but I am a bit confused. Will I have to write a callback for the "Tm" operator, as is done for the "MP" operator in the Apple documentation, and then scan the contents in the callback? – Snehal Jan 21 '11 at 11:10
  • Yes. Use CGPDFScannerPopNumber in the callback to pop each number. – user549683 Jan 24 '11 at 06:13
  • I added the callback for the Tm operator to the table, but I get a compilation error saying the callback is unrecognized. – Snehal Jan 27 '11 at 11:45
  • You need to define the callback somewhere above where you use it. [See this](http://www.random-ideas.net/posts/42) – user549683 Feb 14 '11 at 09:24
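
To make the "six numbers" from the comments above concrete, here is a minimal sketch in plain C of the arithmetic that turns them into a position. It assumes the six operands of Tm and cm have already been collected (for example with CGPDFScannerPopNumber, which pops operands last-first) into the usual PDF matrix order [a b c d e f]. The type and function names (`PDFMatrix`, `pdf_matrix_concat`, and so on) are made up for illustration and are not part of CoreGraphics. The width calculation is simplified: it ignores character spacing, word spacing, horizontal scaling, text rise and TJ adjustments, so treat it as a starting point rather than a complete implementation.

```c
/* A PDF transformation matrix [a b c d e f], i.e. the six operands of
   the Tm (text matrix) or cm (current transformation matrix) operator.
   Hypothetical helper type, not a CoreGraphics type. */
typedef struct { double a, b, c, d, e, f; } PDFMatrix;

typedef struct { double x, y; } PDFPoint;

/* Returns m x n, meaning "apply m first, then n". PDF uses row vectors:
   [x' y' 1] = [x y 1] x M. */
PDFMatrix pdf_matrix_concat(PDFMatrix m, PDFMatrix n) {
    PDFMatrix r;
    r.a = m.a * n.a + m.b * n.c;
    r.b = m.a * n.b + m.b * n.d;
    r.c = m.c * n.a + m.d * n.c;
    r.d = m.c * n.b + m.d * n.d;
    r.e = m.e * n.a + m.f * n.c + n.e;
    r.f = m.e * n.b + m.f * n.d + n.f;
    return r;
}

/* Maps a point through a matrix. */
PDFPoint pdf_matrix_apply(PDFMatrix m, PDFPoint p) {
    PDFPoint r;
    r.x = m.a * p.x + m.c * p.y + m.e;
    r.y = m.b * p.x + m.d * p.y + m.f;
    return r;
}

/* Sums glyph widths (in 1/1000 text-space units, as stored in the font's
   /W array or /DW default) and scales them by the font size from Tf.
   Simplification: spacing and scaling parameters are ignored. */
double pdf_run_advance(const double *widths, int count, double font_size) {
    double total = 0.0;
    for (int i = 0; i < count; i++) {
        total += widths[i];
    }
    return total / 1000.0 * font_size;
}

/* Computes where a run of text starts and ends on its baseline, in the
   coordinate space that ctm maps into. Text space is mapped through Tm
   first and then through the CTM. */
void pdf_run_baseline(PDFMatrix tm, PDFMatrix ctm, double advance,
                      PDFPoint *start, PDFPoint *end) {
    PDFMatrix combined = pdf_matrix_concat(tm, ctm);
    PDFPoint origin = { 0.0, 0.0 };
    PDFPoint tip = { advance, 0.0 };
    *start = pdf_matrix_apply(combined, origin);
    *end = pdf_matrix_apply(combined, tip);
}
```

For example, with Tm = [1 0 0 1 72 700], an identity CTM, and two glyphs of width 1000 at font size 12, the run starts at (72, 700) and ends at (96, 700). A highlight rectangle could then be built around that baseline using the font's ascent and descent from its FontDescriptor.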