I am using the Foxit SDK to extract text from PDF documents.

Everything works for English, but when I extract text from a PDF in another language I don't get the correct output.

I have also tried PDFBox in Java, but its output is worse than what the Foxit SDK gives me.

Are there any other libraries that can solve this issue, or is there some other solution?

gprathour
Tushar Agarwal
  • Have you tried this? http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET – Shoaib Shaikh Jan 27 '12 at 06:00
  • @ShoaibShaikh Yes, I have tried this, but it only works for PDFs in English; for anything else it gives a blank output. :( – Tushar Agarwal Jan 27 '12 at 07:19
  • I guess you will have to modify the PDF parsing algorithm: identify the Unicode character range and extract the selected area. This is the PDF parser used in the article I mentioned; you will have to modify it: http://www.codeproject.com/Articles/7056/Code-to-extract-plain-text-from-a-PDF-file – Shoaib Shaikh Jan 27 '12 at 07:33
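
As a rough way to act on the "identify the Unicode char range" suggestion, the sketch below counts which Unicode scripts appear in a string that some extractor has already returned. It does not extract anything from a PDF itself and uses only the Java standard library; the class name `ScriptReport` and its methods are hypothetical helpers, not part of the Foxit SDK, PDFBox, or any other PDF library. The idea is only to tell apart "blank output", "Latin-only output", and "output in the expected script" when comparing libraries.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical diagnostic helper (an assumption for illustration only).
class ScriptReport {

    // Count letters per Unicode script in text returned by an extractor.
    static Map<Character.UnicodeScript, Integer> countScripts(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        // Iterate by code point so characters outside the BMP are not split.
        text.codePoints()
            .filter(cp -> Character.isLetter(cp))
            .forEach(cp -> counts.merge(Character.UnicodeScript.of(cp), 1, Integer::sum));
        return counts;
    }

    // True when the text contains no letters at all, e.g. a blank result.
    static boolean looksEmpty(String text) {
        return countScripts(text).isEmpty();
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a PDF page by any library.
        String extracted = "Hello \u041f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(countScripts(extracted));
        System.out.println("empty: " + looksEmpty(extracted));
    }
}
```

If a page that should contain, say, Devanagari text reports only `LATIN` letters or comes back empty, the extractor is probably not mapping the font's glyphs to Unicode, which is usually where non-English extraction goes wrong.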

3 Answers

0

If you are on Windows, you can use the IFilter that Adobe provides. I used the one that ships with Adobe Reader 8. Here is a link to the exact example I used:

http://www.codeproject.com/Articles/13391/Using-IFilter-in-C

The performance was okay, I think (I haven't used many other methods): it takes about 15 seconds for a 400-page PDF.

0

Personally, I think that if you want it done right you have to pay for it. ComponentOne has a PDFViewer for WPF. I'm not sure which framework you're working with, since your tags don't mention one.

ComponentOne PDF Viewer for WPF

MyKuLLSKI
0

You might want to try the trial version of Quick PDF Library to see how it performs on your documents. http://www.quickpdflibrary.com

QP.GetPageText(7) or GetPageText(8) returns pretty good results for most PDF files.

Andrew.

Disclaimer: I do some consulting work for Quick PDF Library.

Andrew Cash