
I am converting PDF files to text using iTextSharp. However, I found that if a PDF has embedded fonts or OpenType fonts, I cannot get the text out of it. Is there a solution for this? I just need to convert the files to text. Any help is appreciated. Thanks!

Drew

2 Answers


As someone who processes thousands of random PDFs from all sorts of diverse clients each month, XpdfText is by far the best library for extracting text, in my experience. We also use iTextSharp for various tasks, but haven't found it nearly as good for extracting text.

Mark S. Rasmussen
  • good call, but it's important to note that there will be errors. There is no perfect OCR library out there. – deltree Mar 21 '12 at 18:14
  • This is not using OCR. As long as the fonts are embedded the source text can be extracted. OCR is only needed if the PDF contains non-system fonts that are only embedded as glyphs, or if the text is embedded in image form. – Mark S. Rasmussen Mar 21 '12 at 18:24
  • Thanks Mark! I believe this is what I am looking for. They don't have a trial download on their site, hopefully I can give it a shot before purchasing. – Drew Mar 21 '12 at 18:39
  • This one, too, returns illegible characters if the PDF has embedded fonts. I am wondering if OCR will be the answer after all; I just need to find a reliable library. It would be a pain, but perhaps I will need to convert to an image and then get the text that way. – Drew Mar 23 '12 at 12:09
  • Are you absolutely sure the PDF contains embedded fonts for the text you're looking at? You get the blocky/weird characters in the case that the fonts are not embedded. Embedded fonts are used to map the visual display to copyable characters, not to actually display the font. – Mark S. Rasmussen Mar 23 '12 at 13:38

Short answer

Most probably the files are not produced with enough information for proper text extraction.

Please have a look at my longer answer for a somewhat related question.
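
To illustrate what "enough information" means here, below is a minimal sketch in Python. It is not iTextSharp or any real PDF library: `extract_text`, `codes`, and `cmap` are hypothetical names, and the dictionary merely stands in for the code-to-Unicode mapping (such as a ToUnicode CMap) that a font can carry. The idea it tries to show is that a page only stores character codes, and without a mapping an extractor can do no better than guess.

```python
# Hypothetical illustration only: these names and data are made up
# and do not correspond to any PDF library API.

def extract_text(codes, to_unicode=None):
    """Turn character codes from a page into text.

    codes: sequence of integer character codes as drawn on the page.
    to_unicode: optional dict mapping code -> Unicode string
                (stands in for a font's code-to-Unicode mapping).
    """
    if to_unicode is None:
        # No mapping available: guess by treating each code as a
        # Latin-1 code point, which usually yields garbage.
        return "".join(chr(c) for c in codes)
    # U+FFFD marks codes that the mapping does not cover.
    return "".join(to_unicode.get(c, "\ufffd") for c in codes)


# Assumption for the example: a subsetted font that numbers its
# glyphs arbitrarily (1, 2, 3, ...) instead of using ASCII codes.
codes = [1, 2, 3, 3, 4]
cmap = {1: "H", 2: "e", 3: "l", 4: "o"}

with_map = extract_text(codes, cmap)   # readable text
without_map = extract_text(codes)      # control characters
```

With the mapping the codes decode to readable text; without it the same codes come out as control characters, which is roughly what illegible extractor output looks like. In that situation no extraction library can recover the text from the codes alone, and OCR on the rendered page is the usual fallback.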

Bobrovsky