I'm working on a script that extracts data from a large PDF file (40 to 60+ pages long). The file isn't in English; it contains Greek characters. Everything seems fine until I call PyPDF2's extractText() function to get a given page's contents, at which point it returns an empty string.

I'm new to this library and I don't know how to fix this problem.

gemgr

1 Answer

PyPDF2's extractText() looks like it will either work just fine or fail completely. There are no parameters you can pass in to try to get things to work properly. It'll work or it won't.

You may not be able to fix this problem. If you can successfully copy/paste the text in Acrobat/Reader, then it's possible to extract the text. So what happens when you try to copy/paste out of Reader? Don't try this with some other third-party PDF viewer; use Adobe software. You'll probably have to abandon PyPDF2 and move on to some other PDF API, but if Reader can do it, it's a fixable problem.

There are three different things in a PDF that can look like letters to the human eye.

  1. Letters in the PDF in some text encoding. There are several fixed encodings, plus PDF allows you to embed your own custom encodings (often used with font subsets). Software can create PDFs that look fine but can't really be copy/pasted from, even by Adobe.
  2. Path art that just happens to look an awful lot like letters. "Start drawing a line here, draw a straight line to there, then a curve like this to there" and so on. If you're curious, PDF uses Bézier curves to define its curves. Not terribly related to your question, but interesting.
  3. Bitmaps (JPEG, GIF, and similar images) that define a grid of pixels.

In the past, Reader has only been able to handle text type 1 above, and then only if the text was encoded properly. Broken custom encodings are alarmingly common (or were 7+ years ago when I stopped working on PDF software).
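
To make "custom encoding" a bit more concrete: a PDF font can carry a ToUnicode CMap, a small table that maps the font's glyph codes back to Unicode. The sketch below parses the `beginbfchar` entries of such a table and decodes a string of glyph codes with it. Everything here is illustrative: `SAMPLE_CMAP` is a made-up fragment, `parse_bfchar` and `decode_codes` are hypothetical helper names, and the code assumes one-byte glyph codes and ignores `bfrange` entries, which real files often use.

```python
import re

# Hypothetical ToUnicode CMap fragment (an assumption for illustration):
# glyph code 0x01 maps to Greek alpha, 0x02 to Greek beta.
SAMPLE_CMAP = """
2 beginbfchar
<01> <03B1>
<02> <03B2>
endbfchar
"""

BFCHAR_BLOCK = re.compile(r"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_PAIR = re.compile(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_text):
    """Return {glyph code: Unicode string} for every bfchar entry found."""
    mapping = {}
    for block in BFCHAR_BLOCK.findall(cmap_text):
        for src, dst in BFCHAR_PAIR.findall(block):
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(codes, mapping):
    """Map one-byte glyph codes to text; unmapped codes become U+FFFD."""
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(SAMPLE_CMAP)
# The third code (0x03) has no entry, so it decodes to the replacement character.
decoded = decode_codes(b"\x01\x02\x03", mapping)
```

If the table is missing or wrong, every code falls through to the replacement character (or to the wrong letter), which is roughly what a "broken custom encoding" looks like from the extractor's side: the page renders correctly because the glyph shapes are intact, but the text that comes out is empty or garbage.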

With broken type 1s, and all of 2 and 3, the only thing you can do is run OCR (Optical Character Recognition) on the PDF. There are several open source OCR projects out there, as well as commercial ones.
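
One way to combine the two approaches is to try normal text extraction first and fall back to OCR only for pages where it comes back empty. The following is a minimal sketch of that control flow, not a definitive implementation. The function name `extract_with_fallback`, the `min_chars` threshold, and the stand-in data are all assumptions; the two extractors are passed in as callables so the logic does not depend on any particular PDF or OCR library.

```python
def extract_with_fallback(pages, extract_text, run_ocr, min_chars=1):
    """Return one string per page, using OCR where text extraction yields too little."""
    results = []
    for page in pages:
        text = (extract_text(page) or "").strip()
        if len(text) < min_chars:
            # Text extraction gave nothing usable, so try OCR on this page.
            text = (run_ocr(page) or "").strip()
        results.append(text)
    return results


# Stand-in data (hypothetical): the second page has no usable text layer.
fake_pages = ["page-with-text", "page-without-text"]
fake_text_layer = {"page-with-text": "hello", "page-without-text": ""}
fake_ocr_output = {
    "page-with-text": "unused",
    "page-without-text": "\u03b3\u03b5\u03b9\u03b1",
}

texts = extract_with_fallback(fake_pages, fake_text_layer.get, fake_ocr_output.get)
```

In a real script, `extract_text` might wrap PyPDF2's `extractText()`, and `run_ocr` might render the page to an image and pass it to an OCR library such as pytesseract (for example `pytesseract.image_to_string(image, lang="ell")`, assuming Tesseract's Greek language data is installed). Note that an emptiness check is crude: a page that mixes extractable English with unextractable Greek would pass the check and still lose the Greek text.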

Mark Storer
  • Thank you so much for your answer, interesting stuff. I haven't tried to do this in Adobe software, but as soon as I do, I will let you know if I can copy/paste the text from the file. – gemgr Feb 24 '20 at 14:13
  • I was able to copy and paste the text from the original PDF file, as well as from the test/dummy PDF file I use for testing. In both cases I was able to do it with Adobe Reader 9 (Version 9.5.5 04/26/2013) on a Linux machine and Adobe Acrobat Reader DC (Version 2020.006.20034) on a Windows 10 machine. So I'm very curious to see what the problem is. – gemgr Feb 24 '20 at 15:50
  • **Update:** It looks like the problem is with the Greek characters. As soon as I edited the test/dummy PDF file to include some English characters and searched for that keyword, it worked fine. No Greek characters show up in the debug print statement I put in place to see what data gets extracted; only the English keyword shows up. – gemgr Feb 24 '20 at 16:24
  • It sounds like PyPDF2's extractText() doesn't support some encoding types, and your file uses one (or more) of them. If you really want to know what's going on in there, you'll need to `print(page.getContents())`, download a copy of the PDF spec (google "PDF Spec"), and have a look at chapter 9.10, "Extraction of Text Content". – Mark Storer Feb 24 '20 at 17:54
  • OK, thank you very much for your help. I will go for the OCR approach, as I have already started implementing that. Any suggestions for a good library would be welcome. With a little help I found the **pytesseract** library and started with that. – gemgr Feb 24 '20 at 19:07
  • The library I'm most familiar with is a Java/C# library called iText. I haven't done any Python/PDF development at all (and started working with Python less than a month ago). Sorry I couldn't be of more help. – Mark Storer Feb 25 '20 at 20:31