pyPdf unable to extract text from some pages in my PDF

Question

I'm trying to use pyPdf to extract and print pages from a multipage PDF. Problem is, text is not extracted from some pages. I've put an example file here:

http://www.4shared.com/document/kmJF67E4/forms.html

If you run the following, the first 81 pages return no text, while the final 11 extract properly. Can anyone help?

from pyPdf import PdfFileReader
input1 = PdfFileReader(file("forms.pdf", "rb"))
for page in input1.pages:
    print page.extractText()

score 10 · Answer 1 · answered Nov 17 '10 at 11:04

Note that extractText() still has problems extracting the text properly. From the documentation for extractText():

This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

Since it is the text you want, you can use the Linux command pdftotext.

To invoke that using Python, you can do this:

>>> import subprocess
>>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])

The text is extracted from forms.pdf and saved to output.

This works in the case of your PDF file and extracts the text you want.
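
If you also need to know which page each piece of text came from, one option is to send the pdftotext output to standard output and split it on form feed characters, which pdftotext uses as page breaks unless -nopgbrk is given. A minimal sketch, assuming pdftotext is on the PATH and Python 3 is used; the helper names split_pages and extract_pages are made up for illustration:

```python
import subprocess


def split_pages(text):
    # pdftotext marks page breaks with a form feed character (ASCII 12).
    pages = text.split("\f")
    # Drop the empty piece left over after a trailing form feed.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def extract_pages(pdf_path):
    # "-" as the output file name makes pdftotext write to stdout;
    # -enc UTF-8 requests UTF-8 output so the bytes can be decoded reliably.
    raw = subprocess.check_output(["pdftotext", "-enc", "UTF-8", pdf_path, "-"])
    return split_pages(raw.decode("utf-8", errors="replace"))


# Usage (requires pdftotext to be installed):
# pages = extract_pages("forms.pdf")
# pages[0] would then hold the text of the first page.
```

Keeping the splitting in its own function means it can be checked without a PDF at hand.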

user225312

Thanks for your help. I'd tried pdftotext and passed it over as it only partially solves the problem. I need to split the PDF into separate files on the basis of UIDs which are found on each page. However the last 10 or so pages, which pyPdf can extract, don't have textual page labels, so using pdftotext, while it gives me all the text, doesn't give me a way of generating a list of pages for a given UID. – DrJAKing Nov 17 '10 at 11:26
This doesn't do a bad job of outputting the PDF's text, but does not preserve table formatting. – s2t2 Jul 13 '17 at 19:50

score 2 · Answer 2 · answered Jan 27 '11 at 13:48

This isn't really an answer, but the problem with pyPdf is that it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When a PDF contains non-ASCII characters, there's probably a CMap in use, and sometimes one is used even when there are no non-ASCII characters. When pyPdf encounters strings that are not in a standard Unicode encoding, it just sees a sequence of bytes; it can't convert those bytes to Unicode, so it gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time-consuming, but I hope to send a patch to the maintainer some time around mid-2011.
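
To make the idea concrete, here is a toy illustration of what a CMap lookup does. It is not pyPdf code and does not parse a real CMap: the table TO_UNICODE and the function decode_with_cmap are hypothetical, and single-byte character codes are assumed (real CMaps can use multi-byte codes).

```python
# Hypothetical table: character code as stored in the PDF string -> Unicode text.
TO_UNICODE = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}


def decode_with_cmap(raw, cmap):
    # Look up each byte in the table; codes without an entry become
    # U+FFFD (the Unicode replacement character).
    return "".join(cmap.get(code, "\ufffd") for code in bytearray(raw))


print(decode_with_cmap(b"\x01\x02\x03\x03\x04", TO_UNICODE))  # prints Hello
```

Without the table, the bytes 01 02 03 03 04 carry no readable text at all, which matches the symptom of empty or garbled output.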

score 1 · Answer 3 · answered Nov 17 '10 at 12:26

You could also try the pdfminer library (also in Python) and see if it's better at extracting the text. For splitting, however, you will have to stick with pyPdf, as pdfminer doesn't support that.

Steven

I have tried pdfminer... the latter pages don't get extracted properly for some reason. – DrJAKing Nov 17 '10 at 15:25

score 1 · Answer 4 · answered Nov 17 '10 at 13:13

I find it sometimes useful to convert it to PS (try both pdf2ps and pdftops, as their output can differ), then back to PDF (ps2pdf). Then try your original script again.

Danosaure

I was hopeful, but all it seems to do is make the original file bigger and slow down the extraction of null text! – DrJAKing Nov 17 '10 at 15:23

score 1 · Answer 5 · answered Jan 08 '18 at 10:05

I had a similar problem with some PDFs, and on Windows this works very well for me:

1.- Download Xpdf tools for Windows

2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64

3.- Use subprocess to run the command from the console:

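import subprocess

filePath = 'forms.pdf'  # example value; set this to the PDF you want to convert

try:
    # '-' tells pdftotext to write the extracted text to stdout;
    # the path is quoted so that names containing spaces survive the shell
    extInfo = subprocess.check_output('pdftotext.exe "' + filePath + '" -', shell=True, stderr=subprocess.STDOUT).strip()
except Exception as e:
    print(e)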

Improvements: use shell=False and don't redirect stderr to stdout, or you risk getting the two mixed up. – Massimo Jan 04 '20 at 10:11
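
The improvement suggested in that comment might look roughly like the following sketch: the command is passed as a list so no shell is involved, and stdout and stderr are captured separately. The helper name run_capture is made up for illustration, and the usage lines assume pdftotext is on the PATH.

```python
import subprocess


def run_capture(args):
    # Passing a list runs the program directly, without a shell.
    # stdout and stderr get separate pipes so they cannot be mixed up.
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err


# Usage (requires pdftotext to be installed):
# code, text, errors = run_capture(["pdftotext", "forms.pdf", "-"])
# if code != 0:
#     print(errors)
```

Because the arguments are a list, a file path containing spaces needs no extra quoting.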

score 0 · Answer 6 · answered Nov 17 '10 at 15:40

I'm starting to think I should adopt a messy two-part solution. There are two sections to the PDF: pp 1-82, which have text page labels that pdftotext can extract, and pp 83-end, which have no page labels but which pyPdf can extract, and pyPdf explicitly knows which page it is on.

I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.
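
A rough sketch of that combination, under several assumptions: each extractor has already produced a list with one text string per page, both lists cover the same pages in the same order, and the UID can be found with a regular expression. The pattern below is purely hypothetical, since the real UID format isn't given here, and the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical UID format; replace with the pattern used in the real forms.
UID_PATTERN = re.compile(r"UID[:\s]*(\w+)")


def merge_page_texts(primary, fallback):
    # Per page, keep the primary text unless it is blank, then use the fallback.
    # zip() stops at the shorter list, so both lists should have equal length.
    return [p if p.strip() else f for p, f in zip(primary, fallback)]


def pages_by_uid(page_texts):
    # Map each UID to the zero-based indices of the pages it appears on.
    groups = defaultdict(list)
    for index, text in enumerate(page_texts):
        match = UID_PATTERN.search(text)
        if match:
            groups[match.group(1)].append(index)
    return dict(groups)
```

The resulting page indices per UID could then be handed to pyPdf to write one output file per UID.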
