0

I am trying to read some text from a PDF file. I am using the code below; however, when I try to get the text (ptext), all that is returned is a string variable of size 1, and it's empty.

Why is no text being returned? I have tried other pages and another PDF book, but the result is the same: I can't seem to read any text.

import PyPDF2

file = open(r'C:/Users/pdfs/test_file.pdf', 'rb')
fileReader = PyPDF2.PdfFileReader(file)

pageObj = fileReader.getPage(445)
ptext = pageObj.extractText()
mHelpMe
  • Do these PDFs contain text? Check that – do not say "of course, because I can see it!" – Jongware Feb 22 '20 at 11:35
  • From the [extractText docs](https://pythonhosted.org/PyPDF2/PageObject.html#PyPDF2.pdf.PageObject.extractText): "*This works well for some PDF files, but poorly for others, depending on the generator used.*". I've never had any success with PyPDF2 (especially with PDFs generated from MS Office). Try the alternatives here: [How to extract text from a PDF file?](https://stackoverflow.com/q/34837707/2745495). – Gino Mempin Feb 22 '20 at 11:36
  • @usr2564301 Stupid question here, but how do I know if it contains text? I mean, I can see words, but I guess that could be a scanned image? – mHelpMe Feb 22 '20 at 11:43
  • (1) Open with a canonical PDF reader such as Adobe's own. (2) Select text – if there is no text this step will fail. (3) Copy, paste into a text editor. If the text cannot be decoded, you get nothing or garbage. – Jongware Feb 22 '20 at 11:44
  • Have a look at [pdfreader](http://pdfreader.readthedocs.io/) – Maksym Polshcha Feb 23 '20 at 02:26
  • @MaksymPolshcha Thanks, I just tried installing pdfreader. Have you come across a problem with it being unable to uninstall bitarray? The error says it is a distutils installed project, and thus we cannot accurately determine which files belong to it, which would lead to only a partial uninstall. – mHelpMe Feb 23 '20 at 11:14
  • @mHelpMe Nope I have not. I always use it with virtualenv. You can submit an issue here http://github.com/maxpmaxp/pdfreader/issues – Maksym Polshcha Feb 23 '20 at 15:16
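
The manual check described in the comments (select the text, then copy and paste it) can also be approximated in code. The following is only a rough sketch, assuming the `pypdf` package (the maintained successor to PyPDF2) is installed; the function names `pages_without_text` and `extract_page_texts` are made up for illustration. If most pages come back empty, the PDF is probably a scan without a text layer and would need OCR instead.

```python
import sys


def pages_without_text(page_texts):
    """Return the indices of pages whose extracted text is empty or whitespace."""
    return [i for i, text in enumerate(page_texts) if not (text or "").strip()]


def extract_page_texts(pdf_path):
    """Extract the text of every page with pypdf (assumed to be installed)."""
    # Imported lazily so that pages_without_text() works without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__" and len(sys.argv) > 1:
    texts = extract_page_texts(sys.argv[1])
    empty = pages_without_text(texts)
    print(f"{len(empty)} of {len(texts)} pages returned no text")
```

An empty result for a single page is not conclusive on its own (the page may simply be blank or contain only figures), so it is worth looking at the count across the whole file.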

2 Answers

1

I had the same issue and thought something was wrong with my code. After some research and debugging, it seems that the PyPDF2, PyPDF3 and PyPDF4 packages can't handle large files. In my tests, a 20-page PDF ran seamlessly, but a 50+ page PDF made PyPDF crash.

My only suggestion would be to use a different package altogether. pdftotext is a good option. Install it with pip install pdftotext.
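
For reference, reading a single page with pdftotext might look like the sketch below. It assumes the `pdftotext` package and the Poppler libraries it depends on are installed; the helper names `read_pdf_pages` and `get_page_text` are hypothetical. Note that pdftotext reads the PDF's text layer, so it will not recover text from a scanned image either.

```python
def get_page_text(pages, page_index):
    """Return the text of one page, or an empty string if the index is out of range."""
    if 0 <= page_index < len(pages):
        return pages[page_index]
    return ""


def read_pdf_pages(pdf_path):
    """Return a list with the text of each page, extracted by pdftotext."""
    # Imported lazily so that get_page_text() works without pdftotext.
    import pdftotext

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return [page for page in pdf]


# Example usage (the path is the one from the question):
# pages = read_pdf_pages(r'C:/Users/pdfs/test_file.pdf')
# ptext = get_page_text(pages, 445)
```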

Alister Baroi
1

I faced a similar issue while reading my PDF files; I hope the solution below helps. The reason I faced this issue: the PDF I was selecting was actually a scanned image. I created my resume using a third-party site, which returned a PDF. When parsing this type of file, I was not able to extract text directly.

Below is the tested, working code:

from PIL import Image
import pytesseract
from pdf2image import convert_from_path
import os
  
def readPdfFile(filePath):  
    pages = convert_from_path(filePath, 500)
    image_counter = 1
    #Part #1 : Converting PDF to images
    for page in pages:
        filename = "page_"+str(image_counter)+".jpg"
        page.save(filename, 'JPEG')
        image_counter = image_counter + 1
        
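    #Part #2 - Recognizing text from the images using OCR
    filelimit = image_counter-1 # Variable to get count of total number of pages
    text = "" # Accumulates the text of all pages
  
    for i in range(1, filelimit + 1):
        filename = "page_"+str(i)+".jpg"
        # Close each image after OCR so the temp file can be deleted afterwards
        with Image.open(filename) as img:
            page_text = str(pytesseract.image_to_string(img))
        # Re-join words that were hyphenated across line breaks
        text += page_text.replace('-\n', '')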

    #Part 3 - Remove those temp files
    image_counter = 1
    for page in pages:
        filename = "page_"+str(image_counter)+".jpg"
        os.remove(filename)
        image_counter = image_counter + 1
    return text
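
As a possible variation on the code above, the temporary JPEG files can be skipped, because `convert_from_path` returns PIL images and `pytesseract.image_to_string` accepts them directly. This is only a sketch under the same assumptions (pdf2image, Poppler, pytesseract and the Tesseract binary installed); the names `ocr_images` and `read_scanned_pdf` and the default of 300 DPI are choices made for this example, not part of the original answer.

```python
def ocr_images(images, ocr):
    """Run the `ocr` callable on each image and join the per-page results."""
    parts = []
    for image in images:
        # Re-join words that were hyphenated across line breaks
        parts.append(ocr(image).replace("-\n", ""))
    return "\n".join(parts)


def read_scanned_pdf(pdf_path, dpi=300):
    """OCR every page of a scanned PDF in memory, without temp files."""
    # Imported lazily so that ocr_images() works without these packages.
    import pytesseract
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_images(images, pytesseract.image_to_string)
```

Passing the OCR function in as a parameter keeps the joining logic separate from Tesseract, which makes it easy to try with a stand-in function. Rendering a large PDF at high DPI holds every page image in memory at once, so for long documents the file-based approach or a lower DPI may be the safer choice.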