Can't read the content of a certain page of a pdf file available online

Question

I've used the PyMuPDF library to parse the content of a specific page of a local PDF file, and it works. However, when I apply the same logic to a PDF file that is available online, I get an error.

The following script works (local PDF):

import fitz

path = r'C:\Users\WCS\Desktop\pymupdf\Regular Expressions Cookbook.pdf'

doc = fitz.open(path)
page1 = doc.loadPage(5)
page1text = page1.getText("text")
print(page1text)

The script below throws an error (PDF available online):

import fitz
import requests

URL = 'https://buildmedia.readthedocs.org/media/pdf/pdfminer-docs/latest/pdfminer-docs.pdf'

res = requests.get(URL)
doc = fitz.open(res.content)
page1 = doc.loadPage(5)
page1text = page1.getText("text")
print(page1text)

Error that the script encounters:

Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\general_demo.py", line 8, in <module>
    doc = fitz.open(res.content)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python37-32\lib\site-packages\fitz\fitz.py", line 2010, in __init__
    _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
RuntimeError: cannot open b'%PDF-1.5\n%\xd0\xd4\xc5\xd8\n1 0 obj\n<<\n/Length 843       \n/Filter /FlateDecode\n>>\nstream\nx\xdamUMo\xe20\x10\xbd\xe7Wx\x0f\x95\xda\x03\xc5N\xc8W\x85\x90\x9c\x84H\x1c\xb6\xad\nZ\xed\x95&\xa6\x8bT\x12\x14\xe0\xd0\x7f\xbf~3\x13\xda\xae\xf

How can I read the content directly from the online file, without saving it locally first?

How did you resolve the issue? I am using `doc = fitz.open(stream=bytes_pdf, filetype="application/pdf")`, where `bytes_pdf` looks like `b'%PDF-1.4\n1'`, but I am still facing the issue. – Amit Pathak Oct 30 '20 at 09:08

score 8 · Accepted Answer · edited Nov 23 '20 at 10:29

Looks like you need to initialize the object with stream:

>>> # from memory
>>> doc = fitz.open(stream=mem_area, filetype="pdf")

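Here mem_area holds the raw bytes of the document (res.content in your script). Passing the bytes as the first positional argument fails because that argument is treated as a filename, which is why the error says it "cannot open" the bytes.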

https://pymupdf.readthedocs.io/en/latest/document.html#Document

answered Aug 16 '19 at 20:59 by Sergio Pulgarin · edited Nov 23 '20 at 10:29 by LoopsGod

Yes, this is exactly it. Thanks a lot @Sergio Pulgarin. – MITHU Aug 16 '19 at 21:08
the docs have moved to here: https://pymupdf.readthedocs.io/en/latest/document.html#Document – chia berry Aug 11 '20 at 21:08
Does `mem_area` correspond to `res.content` in the OP question ? – Takamura Nov 28 '22 at 10:52
Yes, the data bytes. – Sergio Pulgarin Feb 28 '23 at 06:29

score -1 · Answer 2 · answered May 13 '22 at 17:59

I think you were missing the read() call: it returns the file's contents as bytes, which PyMuPDF can then consume through the stream argument.

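import fitz  # PyMuPDF

# uploaded_pdf is assumed to be a binary file-like object (e.g. an uploaded file).
with fitz.open(stream=uploaded_pdf.read(), filetype="pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()  # named getText() in older PyMuPDF releases
    print(text)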

answered May 13 '22 at 17:59 by AKASH GUDADHE
