
My Python 3 script runs on a web server and receives a PDF file sent to it over the internet. The PDF file therefore already exists in RAM as the content of a variable, which is a bytestring:

pdf_content = b'<placeholder for the entire pdf-document>'

Soon after receiving the file it will be stored on the server's hard drive:

with open('path/to/file.pdf', 'wb') as writer:
    writer.write(pdf_content)

But I also want to get the number of pages of the pdf-file:

num_pages = get_num_pages(pdf_content)

Here is my question:

What is the fastest and most reliable way to get the number of pages of a PDF document that is already in RAM as a bytestring?

In other words: how do I fill in the body of this function?

def get_num_pages(pdf_content):
    # do something
    return num_pages

This is fast but unreliable:

The following solution uses a regular expression to find all occurrences of the string /Page followed by a non-word character, and returns the number of matches. (The variable findings is a list.)

import re

def get_num_pages(pdf_content):
    findings = re.findall(rb'/Page\W', pdf_content)
    return len(findings)

The problem with this solution is that not all PDF documents contain one instance of the string /Page per page. My own master's thesis is 101 pages long, but the string /Page appears nowhere in it. So this function reports that my master's thesis is 0 pages long, which is wrong.

My master's thesis was created with a LaTeX editor and can be opened with any PDF reader. If you want to inspect it, you can download it from here.
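One possible explanation, not verified against the thesis file itself, is that the page objects sit inside compressed streams, so the literal text /Page never appears in the raw bytes. The sketch below only illustrates that effect with zlib; the `raw` bytes are a hypothetical stand-in for page objects, not a real PDF.

```python
import re
import zlib

# Hypothetical page objects, as they might appear uncompressed in a simple PDF.
raw = b"<< /Type /Page >>\n" * 3

# The same bytes after deflate compression, as a PDF writer might store them.
compressed = zlib.compress(raw)

pattern = rb"/Page\W"

# Matches in the uncompressed bytes: one per hypothetical page object.
visible = len(re.findall(pattern, raw))

# Matches after decompressing again: the information is still there.
restored = len(re.findall(pattern, zlib.decompress(compressed)))

# Matches when searching the compressed bytes directly, which is what the
# regex approach does when a PDF stores its objects in compressed form.
hidden = len(re.findall(pattern, compressed))

print(visible, restored, hidden)
```

If this is what happens in a given file, a byte-level search cannot be trusted, and a real PDF parser is needed to get the page count.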


This works correctly but is slow:

# use PyPDF2
from PyPDF2 import PdfFileReader

# receive the content via internet
pdf_content = b'<placeholder for the entire pdf-document>'

# write it to the hard disk (this is what I want to do anyway):
with open('path/to/file.pdf', 'wb') as writer:
    writer.write(pdf_content)

# read it again from hard disk (this is the problem):
pdf = PdfFileReader(open('path/to/file.pdf','rb'))

# retrieve the number of pages:
num_pages = pdf.getNumPages()

This solution gives the correct number of pages, even for my master's thesis, but it requires reading the content back from the hard disk, which is slow. As said before, the content is already completely in RAM as a bytestring. But PdfFileReader doesn't accept a bytestring as an argument. This raises an error:

pdf_content = b'<placeholder for the entire pdf-document>'
pdf = PdfFileReader(pdf_content)

---
Traceback (most recent call last):
  File "./test.py", line 20, in <module>
    pdf = PdfFileReader(pdf_content)
  File "/usr/local/lib/python3.6/dist-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/usr/local/lib/python3.6/dist-packages/PyPDF2/pdf.py", line 1689, in read
    stream.seek(-1, 2)
AttributeError: 'bytes' object has no attribute 'seek'

And the PyPDF2 documentation doesn't list a method for converting a plain bytestring into an object that provides the seek method.

So, here again is my question: What is the fastest and most reliable way to get the number of pages of a PDF document that is already in RAM as a bytestring?

Hubert Schölnast
  • Check out this other SO question and answer for how to use `io.BytesIO` for your purpose: https://stackoverflow.com/a/47801913/42346 – mechanical_meat May 08 '21 at 17:18
  • if `PdfFileReader()` gets `open()` then you can use `PdfFileReader(io.BytesIO(pdf_content))` – furas May 08 '21 at 19:19

1 Answer


If a function works with a file handle created by open()

 handler = open(...)
 PdfFileReader(handler)

then it can also work with a file-like object created by io.BytesIO() (for bytes, as with a PDF) or io.StringIO() (for text)

 handler = io.BytesIO(pdf_content)
 PdfFileReader(handler)

Doc: io
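Put together, the function from the question could look roughly like the sketch below. It assumes the legacy PyPDF2 names used in the question (PdfFileReader, getNumPages); as far as I know, newer releases of the library renamed them to PdfReader and len(reader.pages), so check your installed version. The placeholder bytes in the second half are hypothetical and only demonstrate that io.BytesIO supports the seek call that failed in the traceback.

```python
import io


def get_num_pages(pdf_content):
    """Count the pages of a PDF that is held in memory as a bytes object."""
    # Imported inside the function so the io.BytesIO demo below runs even
    # when the library is not installed. Legacy PyPDF2 names, as used in
    # the question; newer releases may use different names.
    from PyPDF2 import PdfFileReader

    # Wrap the bytes in a file-like object; nothing is read from disk.
    stream = io.BytesIO(pdf_content)
    return PdfFileReader(stream).getNumPages()


# io.BytesIO offers the file methods that a plain bytes object lacks.
# The content here is a hypothetical placeholder, not a valid PDF.
stream = io.BytesIO(b"%PDF-1.4 hypothetical placeholder bytes")
stream.seek(-1, 2)           # same call that raised AttributeError on bytes
last_byte = stream.read(1)   # reads the final byte of the buffer
```

Writing the file to disk can still happen separately with the same pdf_content; the page count no longer depends on reading it back.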

furas