I tried to follow "Getting PDF Version using Python" to extract the version from a PDF file, but it resulted in an error.

I'm new to Python and have no idea how to fix this. When I open the PDF file in something like Notepad, I can see the version on the first line, something like %PDF-1.4, but I don't know how to extract it.

The code I used was as follows:

from PyPDF2 import PdfReader
doc = PdfReader(filepath)
doc.stream.seek(0) # Necessary since the comment is ignored for the PDF analysis
print(doc.stream.readline().decode())

I expected the result to be: %PDF-1.4

Instead, I received this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

Marcin Orlowski
Neil
  • Well, there is no need to create a PdfReader here. You're just reading the first line of a file which happens to be a PDF. So you can use the ordinary file reading facilities of Python for this. Not sure why you would want to, though - PyPDF2 almost certainly has a function to return the PDF version. (In any event, a PDF header can be preceded by up to 1024 bytes of junk, so your code isn't going to work for all PDFs anyway). – johnwhitington May 04 '23 at 19:54
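The approach from the comment above can be sketched with plain file reading. This is only a sketch under a few assumptions: the names find_pdf_header and read_pdf_header are made up for illustration, the search window of 1024 bytes is taken from the comment, and the sample bytes are invented. A header line such as %PDF-1.4 that ends in a bare carriage return and is followed by a binary comment would plausibly explain the error in the question: readline() reads up to the next "\n", so the non-UTF-8 byte 0xe2 lands at position 10 of the "line". Searching the raw bytes avoids decoding anything except the ASCII header itself.

```python
import re

# Matches a header such as b"%PDF-1.4" or b"%PDF-2.0" in raw bytes.
HEADER_RE = re.compile(rb"%PDF-\d\.\d")


def find_pdf_header(data: bytes):
    """Return the header text (e.g. '%PDF-1.4') found in the first 1024 bytes, or None."""
    match = HEADER_RE.search(data[:1024])
    return match.group(0).decode("ascii") if match else None


def read_pdf_header(path):
    """Read only the start of the file in binary mode and look for the header there."""
    with open(path, "rb") as handle:
        return find_pdf_header(handle.read(1024))


# Invented sample: CR line ending followed by a binary comment, then the first object.
sample = b"%PDF-1.4\r%\xe2\xe3\xcf\xd3\n1 0 obj\n"
print(find_pdf_header(sample))
```

For a real file, read_pdf_header(filepath) would be called with the path from the question; it returns None if no header is found in the searched window.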

1 Answer

PyPDF2 is deprecated. Use pypdf:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
print(reader.pdf_header)

gives something like '%PDF-1.4'.
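If the version is needed as numbers rather than as text, the header string can be split by hand. A minimal sketch, assuming the string has the form '%PDF-<major>.<minor>'; the helper name parse_pdf_version is made up here and is not part of pypdf:

```python
def parse_pdf_version(header: str):
    """Turn a header string like '%PDF-1.4' into a tuple like (1, 4)."""
    prefix = "%PDF-"
    if not header.startswith(prefix):
        raise ValueError(f"not a PDF header: {header!r}")
    # Everything after the prefix is expected to look like '1.4'.
    major, minor = header[len(prefix):].strip().split(".", 1)
    return int(major), int(minor)


# Tuples compare element by element, so version checks read naturally.
print(parse_pdf_version("%PDF-1.4") >= (1, 4))
```

The result of reader.pdf_header could be passed to this helper in place of the literal string.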

However, I don't see any reason to do that. PDF generators typically just write a constant into the file, which means the version can be wrong in two directions:

  • The PDF document's header claims version X, but the document actually uses newer features (this should rarely be the case). The header should then state a newer version Y, with X < Y.
  • The PDF document's header claims version X, but the document actually uses only older features. The header could then state an older version Y, with X > Y.

I don't know whether any software actually makes use of the declared version.

Martin Thoma