I am trying to read the PDF below programmatically with Python to extract useful information.
Here, the "attachments" are links that point to specific pages inside the same PDF. I learned that these are called "annots" and that they can be extracted with the PyPDF2 library.
My end goal is to read the whole PDF attachment by attachment, where each attachment may span multiple pages. Here is what I've tried:
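    from PyPDF2 import PdfReader

    # Create a PDF reader object (pdfFileObj is the PDF opened in binary mode)
    pdfReader: PdfReader = PdfReader(pdfFileObj)

    # Read annots from the first two pages
    start = 0
    end = 2
    while start < end:
        try:
            for annot in pdfReader.pages[start]["/Annots"]:
                print(annot.get_object())  # (1); getObject() in older PyPDF2 versions
                print("")
        except KeyError:
            # there are no annotations on this page
            pass
        start += 1  # move to the next page; without this the loop never ends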
I was hoping that annot.getObject() or something like annot.extract_text() would give me the full content of the relevant pages, but it does not. Inspecting the annot object doesn't provide any useful information either.
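To make the end goal concrete, here is a minimal sketch of the page grouping I am after. It assumes I can somehow obtain the 0-based page index each attachment link points to, and that an attachment runs from its target page up to the page before the next attachment starts. That second assumption is my guess about how the document is laid out. The function name attachment_page_ranges and its parameters are hypothetical and not part of PyPDF2.

```python
from typing import List, Tuple


def attachment_page_ranges(start_pages: List[int], total_pages: int) -> List[Tuple[int, int]]:
    """Return one (first, last) pair of 0-based page indices per attachment.

    Assumption: an attachment ends on the page just before the next
    attachment begins, and the final attachment runs to the end of the PDF.
    """
    # Several links may point at the same page, so deduplicate and sort
    starts = sorted(set(start_pages))
    ranges = []
    for i, first in enumerate(starts):
        if i + 1 < len(starts):
            last = starts[i + 1] - 1
        else:
            last = total_pages - 1
        ranges.append((first, last))
    return ranges


# Hypothetical example: links pointing at pages 2, 5 and 9 of a 12-page PDF
print(attachment_page_ranges([2, 5, 9], 12))
# [(2, 4), (5, 8), (9, 11)]
```

With ranges like these, the text of each attachment could presumably be collected by calling extract_text() on every page in pdfReader.pages within the range. What I am missing is how to get the target page indices from the annotations in the first place.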