Extract text based on annots from PDF using Python and PyPDF2

Question

I am trying to read the below PDF programmatically using Python to extract useful information.

Here, the "attachments" are basically links that point to specific pages inside the same PDF. I learned that these are called "annots" and that there is a way to extract them using the PyPDF2 library.

My end goal is to read the whole PDF attachment by attachment, where each attachment could span multiple pages. Here is what I've tried:

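from PyPDF2 import PdfReader

# creating a pdf reader object (pdfFileObj is the PDF opened in binary mode)
pdfReader: PdfReader = PdfReader(pdfFileObj)

# Read annots from the first two pages
start = 0
end = 2
while start < end:
    page = pdfReader.pages[start]
    # pages without annotations have no "/Annots" key
    if "/Annots" in page:
        for annot in page["/Annots"]:
            print(annot.get_object())  # (1) resolve the indirect reference
            print("")
    start += 1  # move on to the next page, otherwise the loop never ends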

I was hoping that annot.getObject() (get_object() in newer PyPDF2 releases) or something like annot.extract_text() would give me the full content of the relevant pages, but it does not. Going through the annot object doesn't provide any useful information.

@KJ that's the first thing I did as I was 99% sure these are just a bunch of attachments bundled together. However, that's not the case sadly: https://github.com/py-pdf/pypdf/issues/1645#issuecomment-1437876996 I'm at the point of scratching my last few remaining hairs. I don't want to end up writing a hacky spaghetti half-perfect code. — saran3h, Feb 22 '23 at 04:43

K J · Answer 1 · 2023-02-25T05:04:03.457

In the sample, the annotation attachment symbols are a red herring. It looks like the files were attached in the past but then flattened into the whole PDF. A secondary issue adding to the annotation confusion is that many of the 409 internal and external linked filenames, when downloaded, did not agree between internal and external nomenclature, which is usually a sign of poor merging. There are 372 different fonts included, which shows that the one file has been merged from many. It was possibly a former "portfolio" composed of different attached files, but these are now simply appended as pages.

From either the field data file or the PDF we can see which is the first page to extract for each attachment:

type C:\Apps\PDF\2023-02-09-04-31-12.fdf |find "/D ["|more

So the destinations start at 2, 4, 6, 8, 9; thus some attachments are 2 pages, some are 1, and some may be many.

/D [2 /FitH 10000]
/D [2 /FitH 10000]
/D [2 /FitH 10000]
/D [4 /FitH 10000]
/D [4 /FitH 10000]
/D [4 /FitH 10000]
/D [6 /FitH 10000]
/D [6 /FitH 10000]
/D [6 /FitH 10000]
/D [8 /FitH 10000]
/D [8 /FitH 10000]
/D [8 /FitH 10000]
/D [9 /FitH 10000]
/D [9 /FitH 10000]
/D [9 /FitH 10000]
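
The destination list above could be turned into one page range per attachment along these lines. This is only a rough sketch: it assumes each attachment runs from its destination page up to the page just before the next destination, and that the last one runs to the end of the file. The function names and the last_page=12 value are made up for illustration and are not taken from the sample PDF; whether the numbers are 0-based or 1-based should be checked against the file before extracting anything.

```python
import re

# Matches entries such as "/D [2 /FitH 10000]" and captures the page number.
DEST_RE = re.compile(r"/D\s*\[\s*(\d+)\s*/")


def destination_starts(text):
    """Return the sorted, de-duplicated page numbers found in '/D [N ...]' entries."""
    return sorted({int(n) for n in DEST_RE.findall(text)})


def page_ranges(starts, last_page):
    """Pair each start page with the page just before the next start.

    The final range is assumed to run up to last_page (inclusive).
    """
    ranges = []
    for i, start in enumerate(starts):
        end = starts[i + 1] - 1 if i + 1 < len(starts) else last_page
        ranges.append((start, end))
    return ranges


# A shortened copy of the listing above.
sample = """
/D [2 /FitH 10000]
/D [2 /FitH 10000]
/D [4 /FitH 10000]
/D [6 /FitH 10000]
/D [8 /FitH 10000]
/D [9 /FitH 10000]
"""

starts = destination_starts(sample)
print(starts)
# last_page=12 is a hypothetical page count, used only for the demo.
print(page_ranges(starts, last_page=12))
```

Each resulting (first, last) pair can then be fed to whatever text extractor you prefer, page by page, which avoids relying on the flattened annotations at all.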
