
I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested:

One using PyPDF2:

import pandas as pd
import PyPDF2

src = 'xxxx.pdf'
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()

df_comments = pd.DataFrame()
for i in range(nPages) :
    annotation = []
    page = []
    page0 = input1.getPage(i)
    try :
        for annot in page0['/Annots'] :
            annotation.append(annot.getObject())
        page = [i+1] * len(annotation)
        page = pd.DataFrame(page)
        annotation = pd.DataFrame(annotation)
        df_temp = pd.concat([page, annotation], axis=1)
        df_comments = pd.concat([df_comments, df_temp], ignore_index=True)
    except KeyError:
        # no '/Annots' entry, so there are no annotations on this page
        pass

and the other using fitz:

import fitz
doc = fitz.open(src)
for page in doc:  # iterating the document yields its pages in order
    for annot in page.annots():
        print(annot.info)

The comments are extracted, but when I compare the output with the PDF, they are not in the order in which they appear in the document. I have looked at other fields, such as the creation date and the modification date, but they did not help.

Is there a way to extract the comments in the order in which they appear in the PDF? Or can I also extract the text in the PDF that each comment is attached to?

Debadri Dutta
  • 3
    A PDF is instructions to a printer for how to print a document. It's not structured for easy scraping. It doesn't say "here's a table of values, print them", "here's a header and footer, print them". It's just a mass of drawing instructions for lines, glyphs and bitmaps. The fact that PyPDF2 is able to achieve anything at all is almost a miracle. Whatever you manage to get out, it's up to you to make sense of it. – Peter Wood Jul 06 '21 at 07:41
  • 1
    What does "sequentially" mean? By the time they were added to the document? By "reading flow", so the position in the document? By the last time the annotation was edited? – Martin Thoma Jul 30 '22 at 10:32

1 Answer


I'm the current maintainer of PyPDF2.

The annotations are currently extracted in the order in which they are stored in each page's /Annots array, which is not necessarily the order in which they appear visually on the page.
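One possible way to approximate "reading order" is to sort by page, then top to bottom, then left to right, using each annotation's /Rect entry. The sketch below is only an illustration, not something PyPDF2 does for you. It assumes unrotated, single-column pages in default PDF user space (origin at the bottom left, y growing upward), and the "page" key is a hypothetical field that you would add yourself while looping over the pages.

```python
def reading_order_key(page_number, rect):
    """Sort key: page first, then top-to-bottom, then left-to-right."""
    x0, y0, x1, y1 = (float(v) for v in rect)
    top = max(y0, y1)   # PDF user space: a larger y is higher up on the page
    left = min(x0, x1)
    return (page_number, -top, left)


def sort_annotations(annotations):
    """Return the annotation dicts sorted in approximate reading order."""
    return sorted(
        annotations,
        key=lambda a: reading_order_key(a["page"], a["/Rect"]),
    )


# Made-up sample data shaped like annotation dictionaries.
# "page" is a hypothetical key added during extraction (i + 1 in the loop).
annots = [
    {"page": 1, "/Rect": [72, 100, 200, 120], "/Contents": "bottom of page 1"},
    {"page": 2, "/Rect": [72, 700, 200, 720], "/Contents": "top of page 2"},
    {"page": 1, "/Rect": [72, 700, 200, 720], "/Contents": "top of page 1"},
    {"page": 1, "/Rect": [300, 700, 400, 720], "/Contents": "top right of page 1"},
]

ordered = [a["/Contents"] for a in sort_annotations(annots)]
print(ordered)
```

Multi-column layouts or rotated pages would need a smarter key, which is part of why there is no single obvious ordering.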

If you have a sensible way to sort them, feel free to open a feature request in the PyPDF2 issue tracker on GitHub.
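As for the second part of the question (getting the text a comment is attached to), one rough approach is to intersect the annotation rectangle with word bounding boxes. The sketch below is plain Python and assumes you already have the words as tuples that start with (x0, y0, x1, y1, text), which is the shape PyMuPDF's page.get_text("words") returns, and an annotation rectangle in the same coordinate system. The function name and the min_overlap threshold are hypothetical choices for illustration. It only makes sense for annotations that actually cover text, such as highlights or underlines. A sticky-note comment has a small icon rectangle that usually does not overlap the text it refers to.

```python
def words_under_rect(words, rect, min_overlap=0.5):
    """Join the words whose boxes are mostly covered by the annotation rect.

    words: iterable of tuples starting with (x0, y0, x1, y1, text)
    rect:  (x0, y0, x1, y1) of the annotation, same coordinate system
    min_overlap: fraction of a word's area that must lie inside rect
    """
    rx0, ry0, rx1, ry1 = rect
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        # width and height of the intersection of the two rectangles
        w = min(x1, rx1) - max(x0, rx0)
        h = min(y1, ry1) - max(y0, ry0)
        if w <= 0 or h <= 0:
            continue  # no overlap at all
        area = (x1 - x0) * (y1 - y0)
        if area > 0 and (w * h) / area >= min_overlap:
            hits.append(text)
    return " ".join(hits)


# Made-up sample data: two lines of words and one highlight rectangle.
words = [
    (72, 100, 100, 112, "Hello", 0, 0, 0),
    (104, 100, 140, 112, "annotated", 0, 0, 1),
    (144, 100, 170, 112, "world", 0, 0, 2),
    (72, 116, 110, 128, "next", 0, 1, 0),
]
highlight = (103, 99, 171, 113)

under = words_under_rect(words, highlight)
print(under)
```

The overlap threshold is there because highlight rectangles often bleed slightly into neighbouring lines, so a plain "intersects at all" test tends to pick up extra words.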

Martin Thoma