I was surprised to find that when I printed a PDF I had annotated with Okular, the printout lacked the annotations even though they show on screen. I have to save the annotated file as a printed PDF first, then print that.

Question: how can I list all pdfs having at least one annotation on at least one page?

Apparently, pdfinfo reports AcroForm when there is an annotation, so I tried

            find -type f -iname "*.pdf" -exec pdfinfo {} \;

but that does not display the filename.

I'm not familiar with qpdf, but it does not seem to provide this information.

Thanks

user2718593

1 Answer

Using pdfinfo you can say,

find . -type f -iname '*.pdf' | while IFS= read -r pn
do  pdfinfo "$pn" |
    grep -q '^Form: *AcroForm' && printf '%s\n' "$pn"
done

to list the names of PDF files for which pdfinfo reports:

Form:           AcroForm
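
For anyone who would rather drive this check from Python, here is a minimal sketch of the same test. The function names are made up for illustration, and it assumes pdfinfo is on the PATH and prints the Form line in the format shown above.

```python
import re
import subprocess

# Same pattern as the grep above: a line starting with "Form:" then "AcroForm".
FORM_LINE = re.compile(r"^Form: *AcroForm", re.MULTILINE)


def reports_acroform(pdfinfo_output: str) -> bool:
    """Return True if pdfinfo's text output contains a 'Form: AcroForm' line."""
    return FORM_LINE.search(pdfinfo_output) is not None


def pdfinfo_reports_acroform(path: str) -> bool:
    """Run pdfinfo on one file (assumed to be on PATH) and test its output."""
    result = subprocess.run(["pdfinfo", path], capture_output=True, text=True)
    return result.returncode == 0 and reports_acroform(result.stdout)
```

Keeping the text test separate from the subprocess call makes it easy to try the pattern on saved pdfinfo output.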

However, in my tests this misses several PDFs that have text annotations and lists several that have none, so I'd avoid it for this job. Below are two alternatives: qpdf supports all annotation subtypes, while python3-poppler-qt5 supports only a subset but can be much faster.

(For a non-POSIX shell adapt the commands in this posting.)

EDIT: find constructs edited to avoid unsafe and GNU-reliant {}s.


qpdf versions since 8.3.0 support a JSON representation of non-content PDF data, and if you're on a system with the jq JSON processor you can list unique PDF annotation types as tab-separated values (in this case discarding the output and using the exit code only):

find . -type f -iname '*.pdf' | while IFS= read -r pn
do  qpdf --json --no-warn -- "$pn" |
    jq -e -r --arg typls '*' -f annots.jq > /dev/null &&
    printf '%s\n' "$pn"
done

where

  • --arg typls '*' specifies desired annotation subtypes, e.g. * for all (the default), or Text,FreeText,Link for a selection
  • -e sets exit code 4 if no output was made (no annotations found)
  • -r produces raw (non-JSON) output
  • the jq script file annots.jq contains the following
#! /usr/bin/env jq-1.6
def annots:
    ( if ($typls | length) > 0 and $typls != "*"
      then $typls
      else
        # annotation types, per Adobe's PDF Reference 1.7 (table 8.20)
        "Text,Link,FreeText,Line,Square,Circle,Polygon"
        + ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
        + ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
        + ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
      end | split(",")
    ) as $whitelist
    | .objects
    | .[]
    | objects
    | select( ."/Type" == "/Annot" )
    | select( ."/Subtype" | .[1:] | IN($whitelist[]) )
    | ."/Subtype" | .[1:]
    ;
[ annots ] | unique as $out
| if ($out | length) > 0 then ($out | @tsv) else empty end
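
For readers less used to jq, the filter's logic can be sketched in Python as below. This is only an illustration: the function name is hypothetical, and it assumes the same JSON layout the jq script navigates, namely a top-level "objects" mapping whose values may be dictionaries carrying "/Type" and "/Subtype".

```python
# Default list copied from the jq script above (names without the leading "/").
DEFAULT_SUBTYPES = (
    "Text,Link,FreeText,Line,Square,Circle,Polygon"
    ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
    ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
    ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
).split(",")


def annotation_subtypes(qpdf_json, typls="*"):
    """Return the sorted, unique annotation subtypes found in parsed qpdf JSON.

    typls mirrors the jq --arg: "*" or "" selects the default list,
    otherwise it is a comma-separated selection such as "Text,Link".
    """
    wanted = DEFAULT_SUBTYPES if typls in ("", "*") else typls.split(",")
    found = set()
    for obj in qpdf_json.get("objects", {}).values():
        # Skip non-dictionary entries and anything that is not an annotation.
        if not isinstance(obj, dict) or obj.get("/Type") != "/Annot":
            continue
        subtype = obj.get("/Subtype")
        if isinstance(subtype, str) and subtype[1:] in wanted:
            found.add(subtype[1:])  # drop the leading "/"
    return sorted(found)
```

An empty result corresponds to the case where jq produces no output and exits with code 4. The input would come from parsing the output of qpdf --json with json.loads.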

For many purposes it's tempting to use Python with python3-poppler-qt5 to handle the entire file list in one go,

find . -type f -iname '*.pdf' -exec python3 path/to/script -t 1,7 {} '+'

where the -t option lists the desired annotation subtypes, per poppler documentation; 1 is AText and 7 is ALink. Without -t all subtypes known to poppler (0 through 14) are selected, i.e. not all existing subtypes are supported.

#! /usr/bin/env python3.8
import popplerqt5

def gotAnnot(pdfPathname, subtypls):
    """Return True if any page has an annotation whose subtype is in subtypls."""
    pdoc = popplerqt5.Poppler.Document.load(pdfPathname)
    if pdoc is None:            ## file could not be loaded
        return False
    for pgindex in range(pdoc.numPages()):
        annls = pdoc.page(pgindex).annotations()
        if annls is not None and len(annls) > 0:
            for a in annls:
                if a.subType() in subtypls:
                    return True
    return False

if __name__ == "__main__":
    import sys, getopt
    typls = range(14+1)         ## default: all subtypes
    opts, args = getopt.getopt(sys.argv[1:], "t:")
    for o, a in opts:
        if o == "-t" and a != "*":
            typls = [int(c) for c in a.split(",")]
    for pathnm in args:
        if gotAnnot(pathnm, typls):
            print(pathnm)
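
The handling of the -t option can also be pulled out into a small helper that is testable without poppler. The sketch below uses a hypothetical function name and makes one assumption beyond the script above: it rejects numbers outside 0 through 14 rather than silently accepting them.

```python
ALL_SUBTYPES = list(range(14 + 1))  # 0 through 14, the script's default


def parse_subtypes(option_value):
    """Turn a -t argument such as "1,7" into a list of ints; "*" means all."""
    if option_value.strip() == "*":
        return list(ALL_SUBTYPES)
    selected = []
    for field in option_value.split(","):
        number = int(field)  # raises ValueError on non-numeric input
        if number not in ALL_SUBTYPES:
            raise ValueError(f"unknown annotation subtype: {number}")
        selected.append(number)
    return selected
```

With this in place, the option loop could assign typls = parse_subtypes(a), so a typo such as -t 17 fails early instead of matching nothing.
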
urznow