Using pdfinfo from poppler-utils you can say,
find . -type f -iname '*.pdf' | while IFS= read -r pn
do  pdfinfo "$pn" |
      grep -q '^Form: *AcroForm' && printf '%s\n' "$pn"
done
to list the names of PDF files for which pdfinfo reports:
Form: AcroForm
However, in my tests it misses several PDFs with text annotations
and lists several without, so I'd avoid it for this job. Below are two
alternatives: qpdf supports all annotation subtypes, while
python3-poppler-qt5 supports only a subset but can be much faster.
(For a non-POSIX shell adapt the commands in this posting.)
EDIT: find constructs edited to avoid unsafe and GNU-reliant {}s.
qpdf versions since 8.3.0 support a JSON representation
of non-content PDF data, and if you're on a system with the jq
JSON processor you can list unique PDF annotation types as
tab-separated values (in this case discarding the output and using
the exit code only):
find . -type f -iname '*.pdf' | while IFS= read -r pn
do  qpdf --json --no-warn -- "$pn" |
      jq -e -r --arg typls '*' -f annots.jq > /dev/null &&
      printf '%s\n' "$pn"
done
where
- --arg typls '*' specifies desired annotation subtypes, e.g. * for all (the default), or Text,FreeText,Link for a selection
- -e sets exit code 4 if no output was made (no annotations found)
- -r produces raw (non-JSON) output
- the jq script file annots.jq contains the following
#! /usr/bin/env jq-1.6
def annots:
  ( if ($typls | length) > 0 and $typls != "*"
    then $typls
    else
      # annotation types, per Adobe's PDF Reference 1.7 (table 8.20)
      "Text,Link,FreeText,Line,Square,Circle,Polygon"
      + ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
      + ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
      + ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
    end | split(",")
  ) as $whitelist
  | .objects
  | .[]
  | objects
  | select( ."/Type" == "/Annot" )
  | select( ."/Subtype" | .[1:] | IN($whitelist[]) )
  | ."/Subtype" | .[1:]
  ;
[ annots ] | unique as $out
| if ($out | length) > 0 then ($out | @tsv) else empty end
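
If jq isn't to hand, the same filter can be approximated in Python. This is only a sketch of what annots.jq does, assuming the JSON layout that script reads (a top-level "objects" map whose dictionary entries carry "/Type" and "/Subtype"). The function name annot_subtypes and the sample data are my own inventions for illustration, not part of qpdf or jq.

```python
import json

# Subtype names copied from the whitelist in annots.jq.
ALL_SUBTYPES = (
    "Text,Link,FreeText,Line,Square,Circle,Polygon"
    ",PolyLine,Highlight,Underline,Squiggly,StrikeOut"
    ",Stamp,Caret,Ink,Popup,FileAttachment,Sound,Movie"
    ",Widget,Screen,PrinterMark,TrapNet,Watermark,3D"
).split(",")


def annot_subtypes(qpdf_json, typls="*"):
    """Return the sorted, unique annotation subtypes found in qpdf's JSON."""
    if typls and typls != "*":
        wanted = set(typls.split(","))
    else:
        wanted = set(ALL_SUBTYPES)
    found = set()
    for obj in qpdf_json.get("objects", {}).values():
        # Like jq's `objects` filter: skip anything that is not a dictionary.
        if not isinstance(obj, dict) or obj.get("/Type") != "/Annot":
            continue
        subtype = obj.get("/Subtype")
        # Strip the leading "/" before comparing, as `.[1:]` does in jq.
        if isinstance(subtype, str) and subtype[1:] in wanted:
            found.add(subtype[1:])
    return sorted(found)


# Hypothetical, hand-written stand-in for `qpdf --json` output.
sample = json.loads("""
{"objects": {
  "1 0 R": {"/Type": "/Catalog"},
  "2 0 R": {"/Type": "/Annot", "/Subtype": "/Text"},
  "3 0 R": {"/Type": "/Annot", "/Subtype": "/Link"},
  "4 0 R": [1, 2, 3]
}}
""")
print("\t".join(annot_subtypes(sample)))
print(annot_subtypes(sample, "Text,FreeText"))
```

To use it on a real file you would feed it the parsed output of the qpdf command shown earlier; an empty result plays the role of jq's exit code 4.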
For many purposes it's tempting to use python-3.x with python3-poppler-qt5 to handle the entire file list in one go,
find . -type f -iname '*.pdf' -exec python3 path/to/script -t 1,7 {} '+'
where the -t option lists the desired annotation subtypes, per
poppler documentation; 1 is AText and 7 is ALink. Without -t
all subtypes known to poppler (0 through 14) are selected, which
still does not cover every existing PDF annotation subtype.
#! /usr/bin/env python3.8
import popplerqt5

def gotAnnot(pdfPathname, subtypls):
    """Return True if any page carries an annotation of a wanted subtype."""
    pdoc = popplerqt5.Poppler.Document.load(pdfPathname)
    if pdoc is None:    # load failed, e.g. not a readable PDF
        return False
    for pgindex in range(pdoc.numPages()):
        annls = pdoc.page(pgindex).annotations()
        if annls is not None and len(annls) > 0:
            for a in annls:
                if a.subType() in subtypls:
                    return True
    return False

if __name__ == "__main__":
    import sys, getopt
    typls = range(14+1)     ## default: all subtypes
    opts, args = getopt.getopt(sys.argv[1:], "t:")
    for o, a in opts:
        if o == "-t" and a != "*":
            typls = [int(c) for c in a.split(",")]
    for pathnm in args:
        if gotAnnot(pathnm, typls):
            print(pathnm)
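
Remembering the numeric codes for -t is a nuisance, so here is a small helper that builds the option value from names. Treat the table as an assumption: it is my reading of poppler-qt5's Annotation::SubType enum, and only 1 (AText) and 7 (ALink) are confirmed above, so check it against the documentation for your poppler version. The names SUBTYPE_CODES and t_option are made up for this sketch.

```python
# Numeric codes as I understand poppler-qt5's Annotation::SubType enum.
# ASSUMPTION: only AText (1) and ALink (7) are confirmed by the text above.
SUBTYPE_CODES = {
    "A_BASE": 0, "AText": 1, "ALine": 2, "AGeom": 3, "AHighlight": 4,
    "AStamp": 5, "AInk": 6, "ALink": 7, "ACaret": 8, "AFileAttachment": 9,
    "ASound": 10, "AMovie": 11, "AScreen": 12, "AWidget": 13,
    "ARichMedia": 14,
}


def t_option(*names):
    """Build the comma-separated value for -t from subtype names."""
    return ",".join(str(SUBTYPE_CODES[name]) for name in names)


print(t_option("AText", "ALink"))  # prints 1,7
```

An unknown name raises KeyError, which seems preferable to silently searching for the wrong subtype.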