
I am trying to extract annotations from a PDF and then use that data to cherry-pick the annotations we need, so that we can display them on a clean version of the PDF using the Adobe Embed API. Extracting the data from the PDF with PyMuPDF works fine, but when we apply an annotation to the clean PDF we get weird results.

First, parse the annotated PDF, Sandborn2003Annotated.pdf.

Then, using the Adobe Document Cloud Embed API (https://developer.adobe.com/document-services/docs/overview/pdf-embed-api/), apply the first annotation in the stored JSON to a clean version of the file, Sandborn2003.pdf.


*** Edit *** This is the output file: Sandborn 2003 (1).pdf


I expect the clean PDF to show the same annotation as the source.

Screenshots of before (source) and after (target):

[Screenshot: source annotated document]

[Screenshot: target PDF after applying the annotation]

My Environment

  • macOS Ventura 13.2.1
  • Python 3.10.9
  • PyMuPDF version - 1.21.1

This is my code:

import fitz
import json
import sys
from datetime import datetime

if len(sys.argv) != 2:
    print('Usage: python extractPDFAnnotations.py <filename>')
    sys.exit(1)
filename = sys.argv[1]

doc = fitz.open('Sandborn2003Annotated.pdf')
annotations = []

for page_num, page in enumerate(doc):
    for annot in page.annots():
        annotation_data = {}
        target_data = {}
        selector_data = {}

        # Common properties for all annotation types
        annotation_data["@context"] = [
            "https://www.w3.org/ns/anno.jsonld",
            "https://comments.acrobat.com/ns/anno.jsonld",
        ]
        annotation_data["id"] = annot.info.get("id", "")
        annotation_data["type"] = "Annotation"
        annotation_data["motivation"] = "commenting"
        annotation_data["bodyValue"] = annot.info.get("content", "")

        # Target properties
        # Replace this with the actual source identifier
        target_data["selector"] = selector_data
        annotation_data["target"] = target_data

        # Selector properties
        selector_data["node"] = {"index": page_num}
        selector_data["type"] = "AdobeAnnoSelector"

        if annot.type[0] == 8:  # Highlight annotation
            all_coordinates = annot.vertices
            highlights = []

            for i in range(0, len(all_coordinates), 4):
                quad = all_coordinates[i:i + 4]
                highlight_coord = fitz.Quad(quad).rect
                highlights.extend(
                    [highlight_coord.x0, highlight_coord.y0, highlight_coord.x1, highlight_coord.y1])

            selector_data["quadPoints"] = highlights
            # Adjust the opacity value as needed
            selector_data["opacity"] = 0.4
            selector_data["subtype"] = "highlight"
            selector_data["boundingBox"] = [
                annot.rect.x0,
                annot.rect.y0,
                annot.rect.x1,
                annot.rect.y1,
            ]
            # Adjust the stroke color as needed
            selector_data["strokeColor"] = "#fccb00"
            # Adjust the stroke width as needed
            selector_data["strokeWidth"] = 3

        # Creator properties
        annotation_data["creator"] = {
            # Replace this with the actual creator's name
            "name": annot.info.get("title", ""),
            "type": "Person",
        }

        annotations.append(annotation_data)

# Save the annotations to a JSON file
with open(filename, 'w') as f:
    json.dump(annotations, f, indent=4)

print(f'Annotations saved to {filename}')
Justin Erswell
  • Can you provide the output file with the re-added annotation? – johnwhitington Mar 29 '23 at 19:03
  • First observation: the output annotation QuadPoints appear to be flipped vertically, as well as at the wrong vertical position. So it is as if the whole page has been flipped vertically. This might make sense if the page had /Rotate 180 and this was not being taken into account, but the PDF is all /Rotate 0. Odd. – johnwhitington Mar 29 '23 at 19:05
  • @johnwhitington I have edited the question and added the file – Justin Erswell Mar 29 '23 at 19:19

1 Answer


The original annotation is:

{
      "/Veeva.Vault.Annot": {
        "U": "{\"instanceId\":1296,\"docVersionId\":223360,\"annotateKeyDate\":\"2020-06-30\",\"annotateKeyCode\":\"eM3jyxyu\",\"noteId\":\"1593513852533\",\"userId\":4744685}"
      },
      "/F": { "I": 128 },
      "/C": 155,
      "/Contents": {
        "U": "Anchor Name: Definition of fistula - Alofisel Scientific Communications Platform (2.0) p.2, Alofisel Scientific Communications Platform (2.0) p.8, Alofisel Scientific Platform (0.7) p.2, Alofisel Scientific Platform - Pillar 1 - Unmet needs and burden of disease (1.0) p.2, Alofisel Scientific Platform (0.6) p.8, Alofisel Scientific Platform (0.5) p.2, Alofisel Scientific Platform (0.5) p.8, Alofisel Scientific Communications Platform (0.10) p.8, Alofisel Scientific Platform (0.7) p.8, Voiceover script for animation (0.4) p.1, Alofisel Scientific Platform - Pillar 1 (0.3) p.2, Alofisel Scientific Communications Platform (1.0) p.2, Voiceover script for animation (0.2) p.1, Alofisel Scientific Communications platform - glossary (2.0) p.8, Alofisel Scientific Communications Platform (1.0) p.8, Alofisel Scientific Platform (0.3) p.2, Voiceover script for animation (0.5) p.1, Alofisel Scientific Platform - Pillar 1 - Unmet needs and burden of disease (1.0) p.2, Darvadstrocel_RWE-Daten_22 (0.2) p.5, Darvadstrocel_RWE-Daten_22 (1.0) p.5, Alofisel Scientific Platform (0.6) p.2, Crohn's perianal fistulas - Pathophysiology animation for congress use (1.0) p.1, Alofisel Scientific Platform (0.8) p.8, Voiceover script for animation (0.5) p.1, Alofisel Scientific Platform (0.8) p.2, Voiceover script for animation (1.0) p.1, Alofisel Scientific Platform (0.4) p.8, Voiceover script for animation (1.0) p.1, Alofisel Scientific Platform - Pillar 1 (0.4) p.2, Alofisel Scientific Platform (0.2) p.2, Crohn's perianal fistulas - Pathophysiology animation for congress use (0.2) p.1, Alofisel Scientific Platform (0.4) p.2, Darvadstrocel_RWE-Daten_22 (0.3) p.5, Voiceover script for animation (0.3) p.1, Voiceover script for animation (0.4) p.1, Alofisel Scientific Platform - Pillar 1 (0.2) p.2, Alofisel Scientific Platform (0.2) p.5, Alofisel Scientific Platform (0.9) p.8, Alofisel Scientific Communications Platform (0.10) p.2, Alofisel Scientific Platform - Pillar 1 (0.4) 
p.2, Alofisel Scientific Platform (0.3) p.5, Alofisel Scientific Platform - Pillar 1 (0.3) p.2, Alofisel Scientific Platform (0.9) p.2"
      },
      "/M": { "U": "D:20200630104413+00'00'" },
      "/NM": { "U": "1593513852533" },
      "/CA": { "F": 1.0 },
      "/Subj": {
        "U": "A perianal ﬁstula (Latin for pipe ) is a chronic track of granulation tissue connecting 2 epithe- lial lined surfaces .1"
      },
      "/T": { "U": "Kate Herring" },
      "/Rotate": { "I": 0 },
      "/Rect": [ { "F": 315.676 }, { "F": 282.183 }, { "F": 394.806 }, { "F": 291.863 } ],
      "/QuadPoints":    [
      { "F": 396.603 },
      { "F": 317.863 },
      { "F": 555.722 },
      { "F": 317.863 },
      { "F": 396.603 },
      { "F": 308.183 },
      { "F": 555.722 },
      { "F": 308.183 },
      { "F": 315.676 },
      { "F": 304.863 },
      { "F": 555.7329999999999 },
      { "F": 304.863 },
      { "F": 315.676 },
      { "F": 295.183 },
      { "F": 555.7329999999999 },
      { "F": 295.183 },
      { "F": 315.676 },
      { "F": 291.863 },
      { "F": 394.806 },
      { "F": 291.863 },
      { "F": 315.676 },
      { "F": 282.183 },
      { "F": 394.806 },
      { "F": 282.183 }
    ],

      "/Subtype": { "N": "/Highlight" },
      "/Type": { "N": "/Annot" }
    }

The re-added annotation is:

    {
      "/T": { "U": "Kate Herring" },
      "/CA": { "I": 1 },
      "/CreationDate": { "U": "D:20230329191605Z00'00" },
      "/QuadPoints": [
        { "F": 387.666657 },
        { "I": 483 },
        { "I": 547 },
        { "I": 483 },
        { "F": 387.666657 },
        { "I": 493 },
        { "I": 547 },
        { "I": 493 },
        { "I": 307 },
        { "F": 496.333344 },
        { "I": 547 },
        { "F": 496.333344 },
        { "I": 307 },
        { "F": 505.666657 },
        { "I": 547 },
        { "F": 505.666657 },
        { "I": 307 },
        { "I": 509 },
        { "F": 385.666657 },
        { "I": 509 },
        { "I": 307 },
        { "I": 519 },
        { "F": 385.666657 },
        { "I": 519 }
      ],
      "/AP": { "/N": 351 },
      "/Popup": { "I": 1 },
      "/C": [ { "F": 0.972549 }, { "F": 0.819608 }, { "F": 0.278431 } ],
      "/Rect": [
        { "F": 306.6875 },
        { "F": 482.6875 },
        { "F": 547.3125 },
        { "F": 519.3125 }
      ],
      "/M": { "U": "D:20230329191605Z00'00" },
      "/F": { "I": 4 },
      "/P": { "I": 1 },
      "/Type": { "N": "/Annot" },
      "/Subtype": { "N": "/Highlight" }
    }

As you can see, both the /QuadPoints and the (older, less accurate) /Rect are wrong. It is equally noticeable that several other entries differ. In other words, the round trip seems to have happened at a higher level of abstraction, or with a tool that likes to change things.
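
A rough numeric check of the vertical relationship is to add each original y value to its re-added counterpart. The sketch below assumes the six quad edges in the two dumps correspond in order, which the dumps themselves do not confirm. If the page had been mirrored vertically, the sums should cluster around one constant.

```python
# y value of each quad edge, copied from the two /QuadPoints dumps above.
# Assumption: edges are listed in the same order in both dumps.
original_y = [317.863, 308.183, 304.863, 295.183, 291.863, 282.183]
readded_y = [483, 493, 496.333344, 505.666657, 509, 519]

# For a pure vertical mirror y -> H - y, every sum equals the same H.
sums = [round(a + b, 3) for a, b in zip(original_y, readded_y)]
spread = max(sums) - min(sums)

print(sums)
print(f"spread of the sums: {spread:.3f}")
```

A small spread would be consistent with a mapping of the form y -> H - y for some height H. The dumps do not state H, so this is only an inference from the numbers, not a confirmed cause.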

So I'm afraid I don't have an answer for you, but I hope this extra information helps in some way.

johnwhitington
  • That same question has been asked (and answered) on PyMuPDF's home page [here](https://github.com/pymupdf/PyMuPDF/discussions/2309), with more or less the same comment as yours. Adding the annotation to the new page, again using PyMuPDF, works **_exactly_** as it should. Unfortunately, the author gives no insight into which other tool he is using to re-insert the annotation. The new annotation's coordinates are (1) relocated vertically, and (2) flipped right-left. – Jorj McKie Mar 30 '23 at 12:14
  • I think I understand the reason for the problem: that other tool uses PDF's default geometry convention, i.e., point (0,0) is the page's bottom-left corner. PyMuPDF's geometry, for good reasons, puts point (0,0) at the page's **top-left**. So every coordinate handed to that other tool must be converted with `(x, y) -> (x, height - y)`, where height is the page height. – Jorj McKie Mar 30 '23 at 13:10
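
The conversion described in the last comment can be sketched as a pair of small helpers. This is a minimal illustration, not a verified fix: it assumes the target tool really expects a bottom-left origin, that the page is unrotated, and that the page origin is at (0, 0). The names `flip_points` and `flip_rect` are hypothetical, and the sample numbers are made up for illustration.

```python
from typing import List, Sequence


def flip_points(coords: Sequence[float], page_height: float) -> List[float]:
    """Map a flat [x0, y0, x1, y1, ...] list from a top-left to a bottom-left origin."""
    if len(coords) % 2:
        raise ValueError("expected an even number of values (x, y pairs)")
    flipped: List[float] = []
    for i in range(0, len(coords), 2):
        x, y = coords[i], coords[i + 1]
        flipped.extend([x, page_height - y])  # (x, y) -> (x, height - y)
    return flipped


def flip_rect(rect: Sequence[float], page_height: float) -> List[float]:
    """Flip a bounding box and reorder it so the first y stays the smaller one."""
    x0, y0, x1, y1 = rect
    return [x0, page_height - y1, x1, page_height - y0]


# Hypothetical values for illustration only.
# With PyMuPDF, the height would typically come from page.rect.height.
page_height = 100.0
print(flip_points([10, 20, 30, 40], page_height))
print(flip_rect([10, 20, 30, 40], page_height))
```

In the question's script, such helpers would be applied to the `quadPoints` and `boundingBox` values before they are written to the JSON file. The rectangle helper swaps the two y values because a plain flip would otherwise leave the box with its larger y first.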