
Problem Statement

  1. Read a PDF and search for a word.
  2. If the word is found, annotate it and crop an area around the annotated text from the PDF page.
  3. Each cropped image should contain only one annotation.

Libraries and Versions

  1. python-3.6
  2. fitz-0.0.1.dev2
  3. pymupdf-1.17.5

Issue faced

For the first two iterations, the annotation and the cropping both work as expected. But on later iterations over the remaining occurrences of the search word in the text instances, both the crop around that area and the annotation of the search word fail. I can't find a solution for this problem.

import fitz  # PyMuPDF

def cropPdf(pdfName, word):
    c = 0
    # opening the pdf file using fitz
    fitz_doc = fitz.open(pdfName)

    # getting the first page of the doc
    fitz_page = fitz_doc[0]
    # finding all instances where the search word is found
    text_instances = fitz_page.searchFor(word)
    # iterating through each text instance
    for text_cord in text_instances:
        c = c + 1
        # adding a highlight (rectangle annotation) around the search word
        highlight = fitz_page.addRectAnnot(text_cord)
        # getting the bounding box coordinates
        x0, y0, x1, y1 = highlight.rect
        # here I set the cropping area around the annotated text
        fitz_page.setCropBox(fitz.Rect(x0+600, y0+600, x0-600, y0-600))
        # rendering the cropped page to a pixmap
        pix = fitz_page.getPixmap()
        print(fitz_page.number)
        base_name_highlight = "output" + str(c) + ".png"
        # saving the cropped area as a png file
        pix.writeImage("./highlight_folder/" + base_name_highlight)
        # deleting the annotation so that the next cropped area does not
        # contain a duplicate annotation when the next occurrence is annotated
        fitz_page.deleteAnnot(highlight)

cropPdf(pdfName="A4_4.pdf", word="INSULATION")

Result Images

  1. Expected output for each cropped image

  2. Failure case while cropping

Vishal Singh
Jacob Lawrence

1 Answer


Changing the cropbox can shift all the coordinates of the page. So, before entering the annotation loop, I should save the initial state of the cropbox in a variable, and at the end of each iteration I should reset the page to that initial cropbox. That way the next occurrence can be annotated without any change to the coordinates.
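
Here is a minimal sketch of that idea, not a tested drop-in fix. It uses the camelCase method names from pymupdf-1.17.5 (newer releases rename them to search_for, add_rect_annot, set_cropbox, get_pixmap, Pixmap.save and delete_annot). The helper crop_box_around, the function name crop_pdf and the margin of 100 points are my own choices for illustration. It also assumes the page is unrotated and that its initial cropbox starts at (0, 0), so the search coordinates and the cropbox coordinates coincide.

```python
def crop_box_around(rect, margin, bounds):
    """Grow rect by margin on every side and clamp it to bounds.

    rect and bounds are (x0, y0, x1, y1) tuples. Hypothetical helper,
    not part of PyMuPDF.
    """
    x0, y0, x1, y1 = rect
    bx0, by0, bx1, by1 = bounds
    return (
        max(x0 - margin, bx0),
        max(y0 - margin, by0),
        min(x1 + margin, bx1),
        min(y1 + margin, by1),
    )


def crop_pdf(pdf_name, word, out_dir="./highlight_folder", margin=100):
    # PyMuPDF is imported here so the helper above can be used without it.
    import fitz

    doc = fitz.open(pdf_name)
    page = doc[0]
    # Save the untouched cropbox before the loop changes it.
    initial_cropbox = fitz.Rect(page.CropBox)
    # These hits are valid only while the initial cropbox is in place.
    hits = page.searchFor(word)
    for count, hit in enumerate(hits, start=1):
        highlight = page.addRectAnnot(hit)
        # Assumption: initial cropbox starts at (0, 0), see the note above.
        crop = crop_box_around(tuple(hit), margin, tuple(initial_cropbox))
        page.setCropBox(fitz.Rect(crop))
        page.getPixmap().writeImage(f"{out_dir}/output{count}.png")
        # Reset the cropbox so the remaining hits still line up.
        page.setCropBox(initial_cropbox)
        # Remove the annotation so the next crop shows only one highlight.
        page.deleteAnnot(highlight)
    doc.close()
    return len(hits)
```

It would be called as crop_pdf("A4_4.pdf", "INSULATION"). Clamping the crop rectangle to the initial cropbox keeps it inside the page, which setCropBox needs for hits near the page edge.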

Jacob Lawrence