1

Below is a piece of my code, where I'm searching for particular words and extracting their coordinates.

As per the documentation, page.searchFor(needle, hit_max=16, quads=False, flags=None) searches for needle on a page. Upper/lower case is ignored. The string may contain spaces.

First, I want the coordinates of exact matches only. Second, if the search word is "inter", the search also returns the coordinates of the "inter" inside the word "internalization" in the document, which conflicts with my task.

Is there any way to achieve this?

import fitz  # PyMuPDF

doc = fitz.open(document_name)

words = ["Midpoint", "CORPORATE", "internalization"]

for page in doc:
    page._wrapContents()

    for word in words:
        # case-insensitive substring search; returns one rectangle per hit
        text_instances = page.searchFor(word)

        for rect_coordinates in text_instances:
            page.addRedactAnnot(rect_coordinates, text_color=(0, 0, 0), fill=(0, 0, 0))

        page.apply_redactions()

RevolverRakk

2 Answers

1

You can use page.getText("words") to get the words on the page along with their locations.

A workaround that worked for me was to use page.searchFor() to get the location of a possible match, then pass a clip parameter to getText() based on a larger rectangle around that location. I then checked all the words returned by getText() for a match using re.

However, since you only want exact word matches, you can simply get all the words with page.getText("words") and iterate over them. You can also pass flags to control how hyphenation is handled. See the PyMuPDF documentation for details.
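As a rough sketch of the "iterate over all the words" idea: each item returned by get_text("words") (getText("words") in older releases) is a tuple of the form (x0, y0, x1, y1, word, block_no, line_no, word_no). The helper name exact_word_rects and the choice to strip surrounding punctuation are my own assumptions, not part of PyMuPDF.

```python
import string


def exact_word_rects(words, targets, case_sensitive=False):
    """Return the boxes of words that equal one of the targets exactly.

    words   -- tuples shaped like the items of page.get_text("words"):
               (x0, y0, x1, y1, word, block_no, line_no, word_no)
    targets -- the words to look for (hypothetical parameter name)
    """
    normalize = (lambda s: s) if case_sensitive else str.lower
    wanted = {normalize(t) for t in targets}

    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        # Assumption: "Inter," should still count as the word "Inter",
        # so punctuation glued to the word is ignored for the comparison.
        token = text.strip(string.punctuation)
        if normalize(token) in wanted:
            hits.append((x0, y0, x1, y1))
    return hits


# Possible use with PyMuPDF (not executed here, snake_case names assumed):
# for page in doc:
#     for box in exact_word_rects(page.get_text("words"), ["inter"]):
#         page.add_redact_annot(fitz.Rect(box), fill=(0, 0, 0))
#     page.apply_redactions()
```

Because whole words are compared, "inter" no longer matches inside "internalization". Note that the returned box still covers any punctuation attached to the word.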

1

You can expand the rect boundaries of your search term and check whether there is any adjacent text around the found match.

The function below (isExactMatch()) lets you optionally enable whole-word matching and case-sensitive matching.

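import re

def isExactMatch(page, term, clip, fullMatch=False, caseSensitive=False):
    # clip is one item from page.search_for(term, quads=True)

    # estimate the size of one character from the hit's bounding box
    termLen = len(term)
    termBboxLen = max(clip.height, clip.width)
    termfontSize = termBboxLen / termLen
    f = termfontSize * 2  # margin used to grow the clip in every direction

    clip = clip.rect  # enclosing rectangle of the quad

    # text of the first block found inside the expanded clip
    blocks = page.get_text("blocks", clip=clip + (-f, -f, f, f), flags=0)
    if not blocks:
        return False
    validate = blocks[0][4]

    flag = 0 if caseSensitive else re.IGNORECASE

    # escape the term so regex metacharacters in it are matched literally
    pattern = re.escape(term)
    if fullMatch:
        pattern = rf"\b{pattern}\b"
    return re.search(pattern, validate, flags=flag) is not None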

# how to use the isExactMatch function

term = "my_searchterm"
coordinates = page.search_for(term, quads=True)  # quads, because isExactMatch reads clip.rect
for inst in coordinates:
    if isExactMatch(page, term, inst, fullMatch=True, caseSensitive=False):
        print("DoSomething")

Note that f = termfontSize*2 is used to expand the boundary by f units in all directions. The value of f is twice the average size of one character of the term, estimated from its bbox.
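The text check itself can be tried without a PDF. Here is a minimal sketch, assuming the text around the hit has already been extracted into a string; the name contains_whole_word is hypothetical. It uses lookarounds instead of \b so that terms starting or ending with a non-word character (such as "$5.00") are still handled, and re.escape so the term is matched literally.

```python
import re


def contains_whole_word(text, term, case_sensitive=False):
    """True if term occurs in text without a word character directly next to it."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # (?<!\w) and (?!\w): no letter, digit or underscore may touch the term
    pattern = rf"(?<!\w){re.escape(term)}(?!\w)"
    return re.search(pattern, text, flags) is not None


print(contains_whole_word("internalization of costs", "inter"))
print(contains_whole_word("an Inter alia clause", "inter"))
```

The first call rejects the hit inside "internalization", while the second accepts the standalone word regardless of case.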


UPDATE: Sep 22, 2021:

Note that this function does not reliably match text that spans multiple lines, since the clip region does not cover all of the lines.

Gangula
  • Note that I was previously using `page.get_textbox`, but that was returning an array of individual words, which made checking for exact match difficult. So I switched to `page.get_text` which returns an array where the 5th item (`[4]`) is the whole text as one string. – Gangula Sep 22 '21 at 13:25
  • I have initiated a discussion for this in the GitHub Repo: [Exact match using PuMuPDF](https://github.com/pymupdf/PyMuPDF/discussions/1277) – Gangula Sep 22 '21 at 13:39