I have written a script to extract some information from a PDF file.

My code:

import fitz  # PyMuPDF

# doc, data, req and filename are defined elsewhere in the script

# crop page margins to ignore header, footer, left side
rect = fitz.Rect(22, 52, 562, 802)

for page in doc:
    blocks = page.get_text("blocks", clip=rect, flags=fitz.TEXTFLAGS_TEXT)

    for block in blocks:
        # each block is (x0, y0, x1, y1, text, block_no, block_type)
        block_text = block[4]

        if block_text[0].isdigit():  # check if title
            number, _, rest = block_text.partition(" ")
            if number.count(".") == 0:  # no dot in the number: top-level title
                nr = number
                txt = rest
            else:  # dotted number: subtitle
                sub_nr = '="' + number + '"'
                sub_txt = rest

        elif block_text.startswith("[V2G"):
            first_line, _, remainder = block_text.partition("\n")
            id = first_line.replace("[", " ").replace("]", " ")
            text = remainder.strip()
            data.append(req(filename, nr, txt, sub_nr, sub_txt, id, text))

I would like to add another condition for the txt variable that depends on the font name.

 if font1 == 'Cambria-Bold':
   txt=.....

How can I get the font name?

I found the method page.get_fonts() in the PyMuPDF library, but it lists all the fonts on the page rather than the font of a specific piece of text. How can I use this method for my purpose?

Is there another Python library that provides font info?

Thank you for helping

marc_s
user34088
    This is no problem in PyMuPDF: you just have to use a different variant `.get_text("dict",...)`. This will return a dictionary of stacked dictionaries, which is explained [here](https://pymupdf.readthedocs.io/en/latest/textpage.html#structure-of-dictionary-outputs). – Jorj McKie Apr 04 '23 at 18:37
    The lowest level inside this dictionary is the text "span": the portion of text in a line that has the same font properties: font name, fontsize, text color, etc. – Jorj McKie Apr 04 '23 at 18:39
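The "dict" variant described in these comments can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: `blocks_with_fonts` and `titles_in_font` are hypothetical helper names, and the clip rectangle and the 'Cambria-Bold' font name are taken from the question. The actual font names stored in a given PDF may differ, so it is worth printing them once before relying on a comparison.

```python
from typing import List, Set, Tuple


def blocks_with_fonts(page_dict: dict) -> List[Tuple[str, Set[str]]]:
    """Return (text, font names) for each text block of a "dict" extraction.

    page_dict is the value returned by page.get_text("dict", ...).
    Helper name is hypothetical, not part of PyMuPDF.
    """
    result = []
    for block in page_dict.get("blocks", []):
        lines = block.get("lines", [])
        if not lines:
            # image blocks have no "lines" key, so they are skipped here
            continue
        # rebuild the block text: spans are joined within a line,
        # lines are joined with "\n" (similar to the "blocks" output)
        text = "\n".join(
            "".join(span["text"] for span in line["spans"]) for line in lines
        )
        # collect every font name used by any span of the block
        fonts = {span["font"] for line in lines for span in line["spans"]}
        result.append((text, fonts))
    return result


def titles_in_font(path: str, font_name: str = "Cambria-Bold") -> List[str]:
    """Collect the text of all blocks that use font_name (hypothetical helper)."""
    import fitz  # PyMuPDF, imported here so the helper above works without it

    rect = fitz.Rect(22, 52, 562, 802)  # same crop as in the question
    found = []
    doc = fitz.open(path)
    try:
        for page in doc:
            page_dict = page.get_text("dict", clip=rect, flags=fitz.TEXTFLAGS_TEXT)
            for text, fonts in blocks_with_fonts(page_dict):
                if font_name in fonts:
                    found.append(text)
    finally:
        doc.close()
    return found
```

Because a block can mix several fonts, the sketch tests membership in the set of fonts rather than equality with a single name. If the condition should apply only to the first word or line of a block, the font of the first span of the first line could be checked instead.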

1 Answer

Disclaimer: I am the author of borb, the library used in this answer.

In borb, rendering a Page is a process you can attach EventListener instances to. An EventListener gets notified whenever a rendering instruction (such as "render text" or "render image") is processed.

borb already comes with a few useful implementations of EventListener to get you started.

In particular, I would look at font_name_filter, which passes text rendering events on to its child listeners only when the text is rendered in a particular font.

You can find its code here.

    def _event_occurred(self, event: "Event") -> None:
        # filter ChunkOfTextRenderEvent
        if isinstance(event, ChunkOfTextRenderEvent):
            font_name: typing.Optional[str] = event.get_font().get_font_name()
            if font_name == self._font_name:
                for l in self._listeners:
                    l._event_occurred(event)
            return
        # default
        for l in self._listeners:
            l._event_occurred(event)

You can of course build your own EventListener and obtain:

  • the text that is being rendered
  • the coordinates at which the text is being rendered
  • the font in which the text is being rendered
  • etc
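The listener chain described above can be illustrated with a small library-independent sketch. Note that every class below (`TextEvent`, `Listener`, `FontFilter`, `TextCollector`) is a hypothetical stand-in written for illustration, not one of borb's classes, and unlike the filter shown earlier this sketch only models text events.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextEvent:
    """Hypothetical stand-in for a "render text" event (not a borb class)."""
    text: str
    font_name: str
    x: float
    y: float


class Listener:
    """Hypothetical base class: forwards each event to its child listeners."""

    def __init__(self) -> None:
        self.children: List["Listener"] = []

    def add_listener(self, child: "Listener") -> "Listener":
        self.children.append(child)
        return self

    def event_occurred(self, event: TextEvent) -> None:
        for child in self.children:
            child.event_occurred(event)


class FontFilter(Listener):
    """Forwards an event only if it was rendered in the wanted font."""

    def __init__(self, font_name: str) -> None:
        super().__init__()
        self.font_name = font_name

    def event_occurred(self, event: TextEvent) -> None:
        if event.font_name == self.font_name:
            super().event_occurred(event)


class TextCollector(Listener):
    """Stores the text and coordinates of every event it receives."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[Tuple[str, float, float]] = []

    def event_occurred(self, event: TextEvent) -> None:
        self.chunks.append((event.text, event.x, event.y))


# Feed two made-up events through the chain: filter first, then collector.
collector = TextCollector()
root = FontFilter("Cambria-Bold").add_listener(collector)
for event in [
    TextEvent("1 Scope", "Cambria-Bold", 22.0, 700.0),
    TextEvent("Body text", "Cambria", 22.0, 680.0),
]:
    root.event_occurred(event)
```

After this runs, `collector.chunks` holds only the chunk rendered in 'Cambria-Bold', together with its coordinates. With borb itself, the events and the base class would come from the library rather than from these stand-ins.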

In order to learn how to work with EventListener objects, check out the documentation (it's in a separate repository) here.

Joris Schellekens