How can I get the font name of text in a PDF file?

Question

I have written a script to extract some information from a PDF file.

My code:

import fitz  # PyMuPDF

# clip region: crops the page margins to ignore header, footer and left side
rect = fitz.Rect(22, 52, 562, 802)

for page in doc:
    blocks = page.get_text("blocks", clip=rect, flags=fitz.TEXTFLAGS_TEXT)

    for block in blocks:
        block_text = block[4]  # text of the block (same item as block[-3])
        first_word, _, rest = block_text.partition(" ")

        if block_text[0].isdigit():  # block starts with a number: heading
            if first_word.count(".") == 0:  # no dot in the number: title
                nr = first_word
                txt = rest
            else:  # dotted number: subtitle
                sub_nr = '="' + first_word + '"'
                sub_txt = rest

        elif block_text.startswith("[V2G"):  # block that starts with a [V2G...] id
            first_line, _, remainder = block_text.partition("\n")
            id = first_line.replace("[", " ").replace("]", " ")
            text = remainder.strip()
            data.append(req(filename, nr, txt, sub_nr, sub_txt, id, text))

I would like to add another condition on the txt variable that depends on the font name:

 if font1 == 'Cambria-Bold':
   txt=.....

How can I get the font name?

I found the method page.get_fonts() in the PyMuPDF library, but it lists all the fonts on the page rather than the font of a specific piece of text. How can I use it for my purpose?

Is there another Python library that provides font information?

Thank you for helping

This is no problem in PyMuPDF: you just have to use a different variant `.get_text("dict",...)`. This will return a dictionary of stacked dictionaries, which is explained [here](https://pymupdf.readthedocs.io/en/latest/textpage.html#structure-of-dictionary-outputs). — Jorj McKie, Apr 04 '23 at 18:37
The lowest level inside this dictionary is the text "span": the portion of text in a line that has the same font properties: font name, fontsize, text color, etc. — Jorj McKie, Apr 04 '23 at 18:39
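
A minimal sketch of the approach described in these comments, assuming a PyMuPDF page object. It walks the "dict" output block by block and collects the font names of each block's spans, so a font test can sit next to the existing text tests. The helper name block_text_and_fonts is made up for this example, and the exact font string (for instance 'Cambria-Bold') depends on the PDF, so it is worth printing the names once before comparing against them.

```python
def block_text_and_fonts(page, clip=None):
    """Yield (text, fonts) for every text block on a PyMuPDF page.

    text  : the block's lines joined with newlines
    fonts : set of font names used by the spans of that block
    """
    page_dict = page.get_text("dict", clip=clip)
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # 0 = text block; image blocks have no "lines"
            continue
        fonts = set()
        lines = []
        for line in block["lines"]:
            spans = line["spans"]
            fonts.update(span["font"] for span in spans)
            lines.append("".join(span["text"] for span in spans))
        yield "\n".join(lines), fonts


# Usage sketch ("input.pdf" is a placeholder path):
#
# import fitz  # PyMuPDF
# doc = fitz.open("input.pdf")
# rect = fitz.Rect(22, 52, 562, 802)
# for page in doc:
#     for text, fonts in block_text_and_fonts(page, clip=rect):
#         if text[:1].isdigit() and "Cambria-Bold" in fonts:
#             print(text, fonts)
```

Collecting a set per block keeps the block-level logic from the question intact. If one block mixes fonts, the individual spans can be inspected instead.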

score 1 · Answer 1 · answered Apr 07 '23 at 22:55

Disclaimer: I am the author of borb, the library used in this answer.

In borb, rendering a Page is a process you can attach EventListener instances to. An EventListener gets notified whenever a rendering instruction (such as "render text" or "render image") is processed.

borb already comes with a few useful implementations of EventListener to get you started.

In particular, I would look at font_name_filter, which passes text-rendering events on to its child listeners only when the text uses a particular font.

You can find its code here.

    def _event_occurred(self, event: "Event") -> None:
        # filter ChunkOfTextRenderEvent
        if isinstance(event, ChunkOfTextRenderEvent):
            font_name: typing.Optional[str] = event.get_font().get_font_name()
            if font_name == self._font_name:
                for l in self._listeners:
                    l._event_occurred(event)
            return
        # default
        for l in self._listeners:
            l._event_occurred(event)

You can of course build your own EventListener and obtain:

the text that is being rendered
the coordinates at which the text is being rendered
the font in which the text is being rendered
etc
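
The listener idea can be illustrated without borb itself. In the sketch below, FakeFont, FakeTextEvent and FontCollector are hypothetical stand-ins, not borb classes, and the text attribute is an assumption. Only the method names get_font(), get_font_name() and _event_occurred() are taken from the snippet above.

```python
class FakeFont:
    """Stand-in for a font object (hypothetical, not a borb class)."""

    def __init__(self, name):
        self._name = name

    def get_font_name(self):
        return self._name


class FakeTextEvent:
    """Stand-in for a text-rendering event (hypothetical, not a borb class)."""

    def __init__(self, text, font):
        self.text = text  # assumed attribute, for illustration only
        self._font = font

    def get_font(self):
        return self._font


class FontCollector:
    """Listener that records (font name, text) for every text event it sees."""

    def __init__(self):
        self.seen = []

    def _event_occurred(self, event):
        if isinstance(event, FakeTextEvent):
            self.seen.append((event.get_font().get_font_name(), event.text))


collector = FontCollector()
events = [
    FakeTextEvent("1 Scope", FakeFont("Cambria-Bold")),
    FakeTextEvent("body text", FakeFont("Cambria")),
]
for event in events:
    collector._event_occurred(event)

# Keep only the text rendered in the font of interest.
titles = [text for font, text in collector.seen if font == "Cambria-Bold"]
```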

In order to learn how to work with EventListener objects, check out the documentation (it's in a separate repository) here.
