Filter text in PDF by font with Borb using regex

Question

I am trying to extract text from a PDF using Borb, and I can see there is a clear example for extracting text by font name:

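    # imports (paths as used in the borb 2.x examples; they may differ between versions)
    import typing

    from borb.pdf import Document
    from borb.pdf import PDF
    from borb.toolkit import FontNameFilter
    from borb.toolkit import SimpleTextExtraction

    # create FontNameFilter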
    l0: FontNameFilter = FontNameFilter("Helvetica")

    # filtered text just gets passed to SimpleTextExtraction
    l1: SimpleTextExtraction = SimpleTextExtraction()
    l0.add_listener(l1)

    # read the Document
    doc: typing.Optional[Document] = None
    with open("UMIR-01032023-EN_4.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l0])

    # check whether we have read a Document
    assert doc is not None

    # print the filtered text of the first page (page index 0)
    print(l1.get_text()[0])

I wanted to know whether there is a way to select text by matching font names with a regex. For example, suppose font name one is `ABCD-Font` and font name two is `ABCD-Bold-Font`.

How can I extract the text for both?

Have you tried creating another `FontNameFilter` and calling `l0.add_listener` a second time? — Tim Roberts, Mar 17 '23 at 18:20
TBH I think that is the answer. I am creating a custom `FontNameFilter` which will check with `startswith` instead of an exact match. — IamButtman, Mar 17 '23 at 20:47

score 1 · Accepted Answer · answered Mar 18 '23 at 03:12

As suggested by Tim Roberts in a comment above, I ended up creating a custom `FontNameFilter` which compares the font name using `startswith` instead of an exact match:

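    # method of the custom FontNameFilter; self._font_name holds the prefix to match
    def _event_occurred(self, event: "Event") -> None:
        # filter ChunkOfTextRenderEvent
        if isinstance(event, ChunkOfTextRenderEvent):
            font_name: typing.Optional[str] = event.get_font().get_font_name()
            # forward the event only if the font name starts with the prefix;
            # a font without a name never matches
            if font_name is not None and font_name.startswith(self._font_name):
                for listener in self._listeners:
                    listener._event_occurred(event)
            return
        # default: forward all other events unchanged
        for listener in self._listeners:
            listener._event_occurred(event)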
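
Since the question asked about regex specifically, the same idea could probably be extended from a prefix check to a pattern check. The following is only a minimal sketch of the matching logic, using the standard `re` module and no Borb code at all. The helper name `make_font_name_matcher` and the pattern are hypothetical, and the font names are the ones from the question.

```python
import re
import typing


def make_font_name_matcher(
    pattern: str,
) -> typing.Callable[[typing.Optional[str]], bool]:
    """Return a predicate that tests a font name against a regular expression.

    Hypothetical helper for illustration; it is not part of borb.
    """
    compiled: typing.Pattern[str] = re.compile(pattern)

    def matches(font_name: typing.Optional[str]) -> bool:
        # a font without a name never matches
        if font_name is None:
            return False
        # fullmatch requires the whole name to fit the pattern,
        # so "ABCD-Font-Extra" would be rejected
        return compiled.fullmatch(font_name) is not None

    return matches


# pattern covering both font names from the question (an assumption about
# what "both" should include)
is_abcd_font = make_font_name_matcher(r"ABCD-(Bold-)?Font")

print(is_abcd_font("ABCD-Font"))       # True
print(is_abcd_font("ABCD-Bold-Font"))  # True
print(is_abcd_font("Helvetica"))       # False
print(is_abcd_font(None))              # False
```

In the custom filter above, the `font_name.startswith(self._font_name)` test could then be swapped for a call to such a predicate, assuming it is stored on the filter when it is constructed. Using `fullmatch` rather than `search` keeps the behaviour closer to the original exact match; `search` would be the looser choice.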
