I am trying to extract text from a PDF using borb, and I can see there is a clear example for extracting text by font name:
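import typing

# import paths below are an assumption and may differ between borb versions
from borb.pdf import Document, PDF
from borb.toolkit import FontNameFilter, SimpleTextExtraction

# create FontNameFilter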
l0: FontNameFilter = FontNameFilter("Helvetica")
# filtered text just gets passed to SimpleTextExtraction
l1: SimpleTextExtraction = SimpleTextExtraction()
l0.add_listener(l1)
# read the Document
doc: typing.Optional[Document] = None
with open("UMIR-01032023-EN_4.pdf", "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [l0])
# check whether we have read a Document
assert doc is not None
# print the filtered text of the first page
print(l1.get_text()[0])
I would like to know whether there is a way to select text by matching font names against a regex. For example:
Font name one is: ABCD-Font
Font name two is: ABCD-Bold-Font
How can I extract the text in both fonts?
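To make the goal concrete, here is a minimal sketch of the matching I have in mind, using only Python's `re` module. It does not call borb at all; `FONT_NAME_PATTERN` and `font_name_matches` are hypothetical names I made up for illustration, and the pattern is just one possible regex that covers the two font names above.

```python
import re

# Hypothetical pattern: matches "ABCD-Font" and "ABCD-Bold-Font"
FONT_NAME_PATTERN = re.compile(r"ABCD-(?:Bold-)?Font")


def font_name_matches(font_name: str) -> bool:
    """Return True when the whole font name matches the pattern."""
    return FONT_NAME_PATTERN.fullmatch(font_name) is not None


# Quick check against the two names from the question and one other font
for name in ["ABCD-Font", "ABCD-Bold-Font", "Helvetica"]:
    print(name, font_name_matches(name))
```

What I am looking for is a way to apply a check like this where `FontNameFilter("Helvetica")` currently compares against a single fixed name.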