I am parsing a PDF file using PyMuPDF (great library, by the way!),
but I need to identify words that are crossed out.
Is there any way to do that?
The PyMuPDF docs do not seem to mention crossed-out (strike-out) text, except when dealing with annotations, but they do describe these span "flags":
bit 0: superscripted (2^0)
bit 1: italic (2^1)
bit 2: serifed (2^2)
bit 3: monospaced (2^3)
bit 4: bold (2^4)
So there may well be a flag bit for strike-out that is not listed in the docs.
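To make the bit list above concrete, here is a minimal sketch in plain Python (no PyMuPDF needed) that decodes a flags integer into the documented style names and also reports any bits that fall outside the five documented ones. The names `STYLE_BITS` and `decode_flags` are my own and not part of PyMuPDF.

```python
# Bit positions as listed in the PyMuPDF docs for a span's "flags" value.
STYLE_BITS = {
    0: "superscripted",
    1: "italic",
    2: "serifed",
    3: "monospaced",
    4: "bold",
}


def decode_flags(flags):
    """Return (documented style names set in flags, leftover undocumented bits)."""
    names = [name for bit, name in STYLE_BITS.items() if flags & (1 << bit)]
    known_mask = sum(1 << bit for bit in STYLE_BITS)
    unknown = flags & ~known_mask  # bits that are set but not in STYLE_BITS
    return names, unknown


print(decode_flags(20))  # (['serifed', 'bold'], 0)
print(decode_flags(6))   # (['italic', 'serifed'], 0)
```

If a span you know is struck out ever gave a non-zero second value, that would hint at an undocumented bit; if it is always zero, strike-out is probably not stored in the flags at all.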
One way to access these flags is via the TextPage dictionary structure, under the "flags" key of each span. For more info on this, see section 6.18.1.4, "Span Dictionary",
in the docs.
I wanted to pull bold text out of documents and wrote this function:
def get_bold_text_from_PDF_page(page):
    '''
    Function to get bold text from a PyMuPDF Page object.
    Parameters:
        page: PyMuPDF Page object (e.g. one yielded by iterating over a document)
    Returns:
        list of dictionaries; each dictionary contains these fields:
        size, flags, font, color, ascender, descender, text, origin, bbox, page_number
    '''
    # flags=11 is the text extraction option, not the span style flags tested below.
    blocks = page.get_text("dict", flags=11)["blocks"]
    page_bold_text_list = []
    for block in blocks:
        for line in block["lines"]:
            for span in line["spans"]:
                if span['flags'] == 20:  # change the 20 here.
                    # Record which page the span came from.
                    span['page_number'] = page.number
                    page_bold_text_list.append(span)
    print(page_bold_text_list)
    return page_bold_text_list
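Strangely, for me 20 was bold. That is probably because 20 = 16 + 4, i.e. the bold (2^4) and serifed (2^2) bits were both set in my documents; a bold sans-serif font would give 16 instead.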
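Since the flags are individual bits, a bit-mask test may be more robust than comparing against one exact number. Here is a rough sketch of that idea. It runs on a hand-made dictionary shaped like the output of get_text("dict"), so the sample data and the name `spans_with_flag` are made up for illustration and do not come from a real PDF.

```python
BOLD = 1 << 4  # bit 4 in the documented flags list


def spans_with_flag(page_dict, mask):
    """Collect spans whose flags have every bit of mask set."""
    found = []
    for block in page_dict["blocks"]:
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                if (span["flags"] & mask) == mask:
                    found.append(span)
    return found


# Hypothetical sample data mimicking the "dict" output structure.
sample = {
    "blocks": [
        {"lines": [{"spans": [
            {"text": "plain", "flags": 4},
            {"text": "bold serif", "flags": 20},
            {"text": "bold sans", "flags": 16},
        ]}]},
        {"type": 1},  # stand-in for an image block
    ]
}

print([s["text"] for s in spans_with_flag(sample, BOLD)])
# ['bold serif', 'bold sans']
```

With a real page you would pass in page.get_text("dict") instead of `sample`. An exact match on 20 would have missed the "bold sans" span here.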
The Adobe PDF specification (PDF 32000-1:2008) might also be worth a read: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf