Is there any way to identify crossed-out words in a PDF file while parsing it using Python?

Question

I am parsing a PDF file using PyMuPDF (great library, by the way!).

But I need to identify words that are crossed out.

Is there any way to do that?

can you include a sample pdf? – Vishal Singh, Aug 28 '20 at 04:05

Cam · Answer 1 · 2021-07-15T21:35:02.317

The PyMuPDF docs do not seem to mention crossed-out (strike-out) text, except when dealing with annotations, but they do describe these font "flags":

bit 0: superscripted (2^0 = 1)
bit 1: italic (2^1 = 2)
bit 2: serifed (2^2 = 4)
bit 3: monospaced (2^3 = 8)
bit 4: bold (2^4 = 16)

So there might well be a code for strike-out that is not listed in the docs.
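To check whether a span carries a bit beyond the five listed ones, the flags integer can be split into its individual bits. This is only a minimal sketch: `FLAG_NAMES` and `decode_flags` are names made up for illustration, and the table contains nothing beyond the five bits quoted above.

```python
# Bit positions exactly as listed above; anything else is reported as unknown.
FLAG_NAMES = {
    0: "superscripted",
    1: "italic",
    2: "serifed",
    3: "monospaced",
    4: "bold",
}


def decode_flags(flags):
    """Split a span's integer flags into (known style names, unknown bit numbers)."""
    known = [name for bit, name in FLAG_NAMES.items() if flags & (1 << bit)]
    unknown = [
        bit
        for bit in range(flags.bit_length())
        if bit not in FLAG_NAMES and flags & (1 << bit)
    ]
    return known, unknown


print(decode_flags(20))  # (['serifed', 'bold'], [])
```

Running this over the spans of a page that contains crossed-out text would show whether any unlisted bit is actually set for those spans.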

One way to access these codes is via the TextPage dictionary structure, using the span entries. For more info on this, see 6.18.1.4 Span Dictionary in the docs.

I wanted to pull bold text out of documents and wrote this function:

def get_bold_text_from_PDF_page(page):
    '''
    Function to get bold text from a PyMuPDF Page object
    Parameters:
            page: PyMuPDF Page object (e.g. one item from iterating over the doc)
    Returns:
            list of dictionaries, each dictionary contains these fields:
            size, flags, font, color, ascender, descender, text, origin, bbox, page_number
    '''
    # Call get_text on the page instance, not on the fitz.Page class.
    blocks = page.get_text("dict", flags=11)["blocks"]
    page_bold_text_list = []
    for block in blocks:
        for line in block["lines"]:
            for span in line["spans"]:
                # Test the bold bit (bit 4 = 16) instead of comparing to 20,
                # which only matches spans that are bold AND serifed (16 + 4).
                if span['flags'] & 16:  # change the mask here for other styles.
                    span['page_number'] = page.number
                    page_bold_text_list.append(span)
    return page_bold_text_list

Strangely, for me 20 was bold. That fits the flags above: 20 = 2^4 + 2^2, i.e. bit 4 (bold) and bit 2 (serifed) are both set.
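Since the docs do mention strike-out in connection with annotations, that route may be worth trying as well. The following is a rough sketch, not a confirmed solution: `words_under_strikeout_annots` is a hypothetical helper, and it assumes the crossing-out was added as a StrikeOut annotation. It will not see lines drawn directly into the page content. It only relies on `page.annots()`, `page.get_text("words")`, and each annotation's `type` and `rect`.

```python
def words_under_strikeout_annots(page):
    """Return words whose bbox overlaps a StrikeOut annotation rectangle.

    `page` is assumed to be a PyMuPDF Page object.
    """
    # Each word is (x0, y0, x1, y1, text, block_no, line_no, word_no).
    words = page.get_text("words")
    struck_indices = set()
    for annot in page.annots():
        # annot.type is a pair like (11, 'StrikeOut').
        if annot.type[1] != "StrikeOut":
            continue
        ax0, ay0, ax1, ay1 = tuple(annot.rect)
        for i, word in enumerate(words):
            wx0, wy0, wx1, wy1 = word[:4]
            mid_y = (wy0 + wy1) / 2
            # Horizontal overlap, and the word's vertical midpoint lies
            # inside the annotation rectangle.
            if wx0 < ax1 and wx1 > ax0 and ay0 <= mid_y <= ay1:
                struck_indices.add(i)
    # Keep the order in which get_text("words") returned the words.
    return [words[i][4] for i in sorted(struck_indices)]


# Usage (assumes PyMuPDF is installed; "sample.pdf" is a placeholder name):
# import fitz
# doc = fitz.open("sample.pdf")
# for page in doc:
#     print(page.number, words_under_strikeout_annots(page))
```

One caveat: for a strike-out that spans several lines, `annot.rect` is the bounding box of the whole annotation, so this may match too many words. Comparing against the annotation's `vertices` (its individual quads) would be more precise.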

The Adobe PDF specification might be worth a read: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
