How to find highlighted text in a PDF with Python (or differentiate it from its unhighlighted counterpart)

Question

I have several PDF files containing multiple-choice questions. The choices are formatted as a table, and the answer page repeats the same table with the correct answer highlighted.

I want to create one PDF or TXT file with only the questions and a separate PDF or TXT file with only the answers, in order (like 1-D, 2-C, 3-A, etc.).

Background and details: Since each question starts with the word "Question", extracting the questions is relatively straightforward. Since each answer-revealing page contains at least choices B and C, finding those pages is also straightforward. What I could not work out is how to tell whether a specific piece of text is highlighted.

My backup plan is either to do it manually, by extracting the answer-revealing pages in order into a separate PDF file, or to convert the choice tables on the question page and on the answer-revealing page to images and try to detect the gray highlight in them. The latter might mean more suffering than doing it manually.

The two tables (the choices for the question and the answer-revealing page with the choices) are always on different pages. They are almost always on directly consecutive pages, provided the question fits on one page; a question that does not fit on one page is rare.

import glob

from pypdf import PdfReader, PdfWriter

# 'samplepath' is a placeholder glob pattern for the input PDF files
flList = glob.glob('samplepath')
writer = PdfWriter()
questionString = ''

for fl in flList:
    reader = PdfReader(fl)
    print(fl)
    for i_page, page in enumerate(reader.pages):
        txPag = page.extract_text()
        if "Question" in txPag:
            # Question page: collect its text
            questionString += txPag
        elif "\nB" in txPag and "\nC" in txPag:
            # Answer-revealing page: it lists at least choices B and C
            # TODO: Answer should be here, but how to extract the highlighted choice and store independently
            pass

`pypdf` does not provide support for extracting highlighted text. For this, you can use the `pdfplumber` module in Python: `pip install pdfplumber`. See the docs at https://pypi.org/project/pdfplumber/. — Pravash Panigrahi, May 07 '23 at 10:55
Check this question https://stackoverflow.com/questions/58110777/how-to-get-background-color-of-a-text-in-pymupdf — Andrey Ivanov, May 07 '23 at 11:00
PyMuPDF allows you to find the prevalent color within a rectangle. So if you know the text rectangle or the rectangles within which to find the questions / answers, you can make a pixmap `pix = page.get_pixmap(clip=rect)`, then check `top_color = pix.color_topusage()`. Then top_color is a tuple `(factor, pixel)`, where factor is a float between 0 and 1 giving the fraction of the area covered by the RGB color `pixel`, a bytes value of length 3. White would be `b'\xff\xff\xff'`, blue `b'\x00\x00\xff'`, some gray `b'\xc8\xc8\xc8'` etc. — Jorj McKie, May 07 '23 at 12:00
Taking screenshots of your picture, the rectangle of text with a white background delivers `(0.152, b'\xff\xff\xff\xff')` as the top color, and the gray one `(0.179, b'\x7f\x7f\x7f\xff')`. So they should be clearly distinguishable from each other. — Jorj McKie, May 07 '23 at 12:18
Thank you all very much, PyMuPDF did it; only installing it on M1 was a bit tricky. Can I somehow upvote you, or is it OK if I post the answer myself as code and mark it as the right answer? — Uğur Dinç, May 07 '23 at 16:08
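
The PyMuPDF approach from the comments above could look roughly like the sketch below. This is not the code the asker ended up with. It only illustrates the `page.get_pixmap(clip=rect)` and `pix.color_topusage()` calls named in the comments. The helper names `is_gray_highlight` and `highlighted_choice`, the `choice_rects` mapping, and the gray thresholds are all assumptions made for illustration. The rectangle of each choice has to be located separately, for instance from the table layout or a text search.

```python
def is_gray_highlight(pixel, tolerance=10, low=40, high=240):
    """Return True if the dominant pixel looks like a gray highlight.

    `pixel` is the bytes value from `color_topusage()`: 3 bytes for RGB,
    4 bytes if an alpha channel is present. The thresholds are assumed
    values for illustration and will likely need tuning per document.
    """
    r, g, b = pixel[0], pixel[1], pixel[2]
    # Gray means the three channels are (nearly) equal ...
    channels_match = max(r, g, b) - min(r, g, b) <= tolerance
    # ... and the shade is neither near-black (text) nor near-white (paper).
    return channels_match and low < r < high


def highlighted_choice(page, choice_rects):
    """Return the label of the first choice with a gray dominant color.

    `page` is expected to be a PyMuPDF page. `choice_rects` is a
    hypothetical mapping such as {"A": rect_a, "B": rect_b, ...}.
    Returns None if no rectangle is dominated by gray.
    """
    for label, rect in choice_rects.items():
        pix = page.get_pixmap(clip=rect)        # render only this rectangle
        factor, pixel = pix.color_topusage()    # (area fraction, dominant color)
        if is_gray_highlight(pixel):
            return label
    return None


# Possible usage, assuming PyMuPDF is installed and the rectangles are known:
#   import fitz  # PyMuPDF
#   doc = fitz.open("answers.pdf")              # hypothetical file name
#   answer = highlighted_choice(doc[0], choice_rects)
```

Checking only the dominant color keeps the test cheap, but it assumes the highlight fills most of each rectangle. If the rectangles are much larger than the highlighted cell, white may still dominate and `factor` would be worth inspecting as well.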
