I have several PDF files containing multiple-choice questions. The choices for each question are formatted as a table, and the answers are given as the same table with the correct choice highlighted.
I want to create a PDF or txt file containing only the questions, and a separate PDF or txt file containing only the answers in order (like 1-D, 2-C, 3-A, etc.).
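To make the target answer format concrete, here is a minimal sketch of how the answer file contents could be built once the correct letters are known. The function name `format_answer_key` and the sample letters are hypothetical; finding the letters is the part I am stuck on.

```python
def format_answer_key(answers):
    """Join answer letters into a '1-D, 2-C, 3-A' style string.

    `answers` is a list of choice letters in question order.
    """
    return ", ".join(
        f"{number}-{letter}" for number, letter in enumerate(answers, start=1)
    )


# Hypothetical letters, standing in for whatever the extraction step finds.
print(format_answer_key(["D", "C", "A"]))
```

The resulting string could then be written to a txt file as-is.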
Background and details: Since each question starts with the word "Question", it is relatively straightforward to extract the questions. Since each answer-revealing page contains at least choices B and C, it is also straightforward to find where the answer-revealing page is. However, I couldn't find a way to tell whether a specific piece of text is highlighted.
My backup plan is either to do it manually, by extracting the answer-revealing pages in order into a separate PDF file, or to convert the choice tables on the question page and on the answer-revealing page to images and try to detect the gray highlight in them. The latter might mean more suffering than doing it manually.
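For the image-based fallback, this is roughly the check I have in mind, as a sketch only. It assumes each choice cell has already been rendered and cropped into a list of `(r, g, b)` pixel tuples (that rendering step is not shown). The function names and the thresholds `low`, `high` and `tolerance` are my own guesses and would need tuning against the real highlight color.

```python
def gray_fraction(pixels, low=120, high=230, tolerance=12):
    """Return the share of pixels that look like a mid-gray highlight.

    A pixel counts as gray when its channels are nearly equal (within
    `tolerance`) and its brightness lies between `low` and `high`, which
    excludes white background and black text. All thresholds are guesses.
    """
    total = 0
    gray = 0
    for r, g, b in pixels:
        total += 1
        is_neutral = max(r, g, b) - min(r, g, b) <= tolerance
        is_mid_tone = low <= (r + g + b) / 3 <= high
        if is_neutral and is_mid_tone:
            gray += 1
    return gray / total if total else 0.0


def pick_highlighted(choice_pixels):
    """Return the choice letter whose cropped cell contains the most gray.

    `choice_pixels` maps a letter such as "A" to that cell's pixel list.
    """
    return max(choice_pixels, key=lambda letter: gray_fraction(choice_pixels[letter]))
```

Anti-aliased text edges would also register as gray, so comparing the fractions across the choices is probably safer than using a fixed cutoff.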
The two tables (the choices for the question and the answer-revealing page with the choices) are always on different pages, and *almost always* on directly consecutive pages when the question fits on one page; a question not fitting on one page is rare.
import glob

from pypdf import PdfReader, PdfWriter

flList = glob.glob('samplepath')
writer = PdfWriter()
questionString = ''
for fl in flList:
    reader = PdfReader(fl)
    print(fl)
    for i_page, page in enumerate(reader.pages):
        txPag = page.extract_text()
        if "Question" in txPag:
            # Question page: collect its text
            questionString += txPag
        elif ("\nB" in txPag) and ("\nC" in txPag):
            # TODO: Answer should be here, but how to extract the highlighted choice and store it independently?
            pass