I have managed to retrieve the page number where the table of contents (TOC) starts in a PDF. This works well if the TOC is exactly one page long, but I can't come up with good logic for a multi-page TOC. For now I have hard-coded a search for "Table of Contents" in the page text. I would also like to know if there is a better way of doing this. This is the code I am working with right now:
    import re

    def get_toc_page(page_text, pg_no):
        # Return (pattern, page number) if the page text contains the heading, else None.
        REG = "TABLE OF CONTENTS"
        if re.search(REG, page_text, re.IGNORECASE):
            return REG, pg_no
        return None
In this function I was going to replace the regex with an expression covering the common ways of writing 'table of contents', so that it matches in most cases. That is still hard-coding, though, so is there a better way to do the same thing?
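To make the idea concrete, here is a minimal sketch of what such a broader pattern might look like. The list of variants ("Table of Contents", "Table of Content", a bare "Contents") is my own guess and certainly not exhaustive, and `has_toc_heading` is just a hypothetical helper name. Requiring the heading to sit on a line of its own is an assumption meant to avoid matching the word "contents" in running text.

```python
import re

# Assumed variants only; extend the alternation for other spellings or languages.
TOC_HEADING = re.compile(
    r"^\s*(?:table\s+of\s+contents?|contents)\s*$",
    re.IGNORECASE | re.MULTILINE,
)

def has_toc_heading(page_text):
    """Return True if some line of the page consists only of a TOC-style heading."""
    return TOC_HEADING.search(page_text) is not None
```

For example, `has_toc_heading("Intro\nTable of Contents\n1 Foo")` is true, while a sentence such as "The contents of this report are confidential." does not match.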
    import glob
    import os

    import fitz  # PyMuPDF
    from tqdm import tqdm

    filelist = glob.glob(path)  # `path` is a glob pattern defined elsewhere
    for doc_path in tqdm(filelist):
        file_name = os.path.basename(doc_path)
        # opening the pdf
        doc = fitz.open(doc_path)
        # print(file_name, ' : ', len(doc))  # number of pages in doc
        # iterating through pages (i is the 0-based page index)
        for i, page in enumerate(doc):
            if i == 10:  # only check the first 10 pages
                break
            try:
                page.wrap_contents()
                page_text = page.get_text("text")
                result = get_toc_page(page_text, i)
                if result is not None:
                    print(file_name, result)
            except Exception as e:
                print(e)
                print('error with:', file_name)
This works decently well with one-page TOCs, but how should I handle multi-page TOCs, where the output would be something like "TOC spans from page 3 to page 7"? Or is there a better approach to this problem?
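One direction I am considering, as a rough sketch only: find the first page with the heading, then keep extending the range while the following pages still "look like" a TOC. Everything here is an assumption on my part: `find_toc_range` is a hypothetical name, a TOC entry is assumed to be a line of text ending in dot leaders or whitespace followed by a page number, and `min_entries=3` is an arbitrary threshold. It works on a plain list of page texts (for example, collected with `page.get_text("text")`), and it would fail if the extraction puts page numbers on separate lines.

```python
import re

# Assumed heading pattern; swap in a broader one if needed.
HEADING = re.compile(r"table\s+of\s+contents", re.IGNORECASE)
# Assumed entry shape: a letter somewhere, then dot leaders or a space/tab,
# then a 1-4 digit page number at the end of the line.
ENTRY = re.compile(r"^.*[A-Za-z].*(?:\.{2,}|[ \t])[ \t]*\d{1,4}[ \t]*$", re.MULTILINE)

def find_toc_range(page_texts, min_entries=3):
    """Return (first, last) 0-based page indices of the TOC, or None if no heading is found."""
    start = None
    for i, text in enumerate(page_texts):
        if HEADING.search(text):
            start = i
            break
    if start is None:
        return None
    end = start
    # Extend the range while each following page has enough entry-like lines.
    for j in range(start + 1, len(page_texts)):
        if len(ENTRY.findall(page_texts[j])) < min_entries:
            break
        end = j
    return start, end
```

With four pages where the heading is on index 1 and index 2 holds only entry-like lines, this returns `(1, 2)`; adding 1 to each value gives human-readable page numbers.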