How to retrieve page numbers of TOC (table of contents) from a PDF in Python

Question

I have managed to retrieve the number of the page where the TOC (table of contents) starts in a PDF. This works well if the TOC is exactly one page long, but I cannot come up with good logic for a PDF with a multi-page TOC. For now I have hard-coded a search for "Table of Contents" in the page text. I would also like to know if there is a better way of doing this. This is the code I am working with right now.

import re

def get_toc_page(page_text, pg_no):
    # Return (pattern, page number) if the page contains the TOC heading,
    # otherwise None.
    REG = "TABLE OF CONTENTS"
    if re.search(REG, page_text, re.IGNORECASE):
        return REG, pg_no
    return None

In the function above, I was going to replace the regex with an expression that covers all the ways 'table of contents' can be written, so that it matches in most cases. But that is still hard-coding, so is there a better way to do the same thing?
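For illustration, here is a minimal sketch of what such a broader expression could look like. The pattern and the helper name `has_toc_heading` are hypothetical, not part of my current code. It assumes the heading sits on a line of its own, which avoids matching the word "contents" in the middle of a sentence.

```python
import re

# Hypothetical pattern: accepts "Table of Contents", "Table of Content"
# and a bare "Contents", but only when the heading fills a whole line.
TOC_HEADING = re.compile(
    r"^\s*(?:table\s+of\s+contents?|contents)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def has_toc_heading(page_text):
    """Return True if any line of the page text is a TOC heading."""
    return TOC_HEADING.search(page_text) is not None
```

This is still a hard-coded list of spellings, so headings in other languages or unusual wordings would need to be added by hand.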

import glob
import os

import fitz  # PyMuPDF
from tqdm import tqdm

filelist = glob.glob(path)  # path: glob pattern matching the PDF files

for doc_path in tqdm(filelist):
    file_name = os.path.basename(doc_path)

    # open the PDF
    doc = fitz.open(doc_path)

    # print(file_name, ' : ', len(doc))  # number of pages in doc

    # iterate through the pages (i is the 0-based page index)
    for i, page in enumerate(doc):
        if i == 10:  # only check the first 10 pages
            break

        try:
            page.wrap_contents()
            page_text = page.get_text("text")
            result = get_toc_page(page_text, i)
            if result is not None:
                print(file_name, result)
        except Exception as e:
            print(e)
            print('error with:', file_name)

This works decently well with one-page TOCs, but how should I handle multi-page TOCs, where the output should be something like "TOC spans from page 3 to page 7"? Or is there a better approach to this problem?
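To make the multi-page case concrete, below is a rough sketch of one heuristic I could imagine: find the first page with a TOC heading, then keep extending the range while the following pages consist mostly of lines that end in a page number. Everything here (the names `looks_like_toc` and `find_toc_range`, the patterns, the thresholds) is an assumption for illustration, and it only works if the extracted text keeps each TOC entry on a single line with dot leaders or wide spacing before the number.

```python
import re

# Assumed heading spellings, each on a line of its own.
HEADING = re.compile(
    r"^\s*(?:table\s+of\s+contents?|contents)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
# Assumed entry shape: text, then dot leaders or 2+ spaces, then a page number.
ENTRY = re.compile(r"^.*\S(?:\s*\.{2,}\s*|\s{2,})\d{1,4}\s*$")


def looks_like_toc(page_text, min_entries=3, min_ratio=0.5):
    """Guess whether a page is a TOC continuation from its line shapes."""
    lines = [ln for ln in page_text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if ENTRY.match(ln))
    return hits >= min_entries and hits / len(lines) >= min_ratio


def find_toc_range(page_texts, max_pages=10):
    """Return (first, last) 0-based TOC page indices, or None if no heading."""
    start = None
    for i, text in enumerate(page_texts[:max_pages]):
        if HEADING.search(text):
            start = i
            break
    if start is None:
        return None
    end = start
    # Extend the range while the next page still looks like TOC entries.
    for j in range(start + 1, len(page_texts)):
        if not looks_like_toc(page_texts[j]):
            break
        end = j
    return start, end
```

With PyMuPDF, `page_texts` could be built as `[page.get_text("text") for page in doc]`. The thresholds are guesses: a short final TOC page with fewer than `min_entries` entries would be missed, and a table of figures right after the TOC would probably be counted as part of it.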

You are obviously talking about standard text inside pages. Are you aware of PyMuPDF's `doc.get_toc()` method? Its output should always take priority for getting TOC information. — Jorj McKie, Aug 25 '23 at 10:33
@JorjMcKie Yes, I am aware of `doc.get_toc()`. I am trying to solve this for the cases where it doesn't perform well. For example, when I get this particular output: [[1, '空白页面', 38], [1, '空白页面', 6]] — Harshal Naik, Aug 28 '23 at 07:24
@HarshalNaik But what is the problem with that example? Looks good to me. — Jorj McKie, Aug 29 '23 at 10:47
