I am trying to extract the hyperlinks on each page of a PDF, together with their anchor text, using the PyMuPDF library. I can extract the hyperlinks and their page numbers, but I have not been able to extract the anchor text for each hyperlink.

Can anyone help me?

Here is the code:

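import fitz  # PyMuPDF

result = []

# `file` is the path to the PDF (not defined in this snippet)
with fitz.open(file) as doc:
    # count from 1 so page_no is the human-readable page number
    for page_no, page in enumerate(doc, start=1):
        for link in page.links():
            # keep only links that carry a URI
            if "uri" in link:
                result.append([page_no, link["uri"]])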

Thanks!

  • Please provide enough code so others can better understand or reproduce the problem. – Community Oct 03 '22 at 11:10
  • Thank you KJ for your response! So it is not possible to extract the text attached to a specific link? That is what my use case requires: extracting all the text together with its associated links from PDF files. – gagan lohar Oct 05 '22 at 10:11
  • In my use case I have to create a dataframe with 3 columns (page_no, text_name, links) whenever I pass a PDF into my code. With the code above I can fetch page_no and links, but I have no idea how to extract the text_name associated with those links. – gagan lohar Oct 05 '22 at 10:15

1 Answer

You can extract the text within the link's "hot area", link["from"], like this: text = page.get_textbox(link["from"]).

If you need more detail about the text (e.g. color, font, ...), any of the page.get_text() variants can be used with the clip parameter. For example, page.get_text("dict", clip=link["from"]) delivers a dictionary of the text under the link rectangle with font name, font size, font color and more.
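A minimal sketch of how this could be combined with the loop from the question to get one row per link. The function names `link_texts_on_page` and `collect_link_rows` and the `path` argument are made up for illustration; only `page.links()`, `link["from"]`, `link["uri"]` and `page.get_textbox()` come from the posts here.

```python
def link_texts_on_page(page):
    """Return (anchor_text, uri) pairs for the URI links on one page."""
    pairs = []
    for link in page.links():
        if "uri" not in link:
            continue  # skip links that carry no URI
        # link["from"] is the link's "hot area" rectangle on the page
        text = page.get_textbox(link["from"]).strip()
        pairs.append((text, link["uri"]))
    return pairs


def collect_link_rows(path):
    """Return [page_no, anchor_text, uri] rows for a whole PDF."""
    # PyMuPDF is imported here so the helper above works on any page-like object
    import fitz

    rows = []
    with fitz.open(path) as doc:
        for page_no, page in enumerate(doc, start=1):
            for text, uri in link_texts_on_page(page):
                rows.append([page_no, text, uri])
    return rows
```

The rows follow the three columns mentioned in the comments (page_no, text_name, links), so they could be handed to a dataframe constructor as they are.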

Jorj McKie
  • Thank you so much Jorj for your solution. After using your code I can extract the 'from' values like this: link 1 => Rect(156.47000122070312, 258.22998046875, 202.99000549316406, 270.3800048828125), link 2 => Rect(209.63999938964844, 258.22998046875, 256.1600036621094, 270.3800048828125). But once I have the coordinates for those links, how can I extract the text? – gagan lohar Oct 10 '22 at 13:07
  • Not sure if I understand: just use `page.get_textbox(rect)` for every link rectangle `rect` of the links you are interested in. – Jorj McKie Oct 11 '22 at 14:18
  • Thanks Jorj. I went through your GitHub repository, https://github.com/pymupdf/PyMuPDF-Utilities/tree/master/textbox-extraction, which was really helpful. You suggested using `page.get_textbox(link["from"])` to extract the link text. As I mentioned earlier, I am getting the rect values, but `page.get_textbox(rect)` returns an empty string for them. When I use `page.get_text("words")` as in your repository instead, I get coordinates for every word, and it is really hard to find which words belong to a given link. – gagan lohar Oct 14 '22 at 14:13
  • And `page.first_annot.rect` throws an error: `'NoneType' object has no attribute 'rect'` – gagan lohar Oct 14 '22 at 14:21
  • Links in (Py)MuPDF do not count as annots, although technically (per the PDF spec) they are in fact annotations. The reason is that MuPDF wants to provide 3 separate chains: one for links, one for annotations, one for fields. So your `.first_annot.rect` will never return a link. – Jorj McKie Oct 15 '22 at 15:08
  • If you see no text when you extract from the link rect, the reason may be that the rectangle is technically too small; mostly the height is the problem. This is the fault of the software that created the link. Solve this by enlarging the extraction rect somewhat, for example take `link["from"] + (-5,-5,5,5)`, which is a rectangle 5 points larger in every direction. – Jorj McKie Oct 15 '22 at 15:13
  • Thanks a ton **Jorj**....it helped...issue is resolved. – gagan lohar Oct 18 '22 at 07:47
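
For reference, the enlarged-rectangle fix from the comments could be sketched roughly as below. The helper names `padded` and `link_text` and the `pad` parameter are made up for illustration. The sketch also assumes that `page.get_textbox()` accepts a plain 4-tuple as a rect-like argument, so the padding is done with ordinary arithmetic instead of `Rect` addition.

```python
def padded(rect, pad=5):
    """Return rect grown by `pad` points on every side, as a plain tuple."""
    x0, y0, x1, y1 = rect
    return (x0 - pad, y0 - pad, x1 + pad, y1 + pad)


def link_text(page, link, pad=5):
    """Text under a link; retry with a larger rectangle if nothing is found."""
    # first try the link's own "hot area"
    text = page.get_textbox(link["from"]).strip()
    if not text:
        # the stored rectangle may be too small (often too low), so enlarge it
        text = page.get_textbox(padded(link["from"], pad)).strip()
    return text
```

Trying the exact rectangle first keeps the result tight when the link is well formed; a larger `pad` may also pick up neighbouring text, so the value probably needs tuning per document.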