
I have some PDFs containing hyperlinks, both URLs and mailto links. Is there any way or tool (possibly third-party) to extract the hyperlink meta information from the PDF, such as coordinates, link type, and destination address? Any help is highly appreciated.

I have already tried iText and PDFBox without much success, and even some third-party software does not give me the desired output.

I have tried the following code in Java using iText:

        // Open the document and read the annotations of the first page only
        PdfReader myReader = new PdfReader("pdf File Path");
        PdfDictionary pageDict = myReader.getPageN(1);
        PdfArray annots = pageDict.getAsArray(PdfName.ANNOTS);
        System.out.println(annots);
        ArrayList<String> dests = new ArrayList<String>();
        if(annots != null)
        {
            for(int i=0; i<annots.size(); ++i)
            {
                PdfDictionary annotDict = annots.getAsDict(i);
                PdfName subType = annotDict.getAsName(PdfName.SUBTYPE);
                // Keep only link annotations
                if (subType != null && PdfName.LINK.equals(subType))
                {
                    PdfDictionary action = annotDict.getAsDict(PdfName.A);
                    // Keep only URI actions (this covers both web and mailto links)
                    if(action != null && PdfName.URI.equals(action.getAsName(PdfName.S)))
                    {
                        dests.add(action.getAsString(PdfName.URI).toString());
                    } // else { it's an internal link }
                }
            }
        }
        // Print the collected destination addresses
        System.out.println(dests);
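
For the link type and coordinates asked about above, a minimal sketch using only the JDK might look like the following. It assumes the destination strings have already been collected (as in `dests`) and that the four numbers of an annotation's `/Rect` entry have already been read as plain doubles. The class and method names (`LinkInfoHelper`, `classify`, `normalizeRect`) are made up for illustration and are not part of iText or PDFBox.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

public class LinkInfoHelper {

    /** Returns "mailto", "url", or "unknown" based on the scheme of the destination. */
    static String classify(String destination) {
        try {
            String scheme = new URI(destination.trim()).getScheme();
            if (scheme == null) {
                return "unknown"; // no scheme at all, e.g. "www.example.com"
            }
            return scheme.equalsIgnoreCase("mailto") ? "mailto" : "url";
        } catch (URISyntaxException e) {
            return "unknown"; // not parseable as a URI
        }
    }

    /**
     * Turns the four /Rect numbers (two diagonally opposite corners)
     * into {x, y, width, height} with x/y at the lower-left corner.
     */
    static double[] normalizeRect(double x1, double y1, double x2, double y2) {
        return new double[] {
            Math.min(x1, x2), Math.min(y1, y2),
            Math.abs(x2 - x1), Math.abs(y2 - y1)
        };
    }

    public static void main(String[] args) {
        // Hypothetical sample values standing in for data read from a PDF
        List<String> dests = List.of("http://example.com/page", "mailto:someone@example.com");
        for (String dest : dests) {
            System.out.println(classify(dest) + " -> " + dest);
        }
        System.out.println(Arrays.toString(normalizeRect(300, 700, 100, 650)));
    }
}
```

The rectangle values are in PDF user space units, with the origin normally at the lower-left corner of the page, so they may need flipping if a top-left origin is wanted.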
Tech Enthusiast
  • The example in @Bobrovsky's answer searches for link annotations using Docotic, and a search for link annotations using iText or PDFBox would be designed similarly. So, are you sure those links in your document are indeed link annotations? For example, Adobe Reader has an option that makes addresses in the content clickable as if they were link annotations when they are not. Maybe such a feature has made you believe there are link annotations when there actually are none. (BTW, you might want to supply the code you've tried; maybe it is erroneous.) – mkl Apr 25 '14 at 07:01
  • Thanks a lot, mkl, you have solved it. Actually my code is working fine; it was a feature of Adobe Reader that was creating a hover link. Can you point me to the Adobe specs for this feature so that I can check it? – Tech Enthusiast Apr 25 '14 at 10:17
  • Adobe Reader simply searches the page content for what it considers URLs and makes them interactive. You can switch this behavior on and off in the preferences. I don't know which *specs* to provide. – mkl Apr 25 '14 at 10:33
  • Cheers, I found it under Edit > Preferences > General and unchecked the option "Create links from Url". Again, lots of kudos for your help. – Tech Enthusiast Apr 25 '14 at 10:40
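
To illustrate the distinction discussed in these comments, here is a rough sketch of how a viewer-style scan of page text for URL-like strings could work. It assumes the page text has already been extracted as a plain `String` (for example with the text extraction facilities of iText or PDFBox). The class name `TextUrlScanner` and the pattern are illustrative assumptions only; Adobe Reader's actual heuristics are not known here. Anything found this way that has no matching link annotation is just text, not a real hyperlink in the PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextUrlScanner {

    // Deliberately simple, illustrative pattern: text starting with
    // http://, https://, www. or mailto: and running up to the next
    // whitespace, angle bracket, or double quote.
    private static final Pattern URL_LIKE = Pattern.compile(
            "(?i)\\b(?:https?://|www\\.|mailto:)[^\\s<>\"]+");

    /** Returns every URL-like substring found in the given page text, in order. */
    static List<String> findUrlLikeText(String pageText) {
        List<String> hits = new ArrayList<>();
        Matcher m = URL_LIKE.matcher(pageText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical page text standing in for text extracted from a PDF
        String pageText = "Visit http://example.com or write to mailto:info@example.com today";
        for (String hit : findUrlLikeText(pageText)) {
            System.out.println(hit);
        }
    }
}
```

Note that this simple pattern also captures trailing punctuation directly attached to an address, so real-world text would need extra trimming.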

3 Answers


If your PDFs are copy-protected, start with step 1; if they are free to copy, you can start with step 2.

Step 1: Convert your PDFs into Word .doc files using Adobe Acrobat Pro or an online PDF-to-Word converter:

http://www.pdfonline.com/pdf2word/index.asp

Step 2: Copy and paste the whole document into the input window here (you can also download the lightweight HTML tool):

http://www.surf7.net/services/value-added-services/free-web-tools/email-extractor-lite/

Select 'url' as 'Type of address to extract', select your separator, hit extract, and that's it.

Hope it works. Cheers.

Ankur Dubey
  • I have tried Acrobat Pro, but it simply fails to do so in some instances. Also, how can I capture the x and y coordinates where the hyperlink is embedded in the PDF? – Tech Enthusiast Apr 24 '14 at 14:47

You can use the Docotic.Pdf library to extract links (disclaimer: I work for the company).

Below is code that opens the specified file, finds all hyperlinks, collects information about the position of each link, and draws a rectangle around each link.

After that, the code creates a new PDF (with the links in rectangles) and a text file with the collected information. In the end, both created files are opened in their default viewers.

public static void ListAndHighlightLinks(string inputFile, string outputFile, string outputTxt)
{
    using (PdfDocument doc = new PdfDocument(inputFile))
    {
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < doc.Pages.Count; i++)
        {
            PdfPage page = doc.Pages[i];
            foreach (PdfWidget widget in page.Widgets)
            {
                PdfActionArea actionArea = widget as PdfActionArea;
                if (actionArea == null)
                    continue;

                PdfUriAction linkAction = actionArea.Action as PdfUriAction;
                if (linkAction == null)
                    continue;

                Uri url = linkAction.Uri;
                PdfRectangle rect = actionArea.BoundingBox;

                // add information about found link into string buffer
                sb.Append("Page ");
                sb.Append(i.ToString());
                sb.Append(" : ");
                sb.Append(rect.ToString());
                sb.Append(" ");
                sb.AppendLine(url.ToString());

                // draw rectangle around found link
                page.Canvas.DrawRectangle(rect);
            }
        }

        // save document with highlighted links and text information about links to files
        doc.Save(outputFile);
        System.IO.File.WriteAllText(outputTxt, sb.ToString());

        // open created PDF and text file in default viewers
        System.Diagnostics.Process.Start(outputTxt);
        System.Diagnostics.Process.Start(outputFile);
    }
}

You can use the sample code with a call like this:

ListAndHighlightLinks("input.pdf", "output.pdf", "links.txt");
Bobrovsky

One possibility would be to use a custom JavaScript in Acrobat that enumerates the "words" on the page and then reads out their quads. From those you get the coordinates to create a link (or to compare with the links on the page), as well as the actual text (that's the "word(s)").

If it is "only" a matter of setting the border of the existing links, you can also write another Acrobat JavaScript that enumerates the links of the document and sets their border color property (you may need to set the width as well).

(If you prefer "buy" over "make", feel free to contact me in private; such things are part of my standard "repertoire".)

Max Wyss