
I have a PDF file and I want to verify that the links in it are proper. Proper in the sense that all specified URLs are linked to web pages and nothing is broken. I am looking for a simple utility or a script which can do this easily.

Example:

$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
The remaining broken links, and the page numbers on which they appear, are logged in brokenlinks.txt

I have no idea whether something like that exists, so I googled and also searched Stack Overflow, but have not found anything useful yet. I would like to know if anyone has any idea about it.

Update: edited to make the question clearer.

user379997

5 Answers


You can use pdf-link-checker

pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.

To install it with pip:

pip install pdf-link-checker

Unfortunately, one dependency (pdfminer) is broken. To fix it:

pip uninstall pdfminer
pip install pdfminer==20110515
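
Once installed (and with the pdfminer dependency pinned as above), usage should be along these lines; this is my assumption rather than something from the answer, so check the tool's own help output for the exact interface:

pdf-link-checker my.pdf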
Federico

I suggest first using the Linux command-line utility pdftotext; you can find its man page here:

pdftotext man page

The utility is part of the Xpdf collection of PDF processing tools, available on most Linux distributions. See http://foolabs.com/xpdf/download.html.

Once installed, you could process the PDF file through pdftotext:

pdftotext file.pdf file.txt

Once processed, a simple Perl script can search the resulting text file for HTTP URLs and retrieve them using LWP::Simple. LWP::Simple->get('http://...') lets you validate each URL with a code snippet such as:

use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;

That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:

m/http[^\s]+/i

"http followed by one or more non-space characters", assuming the URLs are properly URL-encoded.

Jason Buberel

There are two lines of enquiry with your question.

Are you looking for regex verification that each link contains key information such as http:// and a valid TLD? If so, I'm sure a regex expert will drop by, or have a look at regexlib.com, which contains lots of existing regexes for dealing with URLs.

Or, if you want to verify that each website exists, I would recommend Python + Requests, as you could script checks to see whether websites exist and don't return error codes.

It's a task I'm currently undertaking for pretty much the same purpose at work; we have about 54k links to process automatically.
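
A minimal sketch of that Requests-based approach (the helper name, the timeout, and the example URLs are my own assumptions rather than part of this answer; it expects the URLs to have already been extracted from the PDF):

# sketch: check a list of URLs with Requests and collect the ones that fail
import requests

def check_urls(urls, timeout=10):
    broken = []
    for url in urls:
        try:
            # HEAD keeps traffic small; fall back to GET for servers that reject HEAD
            response = requests.head(url, allow_redirects=True, timeout=timeout)
            if response.status_code >= 400:
                response = requests.get(url, allow_redirects=True, timeout=timeout)
            if response.status_code >= 400:
                broken.append((url, response.status_code))
        except requests.RequestException as error:
            broken.append((url, str(error)))
    return broken

if __name__ == "__main__":
    for url, reason in check_urls(["http://www.example.com/", "http://www.example.com/missing"]):
        print(url, reason)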

Peter Brooks
  • My question is about verifying that the links are not broken! Thanks, I've updated the question accordingly. – user379997 Nov 12 '11 at 11:09
  • Are broken links defined as incorrect http syntax or HTTP errors when you reach them? – Peter Brooks Nov 12 '11 at 11:48
  • Can you give examples of such page errors? I would definitely recommend using Requests and seeing what logic you can apply to the HTTP response. – Peter Brooks Nov 12 '11 at 12:13

The advice in https://stackoverflow.com/a/42178474/1587329 was the inspiration for writing this simple tool (see gist):

'''loads the PDF in sys.argv[1], extracts its URLs, and tries to load each URL'''
import sys
import urllib.request

import PyPDF2

# credits to stackoverflow.com/questions/27744210
def extract_urls(filename):
    '''extracts all URLs from the link annotations in filename'''
    with open(filename, 'rb') as pdf_file:
        pdf = PyPDF2.PdfFileReader(pdf_file)

        key = '/Annots'
        uri = '/URI'
        ank = '/A'

        for page in range(pdf.getNumPages()):
            page_object = pdf.getPage(page).getObject()
            if key in page_object:
                for annotation in page_object[key]:
                    annotation_object = annotation.getObject()
                    # only annotations with an action dictionary carry a URI
                    if ank in annotation_object and uri in annotation_object[ank]:
                        yield annotation_object[ank][uri]


def check_http_url(url):
    '''raises an exception if the URL cannot be fetched'''
    urllib.request.urlopen(url)


if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)

Save it as filename.py and run it as python3 filename.py pdfname.pdf.

serv-inc

  1. Collect the links by:
    enumerating the link annotations using a PDF library's API, or dumping the document as text and linkifying the result, or saving it as HTML with PDFMiner (see the sketch below for the text-dump option).

  2. Make requests to check them:
    there is a plethora of options, depending on your needs.
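
As a loose illustration of the text-dump route in step 1 (this sketch is mine, not the answerer's; it assumes the pdftotext utility mentioned in an earlier answer is installed and reuses the same kind of URL regex):

# sketch only: dump the PDF as text with pdftotext, then pull out anything that looks like a URL
import re
import subprocess
import sys

def urls_from_text_dump(pdf_path):
    # "-" tells pdftotext to write the extracted text to stdout
    text = subprocess.run(["pdftotext", pdf_path, "-"],
                          capture_output=True, text=True, check=True).stdout
    return re.findall(r'https?://\S+', text)

if __name__ == "__main__":
    for url in urls_from_text_dump(sys.argv[1]):
        print(url)  # feed these into whichever checker you prefer, e.g. the Requests sketch above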

jfs