
I want to download hundreds of PDF documents from a site. I have tried tools such as SiteSucker and similar, but they do not work, because there appears to be some "separation" between the files and the page that links to them. I don't know how to describe this better, since I don't know much about website programming or scraping. Any advice on what this could be and how one can circumvent it?

More specifically, I am trying to download PDFs of UN resolutions, stored on pages like this one: http://www.un.org/depts/dhl/resguide/r53_en.shtml

It appears there is a built-in "search function" on the UN site, which makes simple scraping tools like SiteSucker fail to work as intended.

Are there other tools that I can use?

Magnus

1 Answer


Clicking a link on the page you mentioned redirects to a page composed of two HTML frames. The first one is the "header" and the second one loads a page that generates the PDF file and embeds it inline. The URL of the PDF file is hard to guess. I don't know of a free tool that could scrape this type of page.

Here is an example of the URL in the second frame that leads to the PDF file:

http://daccess-dds-ny.un.org/doc/UNDOC/GEN/N99/774/43/PDF/N9977443.pdf?OpenElement
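If you are comfortable with a small script, the frame structure described above can be followed programmatically. Below is a minimal Python sketch (using requests and BeautifulSoup) that, given one of the per-resolution pages, walks the frames and returns the first URL that points at a PDF. The tag names and attribute patterns are assumptions based on the description above, not the actual markup, so expect to adjust them after inspecting the pages in a browser.

```python
# Minimal sketch: follow the frameset described above until a PDF URL turns up.
# Assumes one of the <frame> elements loads a page that links to or embeds the
# PDF; the selectors and ".pdf" pattern are guesses and may need adjusting.
import urllib.parse

import requests
from bs4 import BeautifulSoup


def find_pdf_url(resolution_page_url):
    """Follow frames from a resolution page and return the first PDF URL found."""
    resp = requests.get(resolution_page_url)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Look at every frame on the page, not just the second one.
    for frame in soup.find_all("frame"):
        src = frame.get("src")
        if not src:
            continue
        frame_url = urllib.parse.urljoin(resolution_page_url, src)
        frame_soup = BeautifulSoup(requests.get(frame_url).text, "html.parser")

        # Search the frame's content for anything pointing at a PDF.
        for tag in frame_soup.find_all(["a", "embed", "iframe", "object"]):
            target = tag.get("href") or tag.get("src") or tag.get("data")
            if target and ".pdf" in target.lower():
                return urllib.parse.urljoin(frame_url, target)
    return None
```

If that locates the URLs, downloading each file is just a matter of writing requests.get(pdf_url).content to disk, and looping over the resolution links on the index page (r53_en.shtml) would cover a whole session's worth of documents.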

Grégory Vorbe
  • Thanks for this. Would you know of a non-free tool that can do the job? – Magnus Sep 30 '14 at 10:40
  • A similar question: another site contains the same resolutions, but without the double layer, e.g. for one year: http://www.worldlii.org/int/other/UNGARsn/1952/ However, there is a robots.txt block. Is there any way to get around this kind of thing? – Magnus Sep 30 '14 at 12:03