
I want to download hundreds of PDF documents from a site. I have tried tools such as SiteSucker and similar, but they do not work, because there appears to be some "separation" between the files and the page that links to them. I don't know how to describe this better, since I don't know much about website programming or scraping. Any advice on what this could be and how one can circumvent it?

More specifically, I am trying to download PDFs of UN resolutions, stored on pages like this one: http://www.un.org/depts/dhl/resguide/r53_en.shtml

It appears there is a built-in "search function" on the UN site, which makes simple scraping tools like SiteSucker fail to work as intended.

Are there other tools that I can use?

Magnus

1 Answer


Clicking a link on the page you mentioned redirects to a page composed of two HTML frames. The first one is the "header" and the second one loads a page that generates the PDF file and embeds it inline. The URL of the PDF file is hard to guess. I don't know of a free tool that could scrape this type of page.

Here is an example of the URL in the second frame that leads to the PDF file:

http://daccess-dds-ny.un.org/doc/UNDOC/GEN/N99/774/43/PDF/N9977443.pdf?OpenElement
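If you are comfortable with a small script, the frame structure described above can be followed programmatically. Below is a minimal Python sketch (using requests and BeautifulSoup) that, given one of the per-resolution pages, walks the frames and returns the first URL that points at a PDF. The tag names and attribute patterns are assumptions based on the description above, not the actual markup, so expect to adjust them after inspecting the pages in a browser.

```python
# Minimal sketch: follow the frameset described above until a PDF URL turns up.
# Assumes one of the <frame> elements loads a page that links to or embeds the
# PDF; the selectors and ".pdf" pattern are guesses and may need adjusting.
import urllib.parse

import requests
from bs4 import BeautifulSoup


def find_pdf_url(resolution_page_url):
    """Follow frames from a resolution page and return the first PDF URL found."""
    resp = requests.get(resolution_page_url)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Look at every frame on the page, not just the second one.
    for frame in soup.find_all("frame"):
        src = frame.get("src")
        if not src:
            continue
        frame_url = urllib.parse.urljoin(resolution_page_url, src)
        frame_soup = BeautifulSoup(requests.get(frame_url).text, "html.parser")

        # Search the frame's content for anything pointing at a PDF.
        for tag in frame_soup.find_all(["a", "embed", "iframe", "object"]):
            target = tag.get("href") or tag.get("src") or tag.get("data")
            if target and ".pdf" in target.lower():
                return urllib.parse.urljoin(frame_url, target)
    return None
```

If that locates the URLs, downloading each file is just a matter of writing requests.get(pdf_url).content to disk, and looping over the resolution links on the index page (r53_en.shtml) would cover a whole session's worth of documents.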

Grégory Vorbe
  • Thanks for this. Would you know of a non-free tool that can do the job? – Magnus Sep 30 '14 at 10:40
  • A similar question: another site contains the same resolutions, but without the double layer, e.g. for one year: http://www.worldlii.org/int/other/UNGARsn/1952/ However, there is a robots.txt block. Is there any way to get around this kind of thing? – Magnus Sep 30 '14 at 12:03