I'm learning C# by writing a small program, and I couldn't find a similar post (apologies if this question has already been answered somewhere else).
How might I go about screen-scraping a website for links to PDFs (which I can then download to a specified location)? Sometimes a page links to another HTML page that contains the actual PDF link, so if a PDF can't be found on the first page, I'd like the program to automatically look for a link that has "PDF" in its text and then search the resulting HTML page for the real PDF link.
I know I could probably achieve something similar with a filetype: search on Google, but that seems like "cheating" to me :) I'd rather learn how to do it in code, but I'm not sure where to start. I'm a little familiar with XML parsing using XElement and such, but I'm not sure how to apply that to getting links out of an HTML page (or some other format?).
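For what it's worth, here's a rough sketch of the kind of thing I'm imagining, using the third-party HtmlAgilityPack library (which I've only read about, so treat the example URL, the save folder, and the overall approach as assumptions on my part rather than tested code):

    using System;
    using System.IO;
    using System.Net;
    using HtmlAgilityPack; // third-party HTML parser, installed via NuGet

    class PdfLinkScraper
    {
        static void Main()
        {
            // Placeholder values -- these would come from user input in the real program.
            string startUrl = "http://example.com/documents.html";
            string saveFolder = @"C:\Temp\Pdfs";

            ScrapePage(startUrl, saveFolder, followLinks: true);
        }

        static void ScrapePage(string pageUrl, string saveFolder, bool followLinks)
        {
            var doc = new HtmlWeb().Load(pageUrl);
            var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
            if (anchors == null) return; // page has no links at all

            foreach (HtmlNode anchor in anchors)
            {
                string href = anchor.GetAttributeValue("href", "");
                var absolute = new Uri(new Uri(pageUrl), href); // resolve relative URLs

                if (absolute.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
                {
                    // Direct PDF link: download it to the target folder.
                    string fileName = Path.GetFileName(absolute.AbsolutePath);
                    using (var client = new WebClient())
                        client.DownloadFile(absolute, Path.Combine(saveFolder, fileName));
                }
                else if (followLinks &&
                         anchor.InnerText.IndexOf("PDF", StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    // Link whose text mentions "PDF": search that page too (one level deep).
                    ScrapePage(absolute.ToString(), saveFolder, followLinks: false);
                }
            }
        }
    }

The followLinks flag is just my attempt at keeping the second-page search one level deep so it doesn't end up crawling the whole site, but I don't know if that's the right way to structure it.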
Could anyone point me in the right direction? Thanks!