I'm trying to build an ASP.NET page that can crawl web pages and display them correctly, with all relevant HTML elements edited to use absolute URLs where appropriate.

This question has been partially answered here: https://stackoverflow.com/a/2719712/696638

Using a combination of the answer above and this blog post, http://blog.abodit.com/2010/03/a-simple-web-crawler-in-c-using-htmlagilitypack/, I have built the following:

using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public partial class Crawler : System.Web.UI.Page {
    protected void Page_Load(object sender, EventArgs e) {
        Response.Clear();

        string url = Request.QueryString["path"];

        // Download the page; the bytes are assumed to be UTF-8 encoded
        string sourceHTML;
        using (WebClient client = new WebClient()) {
            byte[] requestHTML = client.DownloadData(url);
            sourceHTML = Encoding.UTF8.GetString(requestHTML);
        }

        HtmlDocument htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(sourceHTML);

        // SelectNodes returns null, not an empty collection, when nothing matches
        HtmlNodeCollection links = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null) {
            foreach (HtmlNode link in links) {
                HtmlAttribute att = link.Attributes["href"];
                string href = att.Value;
                if (string.IsNullOrEmpty(href)) continue;

                // ignore javascript on buttons using a tags
                if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase)) continue;

                // resolve relative links against the URL of the crawled page
                Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);
                if (!urlNext.IsAbsoluteUri) {
                    urlNext = new Uri(new Uri(url), urlNext);
                    att.Value = urlNext.ToString();
                }
            }
        }

        Response.Write(htmlDoc.DocumentNode.OuterHtml);
    }
}

This only rewrites the href attribute of <a> elements. I'd like to extend it, and I'd like to know the most efficient way to cover:

  • href attribute for <a> elements
  • href attribute for <link> elements
  • src attribute for <script> elements
  • src attribute for <img> elements
  • action attribute for <form> elements

Are there any others people can think of?

Could these be found using a single call to SelectNodes with a monster XPath, or would it be more efficient to call SelectNodes multiple times and iterate through each collection?

Red Taz

1 Answer

The following should work:

SelectNodes("//*[@href or @src or @action]")

and then you'd have to adapt the if statement inside the loop, since a matched node may carry an href, src or action attribute rather than always href.
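As a rough illustration of that adaptation, here is a minimal sketch that checks each of the three attributes on every matched node. The class name `UrlRewriter`, the method `MakeUrlsAbsolute` and the `UrlAttributes` array are made up for this example and are not part of HtmlAgilityPack. It is written with the .NET Framework setup from the question in mind and has not been tested against real pages.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class UrlRewriter
{
    // Attributes assumed to carry URLs; extend as needed.
    private static readonly string[] UrlAttributes = { "href", "src", "action" };

    public static string MakeUrlsAbsolute(string html, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//*[@href or @src or @action]");
        if (nodes == null) return html;

        foreach (HtmlNode node in nodes)
        {
            foreach (string name in UrlAttributes)
            {
                // The indexer yields null when the attribute is absent.
                HtmlAttribute att = node.Attributes[name];
                if (att == null || string.IsNullOrEmpty(att.Value)) continue;

                // Same javascript check as in the question's code.
                if (att.Value.StartsWith("javascript", StringComparison.OrdinalIgnoreCase))
                    continue;

                // Skip values that cannot be parsed; leave absolute ones untouched.
                Uri parsed;
                if (!Uri.TryCreate(att.Value, UriKind.RelativeOrAbsolute, out parsed)) continue;
                if (parsed.IsAbsoluteUri) continue;

                att.Value = new Uri(baseUri, parsed).ToString();
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

In the page from the question, this could be called as `Response.Write(UrlRewriter.MakeUrlsAbsolute(sourceHTML, new Uri(url)));`.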

Digbyswift
  • Thanks, had to change it to `SelectNodes("//*[@href or @src or @action]")` for it to select anything. Is this the most efficient solution? – Red Taz Jan 05 '12 at 12:49
  • Sorry, that's what I meant, oops. The efficiency will depend on certain factors like the size and the structure of the documents. If you know there are specific sections of a document that do not have any links, then you can work these into your xpath or even break the xpath into small queries. – Digbyswift Jan 05 '12 at 13:26
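For comparison, the "small queries" idea from the comment above might look something like the following sketch, which runs one narrow XPath per element/attribute pair from the question's list. `TargetedUrlRewriter` and `Targets` are hypothetical names chosen for illustration. Whether this is faster than the single wildcard query is not something the sketch settles; it would need measuring on representative pages.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TargetedUrlRewriter
{
    // Element/attribute pairs taken from the list in the question.
    private static readonly KeyValuePair<string, string>[] Targets =
    {
        new KeyValuePair<string, string>("a", "href"),
        new KeyValuePair<string, string>("link", "href"),
        new KeyValuePair<string, string>("script", "src"),
        new KeyValuePair<string, string>("img", "src"),
        new KeyValuePair<string, string>("form", "action"),
    };

    public static void MakeUrlsAbsolute(HtmlDocument doc, Uri baseUri)
    {
        foreach (KeyValuePair<string, string> target in Targets)
        {
            // Builds e.g. "//img[@src]" for the pair ("img", "src").
            string xpath = "//" + target.Key + "[@" + target.Value + "]";

            // SelectNodes returns null when the query matches nothing.
            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
            if (nodes == null) continue;

            foreach (HtmlNode node in nodes)
            {
                HtmlAttribute att = node.Attributes[target.Value];
                if (string.IsNullOrEmpty(att.Value)) continue;

                // Skip unparseable values; leave absolute ones untouched.
                Uri parsed;
                if (!Uri.TryCreate(att.Value, UriKind.RelativeOrAbsolute, out parsed)) continue;
                if (parsed.IsAbsoluteUri) continue;

                att.Value = new Uri(baseUri, parsed).ToString();
            }
        }
    }
}
```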