
I am trying to pull a list of all href links from an HTML document. These links will then be fed into a System.Net.HttpWebRequest to fetch the HTML documents for those pages. Essentially, I am making a crawler.

I use a regex to pull the list of links from the page: `href="(.*?)"`
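For reference, the extraction step might look roughly like the sketch below. The class and method names (`LinkExtractor`, `ExtractHrefs`) are made up for illustration, and the pattern only matches double-quoted attribute values, so single-quoted or unquoted hrefs would be missed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class LinkExtractor
{
    // The pattern from the question; the lazy group captures the attribute value.
    private static readonly Regex HrefPattern =
        new Regex("href=\"(.*?)\"", RegexOptions.IgnoreCase);

    // Returns the raw href values in document order, without any normalization.
    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }
}
```

The values come back exactly as written in the page, so relative paths, fragments and `mailto:` links are all still mixed in at this point.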

Problems arise when the links pulled from the page aren't strictly of the form "http://www.example.com". The various types of links I pull from the HTML document look something like this (fictional examples):

I need a way to normalize all these various types of links that I get into the format that HttpWebRequest accepts.
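One possible direction, sketched under assumptions: .NET's `System.Uri` can resolve an href against the URL of the page it was found on via `Uri.TryCreate(Uri, string, out Uri)`. The `LinkNormalizer` class and `Normalize` method are hypothetical names, and skipping non-HTTP schemes and dropping fragments are my own choices for a crawler, not requirements of `HttpWebRequest`.

```csharp
using System;

// Hypothetical helper, not part of any library.
public static class LinkNormalizer
{
    // Returns an absolute http/https URL, or null when the href is not crawlable.
    // pageUri is the absolute URL of the page the href was found on.
    public static string Normalize(Uri pageUri, string href)
    {
        if (string.IsNullOrWhiteSpace(href)) return null;

        // Resolves href against pageUri; an href that is already absolute
        // is kept as it is.
        if (!Uri.TryCreate(pageUri, href.Trim(), out Uri resolved)) return null;

        // Skip mailto:, javascript:, tel: and other non-HTTP schemes.
        if (resolved.Scheme != Uri.UriSchemeHttp &&
            resolved.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }

        // Keep scheme, host, path and query; drop the #fragment,
        // which is never sent to the server.
        return resolved.GetLeftPart(UriPartial.Query);
    }
}
```

With a page at `http://www.example.com/dir/page.html`, an href of `/about` would resolve to `http://www.example.com/about`, and `page2.html` to `http://www.example.com/dir/page2.html`. An href such as `www.example.com/products/productname` has no scheme, so it is resolved as a relative path rather than as a host name; detecting that case would need an extra heuristic.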

I've been searching for the last 3 days without much luck.

Simon Jensen
  • What have you already tried? – ProgrammingLlama Apr 16 '18 at 07:14
  • The only solution I could think of was matching strings against something predefined, which would quickly get messy and rigid, which is why I tried looking for better options. But my research got me nowhere, so I must admit I don't have much to show. The best I found was this: https://stackoverflow.com/questions/11363493/what-is-best-way-to-normalize-an-uri-to-extract-just-the-domain-name but that won't help with the relative links – Simon Jensen Apr 16 '18 at 07:20
  • I won't post this as an answer yet since I'm aware it's flawed (for example, `./abc.htm` is a valid relative URL in HTML, but it will throw an error) and will try to work on it more later. Maybe you can improve it for your use case. If you do before me, please add the updated version as an answer. Plus there's probably a better way to handle the domain check (other than just checking for a `.`) [Code](https://pastebin.com/45PYR7jv). Also, I'm not sure if I should put `www.` at the start of hosts without it, since not all sites define a record for `www` – ProgrammingLlama Apr 16 '18 at 07:31
  • For this case, the `www` wouldn't make a difference. Your code example looks interesting, but is there any chance you can add comments explaining what everything does? – Simon Jensen Apr 16 '18 at 07:43
  • I will this evening when I have a chance to improve it :) – ProgrammingLlama Apr 16 '18 at 07:46
  • When calling your method, do I call it as follows: `normalizeUrl(urlToTest, "www.example.com")`? – Simon Jensen Apr 16 '18 at 07:49
  • `http://www.example.com`, but yes. – ProgrammingLlama Apr 16 '18 at 07:49
  • See the [How to get img/src or a/hrefs using Html Agility Pack?](https://stackoverflow.com/questions/4835868/how-to-get-img-src-or-a-hrefs-using-html-agility-pack/49853660#49853660) discussion on Stack Overflow. Also note that `www.example.com/products/productname` is a relative URL that just looks absolute. – Leonid Vasilev Apr 16 '18 at 09:18

0 Answers