I am trying to pull a list of all href links from an HTML document. These links will then be fed into a System.Net.HttpWebRequest to fetch the HTML documents for those pages; essentially, I am building a crawler.
I use RegEx to pull a list of links from the page: href="(.*?)"
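A minimal sketch of that extraction step in C#, assuming the page HTML is already held in a string. The names LinkExtractor and ExtractHrefs are hypothetical, chosen only for illustration, and the pattern matches double-quoted href values only:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Same pattern as above: lazily capture everything between href=" and the next ".
    // IgnoreCase is an added assumption so that HREF="..." is matched as well.
    private static readonly Regex HrefPattern =
        new Regex("href=\"(.*?)\"", RegexOptions.IgnoreCase);

    // Returns the raw href values exactly as they appear in the HTML (no normalization).
    public static List<string> ExtractHrefs(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }
}
```

ExtractHrefs returns the attribute values unchanged, which is why the mixed link forms listed below all end up in the result.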
Problems arise when the links pulled from the page aren't strictly of the form "http://www.example.com". The links I get from the HTML document come in several forms (fictional examples):
- http://www.example.com/products/productname
- http://example.com/products/productname
- www.example.com/products/productname
- /products/productname (relative links)
I need a way to normalize all of these link forms into absolute URLs that HttpWebRequest accepts.
I've been searching for the last 3 days without much luck.