I have a task to write a C# program that finds all HTTP links on a website. So far I have written this function:

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Returns Task rather than void so that callers can await it and observe exceptions.
static async Task DownloadWebPage(string url)
{
  using (HttpClient client = new HttpClient())
  using (HttpResponseMessage response = await client.GetAsync(url))
  using (HttpContent content = response.Content)
  {
    // Read the whole response body as one string.
    string result = await content.ReadAsStringAsync();
    // Split the page source on the literal text "href".
    string[] resArr = result.Split(new string[] {"href"}, StringSplitOptions.RemoveEmptyEntries);

    // TODO: code that finds all the necessary HTTP links in resArr goes here

    Console.WriteLine("Main page of " + url + " size = " + result.Length);
  }
}

With this function I load the content of a web page into a string, then split that string into an array using "href" as the separator, and then check every array element for the "href" substring, so that I end up with the strings that contain HTTP links. The problem starts once the string has been split: I cannot find the HTTP links in the resulting pieces, and I think this is due to the format of the content in the string. How can I fix it?

pragmus
    You should look into using an actual Html parser, like HtmlAgilityPack. Using string.Split (or regex) is a bad idea. – gunr2171 Aug 27 '14 at 12:24
    You're not _parsing_ anything. `` will result in ``. If you add more links you'll have even more garbage there. You have to use an HTML parser for that (and it won't consider links triggered from JavaScript). A raw solution MAY be to use a regex (note that you'll match URLs, you can't use regex to parse HTML) to find all URLs but then you have to clean that list (for example to drop POSTs, scripts, CSSs and so on). – Adriano Repetti Aug 27 '14 at 12:27
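
As a rough illustration of the "raw" regex route mentioned in the comment above, here is a minimal sketch that uses only the standard library. The class name `LinkExtractor` and the method `ExtractHttpLinks` are hypothetical names chosen for this example. It matches quoted href attribute values and keeps only those starting with http:// or https://. It does not parse HTML: it will also pick up hrefs from tags such as `<link>`, it misses unquoted attributes and links built by JavaScript, so the resulting list still needs cleaning.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Matches href="..." or href='...' and captures the quoted value.
    // This is a pattern match on text, not an HTML parser.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*[""'](?<url>[^""']+)[""']",
        RegexOptions.IgnoreCase);

    // Returns the distinct absolute http/https URLs found in href attributes,
    // in order of first appearance.
    public static List<string> ExtractHttpLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            string url = match.Groups["url"].Value;
            bool isHttp =
                url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
            if (isHttp && !links.Contains(url))
            {
                links.Add(url);
            }
        }
        return links;
    }
}
```

In the question's function, this could be called as `LinkExtractor.ExtractHttpLinks(result)` in place of the `Split` call. For anything beyond a quick experiment, a real HTML parser such as the HtmlAgilityPack suggested in the first comment is likely the safer choice.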

1 Answer

I once did something similar. My solution was to modify the HTML so that it conformed to XML rules. (This may be the weak point of the approach: I believe my HTML was predefined in some way, so I only had to change a few things that I knew were not XML-conformant.)

After that, you can simply search for the "a" nodes and read their href attribute.

Unfortunately, I can't find my code anymore; it was too long ago.
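
The general idea can still be sketched. The following is not the original code, only a minimal illustration under the assumption that the input is already well-formed XML (for example XHTML); the cleanup step that makes arbitrary HTML XML-conformant is not shown. The class name `XhtmlLinkReader` and the method `ReadHrefs` are hypothetical. It loads the markup with `XDocument` from `System.Xml.Linq`, finds every "a" element, and reads its href attribute.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class XhtmlLinkReader
{
    // Returns the href value of every <a> element that has one.
    // Assumes the markup is well-formed XML; XDocument.Parse throws
    // an XmlException otherwise (typical for real-world HTML).
    public static List<string> ReadHrefs(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc.Descendants()
                  // Compare LocalName so that elements in the XHTML namespace also match.
                  .Where(e => e.Name.LocalName == "a")
                  // The cast yields null when the attribute is absent.
                  .Select(e => (string)e.Attribute("href"))
                  .Where(href => !string.IsNullOrEmpty(href))
                  .ToList();
    }
}
```

Because most pages on the web are not well-formed XML, this only works when the HTML is under your control or has been cleaned up first.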