
I'm currently working on a URL extractor for work. I'm trying to extract all http/href links from a downloaded HTML file and print the links on their own in a separate txt file. So far I've managed to download the entire HTML of a page; it's just extracting the links from it and printing them using Regex that is the problem. Wondering if anyone could help me with this?

    private void button2_Click(object sender, EventArgs e)
    {
        Uri fileURI = new Uri(URLbox2.Text);

        WebRequest request = WebRequest.Create(fileURI);
        request.Credentials = CredentialCache.DefaultCredentials;
        WebResponse response = request.GetResponse();
        Console.WriteLine(((HttpWebResponse)response).StatusDescription);
        Stream dataStream = response.GetResponseStream();
        StreamReader reader = new StreamReader(dataStream);
        string responseFromServer = reader.ReadToEnd();

        SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
        SW.WriteLine(responseFromServer);

        SW.Close();

        string text = System.IO.File.ReadAllText(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");
        string[] links = System.IO.File.ReadAllLines(@"C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\response1.htm");



        Regex regx = new Regex(links, @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\@\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?", RegexOptions.IgnoreCase);

        MatchCollection mactches = regx.Matches(text);

        foreach (Match match in mactches)
        {
            text = text.Replace(match.Value, "<a href='" + match.Value + "'>" + match.Value + "</a>");
        }

        SW = File.CreateText("C:\\Users\\Conal_Curran\\OneDrive\\C#\\MyProjects\\Web Crawler\\URLTester\\Links.htm");
        SW.WriteLine(links);
    }
Conall Curran
  • Have you tried googling the problem? – d0nut Dec 29 '15 at 15:56
  • Could you clarify this sentence? It's not 100% clear to me what is working and which part is blocking you: "So far I've managed to download the entire HTML of a page; it's just extracting the links from it and printing them using Regex that is the problem." – Starceaker Dec 29 '15 at 15:56
  • Your code example doesn't remotely look like it's trying to do what you're saying you want to do. Did you copy and paste this from somewhere? – d0nut Dec 29 '15 at 15:57
  • Is using a Regex a must? The HTML Agility Pack makes this quite easy, see [here.](https://stackoverflow.com/questions/25688847/html-agility-pack-get-all-urls-on-page) – haddow64 Dec 29 '15 at 16:00
  • @Starceaker I'm able to download an entire HTML web page, and within that web page are a number of href links that I am trying to extract and print into a separate txt file. – Conall Curran Dec 29 '15 at 16:02
  • @haddow64 No, Regex is not a must. I'll have a look at the HTML Agility Pack, thanks. – Conall Curran Dec 29 '15 at 16:04
  • When you say you have a problem you mean you don't know how? I thought you had a certain exception message or something else blocking you. – Starceaker Dec 29 '15 at 16:04
  • @iismathwizard Yes, I've had a look on Google but was unsuccessful in finding an answer. I wrote this code myself, but with the help of an example from a C# tutorial. – Conall Curran Dec 29 '15 at 16:05
  • @ConallCurran reading the bottom section. You're doing a replace in the webpage that takes a URL and puts it **into** a link. This does not sound like what you want. – d0nut Dec 29 '15 at 16:08
  • @Starceaker the error I get is "cannot convert 'string' to 'System.Text.RegularExpressions.RegexOptions'". As you can see, I'm trying to pass links through Regex, however links is classed as a string. I don't know how to convert links in order for Regex to read the content and match the href tags. – Conall Curran Dec 29 '15 at 16:09
  • @ConallCurran btw when you use `@` in C# you do **not** need to escape anything in the string. So there is no reason to write `"\\w"` just write `@"\w"` – d0nut Dec 29 '15 at 16:11
  • @ConallCurran `@"http://([\w+?\.\w+])+([a-zA-Z0-9\~\!\@\#\$\%\^\&\*\(\)_\-\=\+\\\/\?\.\:\;\'\,]*)?"` – d0nut Dec 29 '15 at 16:13
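
As the error quoted in the comments shows, the Regex constructor expects a pattern string plus RegexOptions, not the string[] of lines; the matching is then run against the page text. A minimal sketch of that corrected usage (the Links.txt output path and the simplified pattern here are illustrative, not taken from the original code; it needs using System.IO;, using System.Linq; and using System.Text.RegularExpressions;):

//build the Regex from a pattern string and RegexOptions (not from the string[] read from the file)
string text = File.ReadAllText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

//with a verbatim (@) string, single backslashes are enough; this pattern is a simplified example
Regex regx = new Regex(@"https?://[^\s""'<>]+", RegexOptions.IgnoreCase);

//collect every match and write each link on its own line
var links = regx.Matches(text).Cast<Match>().Select(m => m.Value).Distinct();
File.WriteAllLines(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\Links.txt", links);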

1 Answer


In case you do not know, this can be achieved (pretty easily) using one of the HTML parser NuGet packages available.

I personally use HtmlAgilityPack (along with ScrapySharp, another package) and AngleSharp.

With only the 3 lines below, you have all the hrefs in the document, whether it was loaded from your local file or fetched with an HTTP GET request, using HtmlAgilityPack:

/*
  do not forget to include the usings:
  using System.Linq;
  using HtmlAgilityPack;
  using ScrapySharp.Extensions;
*/

//since you have your html locally stored, you do the following
//(P.S: by prefixing file path strings with @, you are rid of having to escape slashes and other fluffs):
var doc = new HtmlDocument();
doc.Load(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm");

//for an http get request instead, let HtmlWeb fetch and parse the page:
//var doc = new HtmlWeb().Load("yourAddressHere");

//CssSelect and the one-argument GetAttributeValue are extension methods from ScrapySharp.Extensions
var hrefs = doc.DocumentNode.CssSelect("a").Select(a => a.GetAttributeValue("href"));
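
Since the original goal was to print the links on their own into a separate txt file, the hrefs from the snippet above can then simply be written out, for example (the output path is just an illustration; it needs using System.IO;):

//each extracted href goes on its own line
File.WriteAllLines(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\Links.txt", hrefs);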
Veverke
  • Simple solution that is far more effective than using any regular expression. +1 – d0nut Dec 29 '15 at 16:04
  • I must however start slowly moving all my code to AngleSharp, since HtmlAgilityPack is not maintained anymore. – Veverke Dec 29 '15 at 16:05
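
For reference, the AngleSharp equivalent of the snippet in the answer looks roughly like this (a sketch assuming a recent AngleSharp release; older versions expose Parse instead of ParseDocument and use a different parser namespace):

//using AngleSharp.Html.Parser; using System.IO; using System.Linq;
var parser = new HtmlParser();
var document = parser.ParseDocument(File.ReadAllText(@"C:\Users\Conal_Curran\OneDrive\C#\MyProjects\Web Crawler\URLTester\response1.htm"));
var hrefs = document.QuerySelectorAll("a[href]").Select(a => a.GetAttribute("href"));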