How to use Html Agility Pack to extract all URLs from HTML text

Question

I often extract file names from HTML text using regular expressions, but I have heard that Html Agility Pack is good at parsing HTML. How can I use Html Agility Pack to extract all URLs from HTML data? Can anyone guide me with sample code? Thanks.

This is my code sample which works fine.

using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

private ArrayList GetFilesName(string Source)
{
    ArrayList arrayList = new ArrayList();
    // Capture the value of every src="..." attribute, case-insensitively.
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        // Keep only the file names of sources that are not absolute http:// URLs.
        if (!match.Value.StartsWith("http://"))
        {
            arrayList.Add(Path.GetFileName(match.Value));
        }
    }
    return arrayList;
}

private string ReplaceSrc(string Source)
{
    Regex regex = new Regex("(?<=src=\")([^\"]+)(?=\")", RegexOptions.IgnoreCase);
    foreach (Match match in regex.Matches(Source))
    {
        // Rewrite each src value so that it points into the local images/ folder.
        string value = match.Value;
        string str = string.Concat("images/", Path.GetFileName(value));
        Source = Source.Replace(value, str);
    }
    return Source;
}

score 2 · Answer 1 · answered Mar 12 '13 at 15:19

Something like:

var doc = new HtmlDocument();
doc.LoadHtml(html);

var images = doc.DocumentNode.Descendants("img")
    .Where(i => i.GetAttributeValue("src", null) != null)
    .Select(i => i.Attributes["src"].Value);

This selects all <img> elements in the document that have a src attribute set and returns their URLs.
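Since the question asks for all URLs rather than only image sources, the same approach can be extended to links. The following is a minimal sketch, not a definitive implementation: the class name `UrlExtractor` and method name `ExtractUrls` are hypothetical, and the choice of `<img src>` plus `<a href>` as "all URLs" is an assumption (other elements such as `<script>` or `<link>` would need to be added the same way).

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class UrlExtractor
{
    // Hypothetical helper: collects src values of <img> and href values of <a>.
    public static List<string> ExtractUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // GetAttributeValue returns the supplied default (null) when the attribute is absent.
        var imageSources = doc.DocumentNode.Descendants("img")
            .Select(n => n.GetAttributeValue("src", null));
        var linkTargets = doc.DocumentNode.Descendants("a")
            .Select(n => n.GetAttributeValue("href", null));

        return imageSources.Concat(linkTargets)
            .Where(url => !string.IsNullOrEmpty(url)) // drop missing or empty attributes
            .Distinct()                               // report each URL once
            .ToList();
    }
}
```

Calling `UrlExtractor.ExtractUrls(html)` with the page markup returns the distinct image sources followed by the link targets.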

Oleks

Why not just `doc.DocumentNode.Descendants("img").Select(i => i.GetAttributeValue("src", null))`, with a `Where(url => url != null)` if you *must not* have nulls in your enumerable? – RoadieRich Mar 13 '13 at 11:16

score 0 · Accepted Answer · answered Mar 12 '13 at 15:31

Select all img tags with a non-empty src attribute (otherwise you will get a NullReferenceException when reading the attribute value):

HtmlDocument html = new HtmlDocument();
html.Load(path_to_file);
var urls = html.DocumentNode.SelectNodes("//img[@src!='']")
               .Select(i => i.Attributes["src"].Value);
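One caveat with the XPath version: as far as I know, `SelectNodes` returns null rather than an empty collection when nothing matches, so the `Select` call would throw on a document without matching images. Below is a hedged sketch of a guarded variant; the class name `ImageSources` and method name `FromFile` are hypothetical, and the null-on-no-match behavior is assumed to be the library default.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ImageSources
{
    // Hypothetical helper: path is the HTML file to read.
    public static IEnumerable<string> FromFile(string path)
    {
        var html = new HtmlDocument();
        html.Load(path);

        // Assumption: SelectNodes yields null when no <img> has a non-empty src.
        var nodes = html.DocumentNode.SelectNodes("//img[@src!='']");
        if (nodes == null)
        {
            return Enumerable.Empty<string>();
        }

        return nodes.Select(i => i.Attributes["src"].Value);
    }
}
```

This keeps the caller free of null checks: a page without images simply produces an empty sequence.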

Sergey Berezovskiy

This sample is a bit friendlier. – Mou Mar 14 '13 at 07:00
