How to detect all relative URLs within an HTML webpage?

Question

As the title states: is there some way to detect all the URLs inside a PHP page if they're relative? The URLs contained in the page may, of course, appear in different forms:

<link rel="stylesheet" href="/lib/css/hanv2/ie.css" />
<img src="/image.jpg">
<div style="background-image: url(/lib/data/emotion-header-v2/int-algemeen08.jpg)"></div>

So I need to get the relative URLs no matter what form they take: CSS link, JS link, image link, SWF link.

I'm using HtmlAgilityPack for this, and here are some C# code snippets that I use to detect links and check whether they're relative:

    // to extract the href of every <link> tag (the XPath selects <link> elements, not <a>)
    private List<string> ExtractAllAHrefTags(HtmlAgilityPack.HtmlDocument htmlSnippet)
    {
        List<string> hrefTags = new List<string>();

        // SelectNodes returns null when nothing matches, so guard before iterating
        var links = htmlSnippet.DocumentNode.SelectNodes("//link[@href]");
        if (links == null) return hrefTags;

        foreach (HtmlNode link in links)
        {
            HtmlAttribute att = link.Attributes["href"];
            hrefTags.Add(att.Value);
        }

        return hrefTags;
    }

    // to extract the src of every <img> tag
    private List<string> ExtractAllImgTags(HtmlAgilityPack.HtmlDocument htmlSnippet)
    {
        List<string> srcTags = new List<string>();

        // same null guard as above
        var images = htmlSnippet.DocumentNode.SelectNodes("//img[@src]");
        if (images == null) return srcTags;

        foreach (HtmlNode image in images)
        {
            HtmlAttribute att = image.Attributes["src"];
            srcTags.Add(att.Value);
        }

        return srcTags;
    }




    // to check whether a path is relative
    foreach (string s in AllHrefTags)
    {
        // && rather than ||: with ||, the condition is true for every string
        if (!s.StartsWith("http://") && !s.StartsWith("https://"))
        {
            // path is relative
        }
    }

I'm wondering if there is a better or more accurate way to get all relative paths from a given HTML page, using HtmlAgilityPack or something else, in a concise way.

HtmlAgilityPack can't properly parse PHP source, and even if it could, the source is unlikely to contain the rendered links... Are you sure that you need to parse PHP, and not the HTML that is produced by some server-side code (which may be PHP)? - Alexei Levenkov, Jan 18 '13 at 21:16

Amine Hajyoussef · Answer 1 · 2013-01-21T15:41:14.420

You can use this XPath expression to extract the relative URLs that appear as href or src values in an HTML page:

htmlSnippet.DocumentNode.SelectNodes("(//@src|//@href)[not(starts-with(.,'http://'))][not(starts-with(.,'https://'))]");
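As a minimal sketch of the same idea with the filtering done in C# rather than in XPath, the snippet below selects every element that carries a src or href attribute and keeps the values that do not start with http:// or https://. It assumes the HtmlAgilityPack NuGet package is referenced; the class name RelativeUrlExtractor and the method GetRelativeUrls are hypothetical names chosen for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of HtmlAgilityPack.
static class RelativeUrlExtractor
{
    public static List<string> GetRelativeUrls(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes can return null when nothing matches, so fall back to an empty sequence.
        IEnumerable<HtmlNode> nodes = doc.DocumentNode.SelectNodes("//*[@src or @href]");
        nodes = nodes ?? Enumerable.Empty<HtmlNode>();

        var result = new List<string>();
        foreach (HtmlNode node in nodes)
        {
            foreach (string name in new[] { "src", "href" })
            {
                // An empty default means "attribute not present" here.
                string value = node.GetAttributeValue(name, string.Empty).Trim();
                if (value.Length == 0) continue;

                bool absolute =
                    value.StartsWith("http://", StringComparison.OrdinalIgnoreCase) ||
                    value.StartsWith("https://", StringComparison.OrdinalIgnoreCase);

                if (!absolute) result.Add(value);
            }
        }
        return result;
    }
}
```

Reading the attribute values from the elements keeps the result a plain list of strings, which is easier to post-process than a node collection.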

You might want to filter out links that start with #, which are used to jump to a specific location on the current page (e.g. <a href="#tips">):

    htmlSnippet.DocumentNode.SelectNodes("(//@src|//@href)[not(starts-with(.,'http://'))][not(starts-with(.,'https://'))][not(starts-with(.,'#'))]");
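Two cases from the question are not covered by a prefix test on http:// and https://: values such as mailto:, data: or javascript: URLs would be counted as relative, and URLs inside inline style attributes (url(...)) are not src or href values at all. The following is a rough sketch of how both might be handled with only the .NET base class library. The class UrlSketch and its two methods are hypothetical names, and treating protocol-relative URLs (//host/path) as not relative is an assumption about what is wanted here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
static class UrlSketch
{
    // A URI scheme as described in RFC 3986: a letter, then letters, digits, '+', '.' or '-', then ':'.
    private static readonly Regex Scheme = new Regex(@"^[a-zA-Z][a-zA-Z0-9+.\-]*:");

    // Matches url(...) in CSS text, with optional single or double quotes around the value.
    private static readonly Regex CssUrl =
        new Regex(@"url\(\s*['""]?([^'"")]+?)['""]?\s*\)", RegexOptions.IgnoreCase);

    public static bool IsRelative(string url)
    {
        if (string.IsNullOrWhiteSpace(url)) return false;
        url = url.Trim();

        if (url.StartsWith("#", StringComparison.Ordinal)) return false;  // same-page anchor
        if (url.StartsWith("//", StringComparison.Ordinal)) return false; // protocol-relative (assumed unwanted)

        // Anything with a scheme (http:, https:, mailto:, data:, ...) is not relative.
        return !Scheme.IsMatch(url);
    }

    // Returns every url(...) value found in a style attribute or CSS fragment.
    public static IEnumerable<string> ExtractCssUrls(string css)
    {
        foreach (Match m in CssUrl.Matches(css ?? string.Empty))
            yield return m.Groups[1].Value.Trim();
    }
}
```

For the style attribute in the question, ExtractCssUrls yields /lib/data/emotion-header-v2/int-algemeen08.jpg, which IsRelative then accepts. The values of style attributes themselves could be collected with an XPath such as //@style or //*[@style].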
