
I have HTML that I need to parse, and I'm using C# and the Html Agility Pack library to select nodes. My HTML will look something like either:

<input data-translate-attr-placeholder="FORGOT_PASSWORD.FORM.EMAIL">

or :

<h1 data-translate="FORGOT_PASSWORD.FORM.EMAIL"></h1>

where data-translate-attr-**** is the new pattern of attribute names I need to find.

I could use something like this :

//[contains(@??,'data-translate-attr')]

but unfortunately, that only searches for a value INSIDE an attribute. How do I look for the attribute itself, with a wildcard?

Update : @Mathias Muller

var htmlDoc = new HtmlAgilityPack.HtmlDocument(); // the HTML is loaded into htmlDoc elsewhere
// this is the old code (returns nodes)
var oldNodes = htmlDoc.DocumentNode.SelectNodes("//@data-translate");
// these suggestions return no nodes using the same data
var containsNodes = htmlDoc.DocumentNode.SelectNodes("//@*[contains(name(),'data-translate')]");
var startsWithNodes = htmlDoc.DocumentNode.SelectNodes("//@*[starts-with(name(),'data-translate')]");

Update 2

This appears to be an Html Agility Pack issue more than an XPath issue. I used Chrome to test my XPath expressions, and all of the following worked in Chrome but not in Html Agility Pack:

//@*[contains(local-name(),'data-translate')]
//@*[starts-with(name(),'data-translate')]
//attribute::*[starts-with(local-name(.),'data-translate')]

My Solution

I ended up just doing things the old-fashioned way:

var nodes = htmlDoc.DocumentNode.SelectNodes("//@*");

if (nodes != null) {
    foreach (HtmlNode node in nodes) {
        if (node.HasAttributes) {
            foreach (HtmlAttribute attr in node.Attributes) {
                if (attr.Name.StartsWith("data-translate")) {
                    // code in here to handle translation node
                }
            }
        }
    }
}
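
The same filtering can also be written without XPath at all. The following is only a sketch, assuming the Html Agility Pack NuGet package is referenced and that its LINQ-friendly members (`Descendants()`, `Attributes`, `OwnerNode`) behave as in current releases; the sample markup and the variable names are made up for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample markup modelled on the two snippets in the question.
var html = "<input data-translate-attr-placeholder=\"FORGOT_PASSWORD.FORM.EMAIL\">" +
           "<h1 data-translate=\"FORGOT_PASSWORD.FORM.EMAIL\"></h1>";

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

// Walk every node, flatten all attributes, and keep those whose name
// starts with "data-translate" (this also matches "data-translate-attr-*").
var translateAttrs = htmlDoc.DocumentNode
    .Descendants()
    .SelectMany(node => node.Attributes)
    .Where(attr => attr.Name.StartsWith("data-translate", StringComparison.Ordinal))
    .ToList();

foreach (var attr in translateAttrs)
{
    // OwnerNode is the element that carries the attribute.
    Console.WriteLine($"<{attr.OwnerNode.Name}> {attr.Name} = {attr.Value}");
}
```

This avoids relying on how Html Agility Pack evaluates `name()` in XPath, at the cost of visiting every node in the document.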
Ninjanoel

2 Answers


Rather than using name(), use local-name(), for example:

var nodes = htmlDoc.DocumentNode.SelectNodes("//@*[starts-with(local-name(),'data-translate')]");

The difference is that name() should give you the attribute name including any prefix (such as a namespace prefix in XML), whereas local-name() omits that prefix if it's there. In your case name() and local-name() should behave the same way, because it's HTML and there are no namespaces, but it seems that they don't, which is probably a bug.

Test:

    var html = "<h3 x='foo'></h3>";
    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    var ElementByName = doc.DocumentNode.SelectSingleNode("//*[name()='h3']");                //Works
    var ElementByLocalName = doc.DocumentNode.SelectSingleNode("//*[local-name()='h3']");     //Works
    var ElementByAttributeLocalName = doc.DocumentNode.SelectSingleNode("//*[@*[local-name()='x']]"); //Works
    var ElementByAttributeName = doc.DocumentNode.SelectSingleNode("//*[@*[name()='x']]");  //Does NOT

    //Mathias Way
    var ElementByAttributeLocalName_ = doc.DocumentNode.SelectSingleNode("//@*[local-name() = 'x']"); //Works
    var ElementByAttributeName_ = doc.DocumentNode.SelectSingleNode("//@*[name() = 'x']");  //Does NOT
Xi Sigma
  • Thanks for the answer, I tested 'local-name()', and it works as well as 'name()' in this instance. – Ninjanoel Apr 23 '15 at 09:25
  • @Nnoel In `HtmlAgilityPack` only `local-name()` will work; `name()` should work the same way in the case of HTML, but there is a bug. – Xi Sigma Apr 23 '15 at 09:32
  • How do you know that's a bug in AgilityPack? Any reference to back this up? – Mathias Müller Apr 23 '15 at 10:18
  • @MathiasMüller Because I know the problem: `name()` never works in HAP, so I always use `local-name()`. – Xi Sigma Apr 23 '15 at 10:20
  • I deleted my comment, because I read that attribute selection is not possible in HAP, is that true? Also, I found a post that supports your claim: http://htmlagilitypack.codeplex.com/workitem/35920. – Mathias Müller Apr 23 '15 at 11:24
  • @MathiasMüller If you try to select an attribute, it will return the element with that attribute, so your way works fine, but you get the element, not the attribute. – Xi Sigma Apr 23 '15 at 11:25

Use the XPath functions contains() or starts-with(). You need an XPath expression like

//@*[contains(name(),'data-translate')]

or perhaps

//@*[starts-with(name(),'data-translate')]

which actually retrieves attribute nodes. Above, the @* is the attribute wildcard you were looking for.
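
As a sanity check that the expression itself is valid XPath 1.0, here is a minimal sketch that runs it through .NET's built-in System.Xml rather than Html Agility Pack. The sample markup is a hypothetical, well-formed XML stand-in for the question's HTML (System.Xml will not parse arbitrary HTML), and the variable names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Well-formed XML stand-in for the HTML in the question (assumed sample data).
var xml = "<root>" +
          "<input data-translate-attr-placeholder=\"FORGOT_PASSWORD.FORM.EMAIL\" />" +
          "<h1 data-translate=\"FORGOT_PASSWORD.FORM.EMAIL\"></h1>" +
          "<p class=\"plain\"></p>" +
          "</root>";

var doc = new XmlDocument();
doc.LoadXml(xml);

// @* matches every attribute; the predicate keeps those whose name
// starts with 'data-translate'. The result contains attribute nodes.
var matchedNames = new List<string>();
foreach (XmlNode attr in doc.SelectNodes("//@*[starts-with(name(),'data-translate')]"))
{
    matchedNames.Add(attr.Name);
    Console.WriteLine($"{attr.Name} = {attr.Value}");
}
```

Here the `class` attribute is filtered out and both `data-translate` attributes are returned as attribute nodes, which is the behaviour the expression is meant to have in a conforming XPath engine.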

Mathias Müller
  • Thanks for the answer, I've updated the question, comparing what I have to what you've suggested. Am I making some silly mistake? – Ninjanoel Apr 22 '15 at 16:05
  • @Nnoel That's impossible, unless there are namespaced attributes in your actual input document. Otherwise, it's an impossibility that `//@data-translate` returns nodes, whereas `//@*[contains(name(),'data-translate')]` does not. Is there any essential information you have not given so far? Any simplification you've made to break down the problem? – Mathias Müller Apr 22 '15 at 17:44
  • Marking your answer correct, as your XPath is correct when tested with chrome. – Ninjanoel Apr 23 '15 at 09:23
  • @Nnoel The reason it does not work is that [HAP does not support attribute selection](http://stackoverflow.com/a/576172/1987598) - unfortunately. – Mathias Müller Apr 23 '15 at 11:17
  • Strange, I was using it to select attributes before, just 'advanced' wildcard selections were failing. Anyway, thanks for your help, after finding Html Agility Pack is the problem, I tried C#'s xml routines, but they don't work well with html, so back to Html Agility Pack. Posted my solution.. – Ninjanoel Apr 23 '15 at 13:09