11

I'm trying to retrieve a specific image from an HTML document, using Html Agility Pack and this XPath:

//div[@id='topslot']/a/img/@src

As far as I can see, it finds the src-attribute, but it returns the img-tag. Why is that?

I would expect the InnerHtml/InnerText or something to be set, but both are empty strings. OuterHtml is set to the complete img-tag.

Is there any documentation for Html Agility Pack?

Vegar

7 Answers

16

You can grab the attribute directly if you use an HtmlNodeNavigator instead.

//Load document from some html string
HtmlDocument hdoc = new HtmlDocument();
hdoc.LoadHtml(htmlContent);

//Load navigator for current document
HtmlNodeNavigator navigator = (HtmlNodeNavigator)hdoc.CreateNavigator();

//Get value from given xpath
string xpath = "//div[@id='topslot']/a/img/@src";
string val = navigator.SelectSingleNode(xpath).Value;
grepfruit
Pierluc SS
  • While this works for reading the attribute's value it is not possible to modify it. Calling `.SetValue("new_value")` on the selected attribute node throws a `System.NotSupportedException` since the returned `HtmlNodeNavigator` is **read-only**. – Andre Aug 10 '15 at 12:59
  • Isn't this answer a direct contradiction to the accepted answer (modification was not part of the question)? – David S. Oct 04 '16 at 11:07
  • @DavidS. I guess the OP just never bothered switching it since I added this answer roughly 4 years later – Pierluc SS Oct 04 '16 at 17:16
13

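Html Agility Pack does not support attribute selection through SelectNodes or SelectSingleNode: an XPath that ends in @src returns the element that owns the attribute, not the attribute itself.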
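
A common workaround is to stop the XPath at the img element and read the attribute from the returned node. The following is only a minimal sketch: the sample markup and the image path are made up for illustration, and it assumes the HtmlAgilityPack NuGet package is referenced.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup shaped like the page in the question.
        string html = "<div id='topslot'><a href='#'><img src='/images/banner.png' alt='banner'></a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the element itself; the XPath stops at img instead of @src.
        HtmlNode img = doc.DocumentNode.SelectSingleNode("//div[@id='topslot']/a/img");

        // SelectSingleNode returns null when nothing matches, so guard before reading.
        // The second argument is the fallback used when the attribute is missing.
        string src = img?.GetAttributeValue("src", string.Empty);

        Console.WriteLine(src);
    }
}
```

Because the result is an ordinary HtmlNode, the attribute can also be changed afterwards, for example with img.SetAttributeValue("src", "...").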

Azat Razetdinov
8

You can use the GetAttributeValue method.

Example:

//[...] the code before this point needs to load an HTML document
HtmlAgilityPack.HtmlDocument htmldoc = e.Document;
//get all "a" nodes matching the XPath expression
HtmlNodeCollection AllNodes = htmldoc.DocumentNode.SelectNodes("*[@class='item']/p/a");
//SelectNodes returns null (not an empty collection) when nothing matches
if (AllNodes != null)
{
    //show a message box for each node found, displaying the value of its "href" attribute
    foreach (var MensaNode in AllNodes)
    {
        string url = MensaNode.GetAttributeValue("href", "not found");
        MessageBox.Show(url);
    }
}
Ben
1

Html Agility Pack will support it soon.

http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342

Almas
  • Here's the updated link: http://web.archive.org/web/20110109221024/http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=204342 And it doesn't seem to have turned into a release. – ygoe Oct 30 '18 at 08:23
1

Reading and Writing Attributes with Html Agility Pack

You can both read and set attributes in Html Agility Pack. This example selects the <html> tag, checks whether the 'lang' (language) attribute exists, and then reads and writes it.

In the examples below, this.All in doc.LoadHtml(this.All) is a string containing the HTML document.

Read and write:

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(this.All);
            string language = string.Empty;
            var nodes = doc.DocumentNode.SelectNodes("//html");
            for (int i = 0; i < nodes.Count; i++)
            {
                if (nodes[i] != null && nodes[i].Attributes.Count > 0 && nodes[i].Attributes.Contains("lang"))
                {
                    language = nodes[i].Attributes["lang"].Value; //Get attribute
                    nodes[i].Attributes["lang"].Value = "en-US"; //Set attribute
                }
            }

Read only:

            HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
            doc.LoadHtml(this.All);
            string language = string.Empty;
            var nodes = doc.DocumentNode.SelectNodes("//html");
            foreach (HtmlNode a in nodes)
            {
                if (a != null && a.Attributes.Count > 0 && a.Attributes.Contains("lang"))
                {
                    language = a.Attributes["lang"].Value;
                }
            }
CodeBon
0

I obtained the attributes of an image the following way.

// Requires System.Linq; ?.Value reads the attribute's value, or yields null if there is no src attribute
var MainImageString  = MainImageNode.Attributes.Where(i=> i.Name=="src").FirstOrDefault()?.Value;

You can specify the attribute name to get its value; if you don't know the attribute name, set a breakpoint after you have fetched the node and inspect its attributes by hovering over it.

Hope I helped.

Abhay Shiro
0

I just faced this problem and solved it using the GetAttributeValue method.

//Selecting all tbody elements
IList<HtmlNode> nodes = doc.QuerySelectorAll("div.characterbox-main")[1]
.QuerySelectorAll("div table tbody");

//Iterating over them and getting the src attribute value of img elements.
var data = nodes.Select((node) =>
{
     return new
     {
         name = node.QuerySelector("tr:nth-child(2) th a").InnerText,
         imageUrl = node.QuerySelector("tr td div a img")
         .GetAttributeValue("src", "default-url")
     };
});
MertStack
  • This looks good, but it also looks like you're not just using HtmlAgilityPack, but rather Fizzler, which includes it. – Davak72 Aug 01 '23 at 13:59