23

How would I parse the href attribute value out of this markup:

<tr>
<td rowspan="1" colspan="1">7</td>
<td rowspan="1" colspan="1">
<a class="undMe" href="/ice/player.htm?id=8475179" rel="skaterLinkData" shape="rect">D. Kulikov</a>
</td>
<td rowspan="1" colspan="1">D</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
<td rowspan="1" colspan="1">0</td>
[...]

I want the player id, which is 8475179. Here is the code I have so far:

        // Iterate all rows (players)
        for (int i = 1; i < rows.Count; ++i)
        {
            HtmlNodeCollection cols = rows[i].SelectNodes(".//td");

            // new player
            Dim_Player player = new Dim_Player();

            // Iterate all columns in this row
            for (int j = 1; j < 6; ++j)
            {
                switch (j) {
                    case 1: player.Name = cols[j].InnerText;
                            player.Player_id = Int32.Parse(/* this is where I want to parse the href value */);
                            break;
                    case 2: player.Position = cols[j].InnerText; break;
                    case 3: stats.Goals = Int32.Parse(cols[j].InnerText); break;
                    case 4: stats.Assists = Int32.Parse(cols[j].InnerText); break;
                    case 5: stats.Points = Int32.Parse(cols[j].InnerText); break;
                }
            }
        }
Jean-François Beaulieu

2 Answers

39

Based on your example this worked for me:

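using System;
using System.Linq;
using HtmlAgilityPack;

HtmlDocument htmlDoc = new HtmlDocument();
htmlDoc.Load("test.html");

// Find the first <a> element whose class attribute is exactly "undMe".
var link = htmlDoc.DocumentNode
                  .Descendants("a")
                  .First(x => x.Attributes["class"] != null
                           && x.Attributes["class"].Value == "undMe");

// The href looks like "/ice/player.htm?id=8475179"; the id follows the '='.
string hrefValue = link.Attributes["href"].Value;
long playerId = Convert.ToInt64(hrefValue.Split('=')[1]);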

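For real use you need to add error checking: First throws if no matching link exists, and Split('=')[1] assumes that id is the only query parameter in the href.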
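
As a rough sketch of what that error checking could look like, the id extraction can be moved into a small helper that reports failure instead of throwing. The names PlayerLinks and TryGetPlayerId are made up for this illustration, and the helper only uses the .NET base class library, so it can be tested without Html Agility Pack:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Html Agility Pack.
static class PlayerLinks
{
    // Captures the digits of an "id" query parameter, e.g. "?id=8475179" or "&id=8475179".
    private static readonly Regex IdPattern = new Regex(@"[?&]id=(\d+)(?=&|#|$)");

    // Returns true and sets playerId when href contains a numeric id parameter;
    // returns false (playerId = 0) for null, empty, or non-matching input.
    public static bool TryGetPlayerId(string href, out long playerId)
    {
        playerId = 0;
        if (string.IsNullOrEmpty(href))
        {
            return false;
        }

        Match match = IdPattern.Match(href);
        return match.Success && long.TryParse(match.Groups[1].Value, out playerId);
    }
}
```

It would then be called with the href value read from the link, for example PlayerLinks.TryGetPlayerId(hrefValue, out playerId), and the row could be skipped or logged when it returns false.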

BrokenGlass
  • Works for me too! Is it just me, or is this rather inconvenient? We have to search `htmlDoc` for all nodes with class 'undMe', when we could instead use `cols[j]`, which has the `href` in its InnerHtml. – Jean-François Beaulieu Dec 13 '11 at 23:36
  • You are making a very strong assumption about where your link is located. This might work fine, but it is very rigid and will break, e.g. if you add another column. The presented approach wouldn't, since it's *querying* for the link with minimal assumptions. – BrokenGlass Dec 13 '11 at 23:47
  • Actually, the only problem with this is the `First()`, which always returns the first element it finds in the whole document. I need something that gets the element for the current row. – Jean-François Beaulieu Dec 14 '11 at 00:37
  • Ahhh... Found it: `var link = cols.Descendants("a").First();` since I only want to search in the columns that I have already found. – Jean-François Beaulieu Dec 14 '11 at 01:42
  • This is an awesome answer, it worked perfectly. The only issue: replace .First with .FirstOrDefault, otherwise it will throw an exception when nothing matches. – Zia Ur Rahman Jan 13 '16 at 16:31
  • Great answer. +1 – AndyUK Apr 06 '17 at 08:51
4

Use an XPath expression to find it:

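 // SelectNodes returns null (not an empty collection) when nothing matches.
 HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@class='undMe']");
 if (links != null)
 {
      foreach (HtmlNode link in links)
      {
           HtmlAttribute att = link.Attributes["href"];
           Console.WriteLine(Regex.Match(att.Value, @"(?<=[?&]id=)\d+(?=&|#|$)").Value);
      }
 }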
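
If you only want the link that belongs to the row you are currently processing, the same idea can be scoped with a relative XPath. This is a minimal sketch, assuming each row holds at most one link of interest; RowLinkDemo and FirstHrefInRow are made-up names for this illustration, and the inline HTML is a shortened copy of the markup from the question:

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical demo class; only the HtmlAgilityPack calls are library API.
static class RowLinkDemo
{
    // Returns the href of the first <a href="..."> inside the given row, or null if there is none.
    public static string FirstHrefInRow(HtmlNode row)
    {
        // The leading "." keeps the search inside this row; a bare "//" would search the whole document.
        HtmlNode link = row.SelectSingleNode(".//a[@href]");
        return link == null ? null : link.GetAttributeValue("href", string.Empty);
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td>7</td><td>" +
                     "<a class=\"undMe\" href=\"/ice/player.htm?id=8475179\">D. Kulikov</a>" +
                     "</td><td>D</td></tr></table>");

        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
        {
            return; // no rows found
        }

        foreach (HtmlNode row in rows)
        {
            Console.WriteLine(FirstHrefInRow(row) ?? "(no link in this row)");
        }
    }
}
```

In the loop from the question, the equivalent call would be made on rows[i] (or on cols[j] for the name column), so the lookup no longer depends on the link being the first match in the document.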
csharptest.net