Selective screen scraping with HTMLAgilityPack and XPath

Question

[This question has a relative that lives at: Screen scraping with htmlAgilityPack and XPath ]

I have some HTML to parse that generally looks like this:

...
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
...

I am looking for a way to parse it into meaningful chunks, keeping only selected data from each row: the first two td values and the last two td values:

(1), (2), (6), {1}CRLF
(1), (2), (6), {1}CRLF
and so on

I have tried two ways. Way 1:

var dataList = currentDoc.DocumentNode.Descendants("tr")
    .Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList())
    .ToList();

This fetches the inner text of each td, but fails to fetch the link {1}. It produces a list of lists, one inner list per tr, which I can walk with nested foreach loops.

way 2:

var dataList = currentDoc.DocumentNode
           .SelectNodes("//tr//td//text()|//tr//td//a//@href");

This does get me the link {1} and all the data, but the result is unorganized: everything comes back as one flat chunk. The data within one tr is related, and I lose that relation.

So how can I get only the data I am interested in: the first two and the last two columns of each row?

score 0 · Accepted Answer · answered Mar 14 '13 at 12:15

The following code selects the data from the first two and the last two <td> nodes of each row:

html.DocumentNode.Descendants("tr")
    .Select(tr => 
       from td in tr.SelectNodes("td[position() < 3 or position() > last() - 2]")
       let a = td.SelectSingleNode("a[@href!='']")
       select a == null ? td.InnerText : a.Attributes["href"].Value);

This XPath filters the td nodes by position, keeping the first two (position() < 3) and the last two (position() > last() - 2):

td[position() < 3 or position() > last() - 2]
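For a fuller picture, here is a minimal sketch of how the position filter could be wrapped so that each row stays together as its own list. The class name RowScraper and the method ExtractRows are hypothetical names chosen for illustration, not part of HtmlAgilityPack. It also assumes that some rows may contain no td cells at all (for example a header row of th cells), in which case SelectNodes returns null and the row is skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class RowScraper
{
    // Hypothetical helper: returns one list per <tr>, holding the values of
    // the first two and last two <td> cells of that row.
    public static List<List<string>> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();
        foreach (var tr in doc.DocumentNode.Descendants("tr"))
        {
            // SelectNodes returns null (not an empty collection) when nothing
            // matches, e.g. for a row that only contains <th> cells.
            var cells = tr.SelectNodes("td[position() < 3 or position() > last() - 2]");
            if (cells == null)
                continue;

            rows.Add(cells.Select(td =>
            {
                // Prefer a non-empty href on a direct <a> child; otherwise
                // fall back to the cell's text.
                var a = td.SelectSingleNode("a[@href!='']");
                return a == null
                    ? td.InnerText.Trim()
                    : a.GetAttributeValue("href", "");
            }).ToList());
        }
        return rows;
    }
}
```

Each inner list can then be written as one output line, for instance with string.Join(", ", row), which keeps the values from a single tr together.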

answered Mar 14 '13 at 12:15

Sergey Berezovskiy

@AnubhavSaini I tested this code on your sample html, works just fine and returns four strings for each row – Sergey Berezovskiy Mar 14 '13 at 12:50

Okay, it might work. I have corrupt data so I can't check it, but it looks right enough. – Mar 14 '13 at 13:20
