Screen scraping with htmlAgilityPack and XPath

Question

[This question has a relative that lives at: Selective screen scraping with HTMLAgilityPack and XPath ]

I have some HTML to parse that generally looks like this:

...
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
<tr>
<td><a href="" title="">Text Data here (1)</a></td>
<td>Text Data here(2)</td>
<td>Text Data here(3)</td>
<td>Text Data here(4)</td>
<td>Text Data here(5)</td>
<td>Text Data here(6)</td>
<td><a href="link here {1}" class="image"><img alt="" src="" /></a></td>
</tr>
...

I am looking for a way to parse it into meaningful chunks like this:

(1), (2), (3), (4), (5), (6), {1}CRLF
(1), (2), (3), (4), (5), (6), {1}CRLF
and so on

I have tried two ways:
way 1:

var dataList = currentDoc.DocumentNode.Descendants("tr")
                .Select
                 (
                  tr => tr.Descendants("td").Select(td => td.InnerText).ToList()
                 ).ToList();

This fetches the inner text of the tds but fails to fetch the link {1}. The result is a list of lists, one inner list per tr, which I can walk with a nested foreach.

way 2:

var dataList = currentDoc.DocumentNode
               .SelectNodes("//tr//td//text()|//tr//td//a//@href");

This does get me the link {1} and all the data, but the result is unorganized: everything comes back in one big flat chunk. Since the data within one tr is related, I lose that relation.

So, how can I solve this problem?

The (x) data in the TD and the {x} data in the HREF are different, so you need two pieces of code to get them. What exactly do you need? – Simon Mourier, Mar 14 '13 at 09:10

Sergey Berezovskiy · Accepted Answer · 2013-03-14T09:33:16.707


The following query selects an a element with a non-empty href attribute from each cell. If there is no such element, the inner text of the cell is used instead:

var dataList = 
     currentDoc.DocumentNode.Descendants("tr")
               .Select(tr => from td in tr.Descendants("td")
                             let a = td.SelectSingleNode("a[@href!='']")
                             select a == null ? td.InnerText : 
                                                a.Attributes["href"].Value);

Feel free to add ToList() calls.
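
To get the comma-separated, one-line-per-row output the question asks for, the per-row sequences can be joined with string.Join. The following is only a minimal sketch under some assumptions: the sample HTML, the example.com link, and the Trim() call are illustrative and not part of the original page, and GetAttributeValue is used here as a null-safe alternative to Attributes["href"].Value.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical sample input shaped like the rows in the question.
        var html = @"<table><tr>
<td><a href="""" title="""">Text 1</a></td>
<td>Text 2</td>
<td><a href=""http://example.com/1"" class=""image""><img alt="""" src="""" /></a></td>
</tr></table>";

        var currentDoc = new HtmlDocument();
        currentDoc.LoadHtml(html);

        // One inner sequence per tr, so the values of a row stay together.
        var rows = currentDoc.DocumentNode.Descendants("tr")
            .Select(tr => tr.Descendants("td")
                .Select(td =>
                {
                    // Direct child <a> with a non-empty href, if any.
                    var a = td.SelectSingleNode("a[@href!='']");
                    return a == null
                        ? td.InnerText.Trim()               // plain text cell
                        : a.GetAttributeValue("href", "");  // link cell
                }));

        // Console.WriteLine ends each row with the platform's line terminator.
        foreach (var row in rows)
            Console.WriteLine(string.Join(", ", row));
    }
}
```

Because the XPath expression is evaluated relative to each td, the link and the text values stay grouped by row, which is what the flat //tr//td//... query in the question loses.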

edited Mar 14 '13 at 09:33

answered Mar 14 '13 at 09:12

Sergey Berezovskiy


`var dataList = currentDoc.DocumentNode.Descendants("tr") .Select(tr => from td in tr.Descendants("td") let a = td.SelectSingleNode("a[starts-with(@class,'image')]") select a == null ? td.InnerText : a.Attributes["href"].Value);` does it. It's quirky. But works. Thanks. – Mar 14 '13 at 11:58
Any idea how to do this if I wanted to scrape data selectively? http://stackoverflow.com/questions/15404839/selective-screen-scraping-with-htmlagilitypack-and-xpath – Mar 14 '13 at 11:59
@AnubhavSaini hold on, I'll check – Sergey Berezovskiy Mar 14 '13 at 12:01
