Removing useless TextNodes in HtmlAgilityPack

Question

I'm scraping a number of websites using HtmlAgilityPack. The problem is that it insists on inserting text nodes in most places, and these are either empty or contain nothing but \n, \r and whitespace.

They cause me issues when I'm counting child nodes, since Firebug doesn't show them but HtmlAgilityPack does.

Is there a way of telling HtmlAgilityPack to stop doing this, or at least of clearing out these text nodes? (I want to keep the USEFUL ones, though.) While we're here, I'd like the same thing for comment nodes and script tags.

score 2 · Answer 1 · answered Sep 03 '17 at 16:50

You can use the following extension method:

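using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HtmlNodeExtensions
{
    // Returns the direct children of the node, skipping every text node
    // (including text nodes that contain real content).
    public static List<HtmlNode> GetChildNodesDiscardingTextOnes(this HtmlNode node)
    {
        return node.ChildNodes.Where(n => n.NodeType != HtmlNodeType.Text).ToList();
    }
}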

And call it like this:

List<HtmlNode> nodes = someNode.GetChildNodesDiscardingTextOnes();

score 0 · Answer 2 · edited May 23 '17 at 12:16

There is a difference between "no whitespace" and "some whitespace" between two nodes, so all-whitespace text nodes are still needed and significant.

Couldn't you preprocess the HTML and remove all the nodes that you do not need before starting the "real scraping"?

See also this answer for the "how to remove".
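
A preprocessing pass along those lines could be sketched as follows. This is only an illustration under stated assumptions: the class name HtmlCleanup and the method name RemoveNoiseNodes are made up for this example and are not part of HtmlAgilityPack, and "useless" is taken to mean whitespace-only text nodes, comments and script elements.

```csharp
using System.Linq;
using HtmlAgilityPack;

static class HtmlCleanup
{
    // Hypothetical helper (not an HtmlAgilityPack API): strips whitespace-only
    // text nodes, comment nodes and <script> elements below the given root.
    public static void RemoveNoiseNodes(HtmlNode root)
    {
        var unwanted = root.Descendants()
            .Where(n =>
                (n.NodeType == HtmlNodeType.Text && string.IsNullOrWhiteSpace(n.InnerText)) ||
                n.NodeType == HtmlNodeType.Comment ||
                (n.NodeType == HtmlNodeType.Element && n.Name == "script"))
            .ToList(); // snapshot first, so removing nodes does not disturb the enumeration

        foreach (var node in unwanted)
        {
            node.Remove();
        }
    }
}
```

Calling HtmlCleanup.RemoveNoiseNodes(doc.DocumentNode) once after loading the document should leave text nodes with real content in place. Note that it changes the document itself, so the whitespace is also gone from anything you later read from it, which is the trade-off this answer warns about.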

answered Jul 05 '12 at 08:55

Hans Kesting

score 0 · Answer 3 · answered Jul 05 '12 at 09:12

Create an extension method that operates on a node's child collection (ChildNodes in HtmlAgilityPack) and uses some LINQ to filter out unwanted nodes. Then, when you traverse your tree, do something like this:

myNode.ChildNodes.FilterNodes().ForEach(x => {});
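
One way such a filter might look is sketched below. The name FilterNodes comes from the line above, but its body is an assumption: here it keeps element nodes and text nodes with real content, and drops whitespace-only text, comments and script elements. It is meant to be applied to ChildNodes, the child collection that HtmlAgilityPack exposes on a node, and it returns a List so that ForEach is available.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class HtmlNodeFilterExtensions
{
    // Hypothetical extension method (not part of HtmlAgilityPack).
    // Leaves the document untouched and returns a filtered copy of the sequence.
    public static List<HtmlNode> FilterNodes(this IEnumerable<HtmlNode> nodes)
    {
        return nodes
            .Where(n => n.NodeType == HtmlNodeType.Element ||
                        (n.NodeType == HtmlNodeType.Text &&
                         !string.IsNullOrWhiteSpace(n.InnerText)))
            .Where(n => n.Name != "script") // element names are stored in lower case
            .ToList();
    }
}
```

Unlike removing nodes from the tree, this keeps the original whitespace in the document and only hides it from the code that counts or walks the children.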

Onkelborg

score 0 · Answer 4 · answered Feb 24 '17 at 01:41

I am looking for a better answer. Here is my current method for child nodes such as table rows and table cells. Nodes are identified by their name (TR, TH, TD), so I strip out the #text nodes every time.

List<HtmlNode> rows = table.ChildNodes.Where(w => w.Name != "#text").ToList();

Sure, it is tedious, but it works, and it could be improved with an extension method.
