How to remove blank lines from HTML with HTMLAgilityPack?

Question

I have an HTML document that contains lots of needless blank lines which I'd like to remove. Here's a sample of the HTML:

<html>

<head>


</head>

<body>

<h1>Heading</h1>

<p>Testing

I've tried the following code, but it removes every newline. I only want to remove the blank lines.

static string RemoveLineReturns(string html)
{
    html = html.Replace(Environment.NewLine, "");
    return html;
}

Any idea how to do this with HTMLAgilityPack? Thanks, J.

http://stackoverflow.com/questions/7647716/how-to-remove-empty-lines-from-a-formatted-string — Xi Sigma, Apr 03 '15 at 14:46
You want to remove the blank lines or the nodes which are empty? — Rahul Tripathi, Apr 03 '15 at 14:46
Does this help: http://stackoverflow.com/questions/8743344/remove-whitespaces-and-newlines-when-parsing-with-htmlagilitypack ? — Rahul Tripathi, Apr 03 '15 at 14:51

score 5 · Answer 1 · answered Apr 04 '15 at 01:23

One possible way using Html Agility Pack:

var doc = new HtmlDocument();
//TODO: load your HtmlDocument here

//select all empty (white-space only) text nodes:
var xpath = "//text()[not(normalize-space())]";
var emptyNodes = doc.DocumentNode.SelectNodes(xpath);

//SelectNodes returns null when nothing matches, so guard before iterating
if (emptyNodes != null)
{
    //replace each empty text node with a single new-line text node
    foreach (HtmlNode emptyNode in emptyNodes)
    {
        emptyNode.ParentNode
                 .ReplaceChild(HtmlTextNode.CreateNode(Environment.NewLine), emptyNode);
    }
}

For use with a SQL query, I found that I had to use an empty string "" instead of Environment.NewLine: emptyNode.ParentNode.ReplaceChild(HtmlTextNode.CreateNode(""), emptyNode); — Rocky Raccoon, Dec 20 '18 at 14:38
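The answer above leaves loading and serializing as a TODO. A minimal end-to-end sketch might look like the following. The class and method names (HtmlBlankLineCleaner, CollapseEmptyTextNodes) and the replacement parameter are hypothetical, chosen for illustration. It assumes the HtmlAgilityPack NuGet package is referenced, and it uses doc.CreateTextNode and RemoveChild rather than the exact calls in the answer and comment.

```csharp
using System;
using HtmlAgilityPack;

static class HtmlBlankLineCleaner
{
    // Replaces every whitespace-only text node with `replacement`.
    // Passing null or "" removes those nodes entirely (the variant
    // mentioned in the comment above).
    public static string CollapseEmptyTextNodes(string html, string replacement)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Whitespace-only text nodes; SelectNodes returns null if none match.
        var emptyNodes = doc.DocumentNode.SelectNodes("//text()[not(normalize-space())]");
        if (emptyNodes == null)
            return html;

        foreach (HtmlNode node in emptyNodes)
        {
            if (string.IsNullOrEmpty(replacement))
                node.ParentNode.RemoveChild(node);
            else
                node.ParentNode.ReplaceChild(doc.CreateTextNode(replacement), node);
        }

        // Serialize the modified tree back to a string.
        return doc.DocumentNode.OuterHtml;
    }
}
```

Usage: HtmlBlankLineCleaner.CollapseEmptyTextNodes(html, Environment.NewLine) keeps one line break between elements. Note that this only touches text nodes that are entirely whitespace; blank lines inside a text node that also contains other text are left alone.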

score 2 · Accepted Answer · answered Apr 03 '15 at 14:51

I don't think that HTMLAgilityPack currently features a native solution for that.

For such scenarios I use the following Regex:

html = Regex.Replace(html, @"( |\t|\r?\n)\1+", "$1");

This collapses each run of repeated spaces, tabs, or line breaks into a single one, so the remaining whitespace and line endings are preserved.
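To make the behaviour concrete, here is a small sketch that applies the regex to a sample string. Because of the backreference, only runs of the same token collapse, so a line containing a single space between two newlines stays in place. The second method, RemoveBlankLines, is a hypothetical variant that is not part of the answer: it deletes lines that are empty or contain only spaces and tabs. All names in the block are illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

static class BlankLineDemo
{
    // The regex from the answer: collapses a run of the SAME whitespace
    // token (space, tab, or line break) into one occurrence.
    public static string CollapseRepeats(string html) =>
        Regex.Replace(html, @"( |\t|\r?\n)\1+", "$1");

    // Hypothetical variant (not from the answer): removes every line that
    // is empty or holds only horizontal whitespace, line break included.
    public static string RemoveBlankLines(string html) =>
        Regex.Replace(html, @"^[^\S\r\n]*\r?\n", "", RegexOptions.Multiline);

    static void Main()
    {
        // Contains empty lines and one line holding a single space.
        string html = "<html>\n\n<head>\n\n\n</head>\n \n<body>";

        Console.WriteLine(CollapseRepeats(html));
        Console.WriteLine("----");
        Console.WriteLine(RemoveBlankLines(html));
    }
}
```

The first call leaves the space-only line in the output, while the second removes it along with the empty lines. Which one fits depends on whether whitespace-only lines count as blank in your document.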

Darkseal

Worked really well, nice and simple solution. Thanks Darkseal! – bearaman Apr 08 '15 at 10:08
I tried the same approach but it didn't work for me; with the small change below it worked: Regex.Replace(html, @"( |\t|\r|\n)+", string.Empty) – Jenish Zinzuvadiya Sep 23 '20 at 11:18
