
I have a large number of HTML files, some of which contain a section starting with "<div> specific text" and ending with the matching </div>. I'd like to use a bash script to remove these sections.

There are many other div sections, some of which are nested inside, or wrapped around, the one I'm interested in.

I'd like to go through each file, copying it to a new file until I reach the start of the specific section; then stop copying while incrementing a counter at each <div> and decrementing it at each </div>; and resume copying once the counter returns to zero.
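
The counter idea described above can be sketched roughly as follows. This is a minimal sketch in Python rather than bash, since the bookkeeping is easier to get right there. The function name `strip_div_sections` and the `marker` argument are hypothetical, and the sketch assumes that the "specific text" appears inside the opening `<div ...>` tag (as `class="gmail_quote"` does in the sample) and that no attribute value contains a `>` character.

```python
import re

# Matches an opening <div ...> tag or a closing </div> tag, case-insensitively.
DIV_TAG = re.compile(r"<div\b[^>]*>|</div\s*>", re.IGNORECASE)


def strip_div_sections(html, marker):
    """Return html without any <div> section whose opening tag contains marker.

    Hypothetical helper: it counts <div>/</div> tags textually and does not
    parse the HTML, so it relies on the divs in the removed section balancing.
    """
    out = []    # pieces of text that are kept
    depth = 0   # 0 = copying; >0 = inside a section being removed
    pos = 0     # start of the text that has not been copied yet
    for match in DIV_TAG.finditer(html):
        tag = match.group(0)
        is_open = not tag.startswith("</")
        if depth == 0:
            if is_open and marker in tag:
                # Keep everything up to the unwanted section, then start counting.
                out.append(html[pos:match.start()])
                depth = 1
            # Any other div tag at depth 0 is ordinary content and stays in place.
        else:
            depth += 1 if is_open else -1
            if depth == 0:
                # The matching </div> was found; resume copying after it.
                pos = match.end()
    if depth == 0:
        out.append(html[pos:])
    # If depth is still >0 here, the section never closed and the tail is dropped.
    return "".join(out)
```

For a directory of files, one could read each file's text, pass it through `strip_div_sections(text, 'class="gmail_quote"')`, and write the result to a new file. Because the loop keeps scanning after the counter returns to zero, several matching sections in one file are all removed.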

What are the most suitable approaches to use for this purpose? Speed isn't a priority.

Or is there a better way?

Sample input HTML:

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Re: Something</title>
<link rel="important stylesheet" href="">
<style>div.headerdisplayname {font-weight:bold;}</style></head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><div class="headerdisplayname" style="display:inline;">Subject: </div>Re: Something</td></tr><tr><td><div class="headerdisplayname" style="display:inline;">From: </div>sender@isp.com></td></tr><tr><td><div class="headerdisplayname" style="display:inline;">Date: </div>06/12/16 15:18</td></tr></table><table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part2"><tr><td><div class="headerdisplayname" style="display:inline;">To: </div>Sender 2 <sender2@isp.com></td></tr></table><br>
<div class="moz-text-html"  lang="x-unicode"><div dir="ltr">Dear Sender 2<div><br></div><div>A message</div><div><br></div><div>Mesage 1</div><div><ul><li>Message 2<br></li><li>Message 3<br></li><li>Message 4</li><li>Message 5<br></li><li>Message 6<br></li><li>Message 7<br></li><li>Message 8<br></li><li>Message 9<br></li></ul></div><div>Message 10</div><div><br></div><div>Message 11</div><div><br></div><div>Message 12</div><div><br></div><div>Message 13</div><div><br></div><div>Sender</div><div><br></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 5 December 2016 at 19:20, Sender 2 <span dir="ltr">&lt;<a href="mailto:sender2@isp.com" target="_blank">sender2@isp.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><font color="black" face="Arial, Helvetica, sans-serif">
<div style="font-family:arial,helvetica;font-size:10pt;color:black">
<div>
<font color="black" face="Arial, Helvetica, sans-serif">Dear Sender 1,
<div><br>
</div>
<div>Reply 1</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div>Reply 2</div>
<div><br>
</div>
<div>Sender 2</div><span class="HOEnZb"><font color="#888888">
<div><br>
</div>
<div>Sender 2</div>
<div>+telephone</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</font></span></font>
</div>
</div>
</font></blockquote></div><br></div>
</div></body>
</html>
</table></div>

Removing the <div class="gmail_quote"> section gives this output:

<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Re: Something</title>
<link rel="important stylesheet" href="">
<style>div.headerdisplayname {font-weight:bold;}</style></head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><div class="headerdisplayname" style="display:inline;">Subject: </div>Re: Something</td></tr><tr><td><div class="headerdisplayname" style="display:inline;">From: </div>sender@isp.com></td></tr><tr><td><div class="headerdisplayname" style="display:inline;">Date: </div>06/12/16 15:18</td></tr></table><table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part2"><tr><td><div class="headerdisplayname" style="display:inline;">To: </div>Sender 2 <sender2@isp.com></td></tr></table><br>
<div class="moz-text-html"  lang="x-unicode"><div dir="ltr">Dear Sender 2<div><br></div><div>A message</div><div><br></div><div>Mesage 1</div><div><ul><li>Message 2<br></li><li>Message 3<br></li><li>Message 4</li><li>Message 5<br></li><li>Message 6<br></li><li>Message 7<br></li><li>Message 8<br></li><li>Message 9<br></li></ul></div><div>Message 10</div><div><br></div><div>Message 11</div><div><br></div><div>Message 12</div><div><br></div><div>Message 13</div><div><br></div><div>Sender</div><div><br></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br>
</div></body>
</html>
</table></div>

Note that there may be multiple nested sections to remove.

  • Is "use a proper parser" an option, or not? If so, this is probably a Stack Overflow question. – Michael Homer Jan 28 '17 at 06:54
  • To just remove the text is easy enough with something like: `cat test.html | sed 's#<div>remove this</div>##g'`. – Tigger Jan 28 '17 at 08:11
  • 1) We can't help unless you show us an example of your input and the output you would want to see from it. 2) You will need to use a dedicated HTML parser and write a script. – terdon Jan 28 '17 at 13:01
  • Tigger's suggestion won't work because of repeated nested sections, all of which end in </div>. I need to find the matching end. – user173283 Jan 28 '17 at 15:10
  • Do these strings have to be on the same line? An example of the output would be useful for understanding the problem. – fusion.slope Jan 28 '17 at 16:54
  • @terdon I've posted the before and after code, produced by manually following my proposed approach. I hope that helps clarify. – user173283 Jan 28 '17 at 17:14
  • @fusion.slope Assume the strings can be anywhere in the file. – user173283 Jan 28 '17 at 17:14
  • How about some specific responses? I tried [pup](https://github.com/ericchiang/pup): `cat input.html | pup ':not(div.gmail_quote)' > output.html` returned the entire HTML unchanged, whereas `cat input.html | pup 'div.gmail_quote' > output.html` returned the unwanted section. Perhaps someone could suggest a suitable tool? – user173283 Jan 29 '17 at 09:07
  • It's a pity this isn't well formed XML. Would have been really easy to do with e.g. XMLStarlet otherwise. – Kusalananda Jan 29 '17 at 09:55
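
Several comments above suggest a dedicated HTML parser. As one possible illustration, and not a tool anyone in the thread proposed, the sketch below uses Python's standard-library `html.parser`, which tolerates markup that is not well-formed XML. The names `DivStripper` and `strip_divs_by_class` are hypothetical. The sketch assumes the unwanted section is identified by a class name on the `<div>`; it re-emits everything else as it was read, except that closing tags are written back in normalized lowercase form and CDATA sections are not handled.

```python
from html.parser import HTMLParser


class DivStripper(HTMLParser):
    """Re-emit the input, skipping every <div> whose class list has css_class."""

    def __init__(self, css_class):
        super().__init__(convert_charrefs=False)  # keep &lt; etc. as written
        self.css_class = css_class
        self.depth = 0   # >0 while inside a section being removed
        self.out = []

    def _emit(self, text):
        # Only copy text while outside an unwanted section.
        if self.depth == 0:
            self.out.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.depth > 0 or self.css_class in classes:
                self.depth += 1   # entering, or nesting deeper in, the section
                return
        self._emit(self.get_starttag_text())  # original tag text, unchanged

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are copied once, as written.
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1   # leaving one nesting level of the section
            return
        self._emit(f"</{tag}>")

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit(f"&{name};")

    def handle_charref(self, name):
        self._emit(f"&#{name};")

    def handle_comment(self, data):
        self._emit(f"<!--{data}-->")

    def handle_decl(self, decl):
        self._emit(f"<!{decl}>")

    def handle_pi(self, data):
        self._emit(f"<?{data}>")


def strip_divs_by_class(html, css_class):
    parser = DivStripper(css_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_divs_by_class(text, "gmail_quote")` on each file's contents would be the intended use. Matching on the parsed class list, rather than on raw text, means attribute order and quoting style do not matter.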
