I have a large number of HTML files, some of which contain a section starting with "<div> specific text" and ending with </div>. I'd like to use a bash script to remove these sections.
There are many other div sections, some of which overlap with the one I'm interested in.
I'd like to go through each file, writing to a new file until reaching the start of the specific section; then continue without writing, incrementing a counter at each <div> and decrementing it at each </div>, until the counter reaches zero; then resume writing the rest of the file.
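A minimal sketch of that counter idea, using awk from bash. It is only a sketch under stated assumptions: tags are lowercase, attribute values contain no "<" or ">", and the function name strip_div and its marker argument are hypothetical names chosen for illustration. It splits the input at every "<" so that each record begins with a tag name, which also copes with many tags sharing one line.

```shell
#!/usr/bin/env bash
# Sketch: drop every <div ...> element whose opening tag contains a marker
# string, counting nested <div>/</div> pairs so the whole element is skipped.
# Assumptions: lowercase tags; no "<" or ">" inside attribute values;
# the marker contains no backslashes (awk -v would interpret them).

strip_div() {
  # $1 = text to look for inside the opening <div ...> tag (hypothetical parameter)
  awk -v marker="$1" '
    BEGIN { RS = "<"; ORS = "" }     # one record per "<"; print adds no newline
    NR == 1 { print; next }          # text before the first "<" is copied as is
    {
      tag = substr($0, 1, index($0, ">"))       # tag name plus attributes
      is_open  = (tag ~ /^div[ \t\r\n>]/)       # <div> or <div ...>
      is_close = (tag ~ /^\/div[ \t\r\n>]/)     # </div>

      # Start skipping at an opening div that carries the marker.
      if (depth == 0 && is_open && index(tag, marker)) { depth = 1; next }

      if (depth > 0) {
        if (is_open) depth++
        # At the matching </div>, keep only the text that follows the tag.
        if (is_close && --depth == 0) print substr($0, index($0, ">") + 1)
        next
      }
      print "<" $0                   # outside a skipped section: copy unchanged
    }
  '
}

# Possible per-file use (output file names are an assumption):
#   for f in *.html; do
#     strip_div 'class="gmail_quote"' < "$f" > "${f%.html}.stripped.html"
#   done
```

Because the counter only restarts once it has returned to zero, marked sections nested inside a removed section disappear with it, and several separate marked sections in one file are each removed in turn.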
What are the most suitable approaches to use for this purpose? Speed isn't a priority.
Or is there a better way?
Sample input HTML:
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Re: Something</title>
<link rel="important stylesheet" href="">
<style>div.headerdisplayname {font-weight:bold;}</style></head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><div class="headerdisplayname" style="display:inline;">Subject: </div>Re: Something</td></tr><tr><td><div class="headerdisplayname" style="display:inline;">From: </div>sender@isp.com></td></tr><tr><td><div class="headerdisplayname" style="display:inline;">Date: </div>06/12/16 15:18</td></tr></table><table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part2"><tr><td><div class="headerdisplayname" style="display:inline;">To: </div>Sender 2 <sender2@isp.com></td></tr></table><br>
<div class="moz-text-html" lang="x-unicode"><div dir="ltr">Dear Sender 2<div><br></div><div>A message</div><div><br></div><div>Mesage 1</div><div><ul><li>Message 2<br></li><li>Message 3<br></li><li>Message 4</li><li>Message 5<br></li><li>Message 6<br></li><li>Message 7<br></li><li>Message 8<br></li><li>Message 9<br></li></ul></div><div>Message 10</div><div><br></div><div>Message 11</div><div><br></div><div>Message 12</div><div><br></div><div>Message 13</div><div><br></div><div>Sender</div><div><br></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On 5 December 2016 at 19:20, Sender 2 <span dir="ltr"><<a href="mailto:sender2@isp.com" target="_blank">sender2@isp.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><font color="black" face="Arial, Helvetica, sans-serif">
<div style="font-family:arial,helvetica;font-size:10pt;color:black">
<div>
<font color="black" face="Arial, Helvetica, sans-serif">Dear Sender 1,
<div><br>
</div>
<div>Reply 1</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div>Reply 2</div>
<div><br>
</div>
<div>Sender 2</div><span class="HOEnZb"><font color="#888888">
<div><br>
</div>
<div>Sender 2</div>
<div>+telephone</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
<div><br>
</div>
</font></span></font>
</div>
</div>
</font></blockquote></div><br></div>
</div></body>
</html>
</table></div>
Removing the <div class="gmail_quote"> section gives output:
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Re: Something</title>
<link rel="important stylesheet" href="">
<style>div.headerdisplayname {font-weight:bold;}</style></head>
<body>
<table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part1"><tr><td><div class="headerdisplayname" style="display:inline;">Subject: </div>Re: Something</td></tr><tr><td><div class="headerdisplayname" style="display:inline;">From: </div>sender@isp.com></td></tr><tr><td><div class="headerdisplayname" style="display:inline;">Date: </div>06/12/16 15:18</td></tr></table><table border=0 cellspacing=0 cellpadding=0 width="100%" class="header-part2"><tr><td><div class="headerdisplayname" style="display:inline;">To: </div>Sender 2 <sender2@isp.com></td></tr></table><br>
<div class="moz-text-html" lang="x-unicode"><div dir="ltr">Dear Sender 2<div><br></div><div>A message</div><div><br></div><div>Mesage 1</div><div><ul><li>Message 2<br></li><li>Message 3<br></li><li>Message 4</li><li>Message 5<br></li><li>Message 6<br></li><li>Message 7<br></li><li>Message 8<br></li><li>Message 9<br></li></ul></div><div>Message 10</div><div><br></div><div>Message 11</div><div><br></div><div>Message 12</div><div><br></div><div>Message 13</div><div><br></div><div>Sender</div><div><br></div><div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br>
</div></body>
</html>
</table></div>
Note that there may be multiple nested sections to remove.