Remove HTML MarkUp

Question

I am automating the marking procedure for a Python class. However, when I download the submissions, they include HTML markup that the students may have inadvertently submitted their solutions in, such as:

<!DOCTYPE html><html><head><meta charset="UTF-8"></head><body><p><span style="font-family:'courier new', courier, monospace;">print("Bob and Bill Tiling Solutions Inc.")</span></p>
<p><span style="font-family:'courier new', courier, monospace;">h=int(input("Height   (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">w=int(input("Width    (m):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">p=int(input("Cost ($/m^2):"))</span></p>
<p><span style="font-family:'courier new', courier, monospace;">print("The total cost for this job: $" + str(h*w*p+20))</span></p>
<p> </p></body></html>

Is there any way I can remove the markup in batch so that all that is left is:

print("Bob and Bill Tiling Solutions Inc.")
h=int(input("Height   (m):"))
w=int(input("Width    (m):"))
p=int(input("Cost ($/m^2):"))
print("The total cost for this job: $" + str(h*w*p+20))

If there is a third-party utility that does this I would be happy to download it.

I have tried using regular expressions through findstr, to no avail. (My search string is "<[^>]*>", but I do not know how to use findstr to remove all matches from the text file.)

Any suggestions are welcome.

score 1 · Accepted Answer · answered Mar 15 '15 at 04:46


Here's a sed script (I use GNU sed) which I adapted from Eric Pement's sed one-liners:

The sed command line:

sed -f dehtml.sed yourfilename

The file dehtml.sed:

:a
s/<[^>]*>//g;/</N;//ba
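
The sed script deletes every <...> tag and pulls in the next line whenever a tag is split across a line break. Since the submissions are Python anyway, here is a rough sketch of the same idea in Python, reusing the "<[^>]*>" pattern from the question. It is not part of the original sed answer: the names strip_markup and strip_folder, the "*.py" file pattern, and the UTF-8 encoding are all assumptions for illustration. Unlike the sed script, it also decodes HTML entities such as &lt;, on the assumption that the editor escaped any < or > the students typed.

```python
import html
import re
from pathlib import Path

# Same pattern as in the question. A negated character class also matches
# newlines in Python, so tags that span several lines are removed as well.
TAG_RE = re.compile(r"<[^>]*>")


def strip_markup(text: str) -> str:
    """Delete HTML tags, then decode entities (hypothetical helper)."""
    no_tags = TAG_RE.sub("", text)
    # Decode only after the tags are gone, so that a decoded "<" from
    # "&lt;" is never mistaken for the start of a tag.
    decoded = html.unescape(no_tags)
    # "&nbsp;" decodes to a non-breaking space; turn it into a plain space.
    return decoded.replace("\xa0", " ")


def strip_folder(src: str, dst: str) -> None:
    """Clean every *.py file in src and write the result into dst.

    The "*.py" pattern and UTF-8 encoding are assumptions; adjust them to
    match how the submissions are actually downloaded.
    """
    out = Path(dst)
    out.mkdir(parents=True, exist_ok=True)
    for path in Path(src).glob("*.py"):
        cleaned = strip_markup(path.read_text(encoding="utf-8"))
        (out / path.name).write_text(cleaned, encoding="utf-8")
```

Calling strip_folder("submissions", "cleaned") (both folder names are placeholders) would leave the originals untouched and write the stripped copies to a separate folder, which seems safer for marking than editing in place.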


Magoo


Thanks. I'll download it – Monacraft Mar 15 '15 at 05:56
