I have an input file like the one below, and I am trying to extract its text and remove the HTML tags. I want the text of each `<p>` on its own line. Where a `<br/>` appears inside a `<p>`, the text on both sides should stay on the same line, but the `<br/>` tag itself should still be removed.
<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en" xmlns:tts="http://www.w3.org/ns/ttml#parameter"><head><styling><style id="b1"/></styling></head><body><div xml:lang="en" style="b1"><p begin="" end="0.143">HISTORY</p><p begin="0.143" end="0.286">HISTORY TV"</p><p begin="0.286" end="0.714">HISTORY TV" THIS</p><p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p><p begin="0.857" end="3">HISTORY TV" THIS<br/>WEEKEND ON</p><p begin="3" end="3.333">HISTORY TV" THIS<br/>WEEKEND ON C-SPAN3.</p><p begin="3.333" end="3.667">WEEKEND ON C-SPAN3.<br/>>>></p><p begin="3.667" end="4">WEEKEND ON C-SPAN3.<br/>>>> "THE</p><p begin="4" end="4.5">WEEKEND ON C-SPAN3.<br/>>>> "THE MARCH</p><p begin="4.5" end="5">WEEKEND ON C-SPAN3.<br/>>>> "THE MARCH ON</p><p begin="5" end="5.5">WEEKEND ON C-SPAN3.<br/>>>> "THE MARCH ON WASHINGTON"</p><p begin="5.5" end="5.667">>>> "THE MARCH ON WASHINGTON"<br/>FOR</p><p begin="5.667" end="5.833">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS</p><p begin="5.833" end="6">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND</p><p begin="6" end="6.2">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM</p><p begin="6.2" end="6.4">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS</p><p begin="6.4" end="7">>>> "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS 49</p><p begin="7" end="8">FOR JOBS AND FREEDOM WAS 49<br/>YEARS</p><p begin="8" end="8.5">FOR JOBS AND FREEDOM WAS 49<br/>YEARS AGO.</p><p begin="8.5" end="8.75">YEARS AGO.<br/>ON</p><p begin="8.75" end="9">YEARS AGO.<br/>ON AUGUST</p><p begin="9" end="13">YEARS AGO.<br/>ON AUGUST 28th,</p><p begin="13" end="13.333">YEARS AGO.<br/>ON AUGUST 28th, 1963.</p><p begin="13.333" end="13.5">ON AUGUST 28th, 1963.<br/>THE</p><p begin="13.5" end="13.667">ON AUGUST 28th, 1963.<br/>THE MARCH</p><p begin="13.667" end="13.833">ON AUGUST 28th, 1963.<br/>THE MARCH WAS</p><p begin="13.833" end="14">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANIZED</p><p begin="14" end="14.167">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANIZED TO</p><p begin="14.167" end="14.667">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANIZED TO PUSH</p><p begin="14.667" end="14.833">THE MARCH WAS ORKGANIZED TO PUSH<br/>FOR</p>
So in the end I would like to have:
HISTORY
HISTORY TV"
HISTORY TV" THIS
HISTORY TV" THIS WEEKEND
HISTORY TV" THIS WEEKEND ON
HISTORY TV" THIS WEEKEND ON C-SPAN3.
...etc
How do I accomplish this task?
I used this code:

    import re
    import os
    def remove_html_tags(data):
        p = re.compile(r'<.*?>')
        return p.sub(' ', str(data)).strip()
    directory = './reprocess'
    for filename in os.listdir(directory):
        if filename.endswith(".dfxp"):
            print("Processing: {}".format(filename))
            with open("./reprocess/"+filename, "r") as inputFile:
                data = inputFile.read().splitlines()
            new_data = ""
            for line in data:
                new_data = new_data + remove_html_tags(line) + "\n"
            with open("./rmout/"+filename, "w") as text_file:
                text_file.write(new_data)
But it gave me the wrong output:
HISTORY TV"
WEEKEND ON
HISTORY TV" THIS
"THE MARCH
"THE MARCH ON
WEEKEND ON C-SPAN3.
FOR JOBS AND FREEDOM WAS
"THE MARCH ON WASHINGTON"
YEARS
FOR JOBS AND FREEDOM WAS 49
ON AUGUST 28th,
YEARS AGO.
THE MARCH WAS ORKGANIZED TO
ON AUGUST 28th, 1963.
FOR COMPREHENSIVE CIVIL
THE MARCH WAS ORKGANIZED TO PUSH
INCLUDING PUBLIC
FOR COMPREHENSIVE CIVIL RIGHTS
DESEGREGATION,
DESEGREGATION, VOTING
`<br>` tag can be evil in bs4 as indicated by comments in [this answer](https://stackoverflow.com/a/34640357/3218693). This `find_all-replace_with` approach somehow failed to remove `<br>` tags for me (maybe the bs4 API changed?). I wonder if a clean and reliable solution exists as of 2020 (BeautifulSoup v4.9.1). – Bill Huang Oct 06 '20 at 08:50
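
One way to sidestep the bs4 `<br>` behaviour entirely is to parse the file as XML with the standard library. The following is only a minimal sketch, assuming the `.dfxp` file is well-formed XML like the sample in the question. The helper names `local_name`, `paragraph_text`, and `extract_lines` are made up for this illustration and do not come from any library. Each `<p>` becomes one output line, and each `<br/>` inside it is replaced by a single space.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def paragraph_text(p):
    """Flatten one <p> element, turning each <br/> into a single space."""
    parts = [p.text or ""]
    for child in p:
        if local_name(child.tag) == "br":
            parts.append(" ")
        else:
            # Any other inline element: keep its text content.
            parts.append("".join(child.itertext()))
        # Text that follows the child but is still inside the <p>.
        parts.append(child.tail or "")
    # Collapse runs of whitespace so each paragraph is one clean line.
    return " ".join("".join(parts).split())


def extract_lines(xml_text):
    """Return one string per <p> element, in document order."""
    root = ET.fromstring(xml_text)
    return [paragraph_text(el) for el in root.iter() if local_name(el.tag) == "p"]


# Shortened, well-formed version of the sample from the question.
sample = (
    '<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en"><body><div>'
    '<p begin="0.286" end="0.714">HISTORY TV" THIS</p>'
    '<p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p>'
    "</div></body></tt>"
)
print("\n".join(extract_lines(sample)))
```

To process real files, the string returned by `inputFile.read()` could be passed to `extract_lines` as a whole (not line by line), and the joined result written to the output file. Matching on the local tag name avoids hard-coding the TTML namespace, which is a design choice made for this sketch and not a requirement.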