I have an input file like the one below, and I am trying to extract its text and remove the HTML tags. I want the text of each p element on its own line. Text separated by a br should stay on the same line, but the br tag itself should still be removed.

<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en" xmlns:tts="http://www.w3.org/ns/ttml#parameter"><head><styling><style id="b1"/></sty    ling></head><body><div xml:lang="en" style="b1"><p begin="" end="0.143">HISTORY</p><p begin="0.143" end="0.286">HISTORY TV"</p><p begin=    "0.286" end="0.714">HISTORY TV" THIS</p><p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p><p begin="0.857" end="3">HISTORY TV    " THIS<br/>WEEKEND ON</p><p begin="3" end="3.333">HISTORY TV" THIS<br/>WEEKEND ON C-SPAN3.</p><p begin="3.333" end="3.667">WEEKEND ON C-    SPAN3.<br/>&gt;&gt;&gt;</p><p begin="3.667" end="4">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE</p><p begin="4" end="4.5">WEEKEND ON C-SPA    N3.<br/>&gt;&gt;&gt; "THE MARCH</p><p begin="4.5" end="5">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE MARCH ON</p><p begin="5" end="5.5">W    EEKEND ON C-SPAN3.<br/>&gt;&gt;&gt; "THE MARCH ON WASHINGTON"</p><p begin="5.5" end="5.667">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>F    OR</p><p begin="5.667" end="5.833">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS</p><p begin="5.833" end="6">&gt;&gt;&gt; "THE MAR    CH ON WASHINGTON"<br/>FOR JOBS AND</p><p begin="6" end="6.2">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM</p><p begin    ="6.2" end="6.4">&gt;&gt;&gt; "THE MARCH ON WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS</p><p begin="6.4" end="7">&gt;&gt;&gt; "THE MARCH O    N WASHINGTON"<br/>FOR JOBS AND FREEDOM WAS 49</p><p begin="7" end="8">FOR JOBS AND FREEDOM WAS 49<br/>YEARS</p><p begin="8" end="8.5">FO    R JOBS AND FREEDOM WAS 49<br/>YEARS AGO.</p><p begin="8.5" end="8.75">YEARS AGO.<br/>ON</p><p begin="8.75" end="9">YEARS AGO.<br/>ON AUG    UST</p><p begin="9" end="13">YEARS AGO.<br/>ON AUGUST 28th,</p><p begin="13" end="13.333">YEARS AGO.<br/>ON AUGUST 28th, 1963.</p><p beg    in="13.333" end="13.5">ON AUGUST 28th, 1963.<br/>THE</p><p begin="13.5" end="13.667">ON AUGUST 28th, 1963.<br/>THE MARCH</p><p begin="13    .667" end="13.833">ON AUGUST 28th, 
1963.<br/>THE MARCH WAS</p><p begin="13.833" end="14">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANI    ZED</p><p begin="14" end="14.167">ON AUGUST 28th, 1963.<br/>THE MARCH WAS ORKGANIZED TO</p><p begin="14.167" end="14.667">ON AUGUST 28th    , 1963.<br/>THE MARCH WAS ORKGANIZED TO PUSH</p><p begin="14.667" end="14.833">THE MARCH WAS ORKGANIZED TO PUSH<br/>FOR</p>

So in the end I would like to have

HISTORY
HISTORY TV"
HISTORY TV" THIS
HISTORY TV" THIS WEEKEND
HISTORY TV" THIS WEEKEND ON
HISTORY TV" THIS WEEKEND ON C-SPAN3.
...etc

How do I accomplish this task?

I used this code

import re
import os

def remove_html_tags(data):
    p = re.compile(r'<.*?>')
    return p.sub(' ', str(data)).strip()

directory = './reprocess'
for filename in os.listdir(directory):
    if filename.endswith(".dfxp"):
        print("Processing: {}".format(filename))
        with open("./reprocess/"+filename, "r") as inputFile:
            data = inputFile.read().splitlines()
            new_data = ""
            for line in data:
                new_data = new_data + remove_html_tags(line) + "\n"
        with open("./rmout/"+filename, "w") as text_file:
            text_file.write(new_data)

But it gave me a horrible output

HISTORY TV"
WEEKEND ON
HISTORY TV" THIS
 "THE MARCH
 "THE MARCH ON
WEEKEND ON C-SPAN3.
FOR JOBS AND FREEDOM WAS
 "THE MARCH ON WASHINGTON"
YEARS
FOR JOBS AND FREEDOM WAS 49
ON AUGUST 28th,
YEARS AGO.
THE MARCH WAS ORKGANIZED TO
ON AUGUST 28th, 1963.
FOR COMPREHENSIVE CIVIL
THE MARCH WAS ORKGANIZED TO PUSH
INCLUDING PUBLIC
FOR COMPREHENSIVE CIVIL RIGHTS
DESEGREGATION,
DESEGREGATION, VOTING
Elly Sapp
  • Do not use regular expressions to parse HTML files. Use `BeautifulSoup`. – DYZ Oct 06 '20 at 03:54
  • The `<br>` tag can be evil in bs4, as indicated by the comments in [this answer](https://stackoverflow.com/a/34640357/3218693). This `find_all`/`replace_with` approach somehow failed to remove the `<br>` tags for me (maybe the bs4 API changed?). I wonder if a clean and reliable solution exists as of 2020 (BeautifulSoup v4.9.1).
    – Bill Huang Oct 06 '20 at 08:50

2 Answers

Digest

The code uses bs4 (BeautifulSoup4, official docs) and consists of 3 major steps:

  1. Pre-soup data cleansing: sometimes it is more convenient to clean some of the data in the raw text than in the soup. If this is the case, don't hesitate to do it.
  2. Construct the soup (DOM)
  3. Soup element extraction and post-processing the extracted text.

Code

Disclaimer: Test extensively and always expect exceptions. The problem solvers cannot foresee problems not appearing in the sample data.

import re

import bs4

# raw data (placeholder: paste the TTML sample from the question here)
html = "(as provided)"

# 1. cleansing

# (1) remove known unwanted patterns
html = html.replace("    ", "")
html = html.replace("&gt;&gt;&gt;", "")
# remove <br> tags (can also remove after the soup is built)
html = re.sub(r"<br\s*/?>", " ", html)  # careful! error-prone!

# (2) regularize multiple spaces
html = re.sub(r"\s{2,}", " ", html)

# 2. construct soup (DOM)
soup = bs4.BeautifulSoup(html, 'html.parser')

# 3. extract text in target elements    
ls_lines = []
for el in soup.find_all("p"):
    ls_lines.append(el.get_text().strip())

# check
for line in ls_lines:
    print(line)

Output

The output now looks decent. However, that is only because a small sample dataset has few problems to find. In real-world cases, a lot more preprocessing and more element selection rules are likely to be required. That part is outside the scope of this question.

HISTORY
HISTORY TV"
HISTORY TV" THIS
HISTORY TV" THIS WEEKEND
HISTORY TV" THIS WEEKEND ON
HISTORY TV" THIS WEEKEND ON C-SPAN3.
WEEKEND ON C-SPAN3.
WEEKEND ON C-SPAN3. "THE
WEEKEND ON C-SPAN3. "THE MARCH
WEEKEND ON C-SPAN3. "THE MARCH ON
WEEKEND ON C-SPAN3. "THE MARCH ON WASHINGTON"
"THE MARCH ON WASHINGTON" FOR
"THE MARCH ON WASHINGTON" FOR JOBS
"THE MARCH ON WASHINGTON" FOR JOBS AND
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS
"THE MARCH ON WASHINGTON" FOR JOBS AND FREEDOM WAS 49
FOR JOBS AND FREEDOM WAS 49 YEARS
FOR JOBS AND FREEDOM WAS 49 YEARS AGO.
YEARS AGO. ON
YEARS AGO. ON AUGUST
YEARS AGO. ON AUGUST 28th,
YEARS AGO. ON AUGUST 28th, 1963.
ON AUGUST 28th, 1963. THE
ON AUGUST 28th, 1963. THE MARCH
ON AUGUST 28th, 1963. THE MARCH WAS
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO
ON AUGUST 28th, 1963. THE MARCH WAS ORKGANIZED TO PUSH
THE MARCH WAS ORKGANIZED TO PUSH FOR
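
Because TTML is XML, the regex-based `<br>` removal flagged as error-prone in step 1 can arguably be avoided by letting an XML parser separate the text from the tags. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree`, not the bs4 approach from the code above. `ttml_lines` and `SAMPLE` are hypothetical names, and the sample is a shortened, well-formed stand-in for the question's data. The excerpt pasted in the question is cut off before its closing tags and contains stray line-wrap spaces, so it would need cleaning before a strict XML parser accepts it. Unlike the code above, this sketch keeps the `>>>` markers.

```python
import xml.etree.ElementTree as ET

# Default namespace declared on the <tt> root element in the question's file.
TTML_NS = "{http://www.w3.org/ns/ttml}"

# Hypothetical, well-formed sample in the same shape as the question's data.
SAMPLE = (
    '<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en"><body><div>'
    '<p begin="0.714" end="0.857">HISTORY TV" THIS<br/>WEEKEND</p>'
    '<p begin="3.333" end="3.667">WEEKEND ON C-SPAN3.<br/>&gt;&gt;&gt;</p>'
    '</div></body></tt>'
)


def ttml_lines(xml_text):
    """Return one string per <p>, with each <br/> treated as a single space."""
    root = ET.fromstring(xml_text)
    lines = []
    for p in root.iter(TTML_NS + "p"):
        # itertext() yields the text fragments before and after child
        # elements such as <br/>; the tags themselves are never included.
        pieces = (piece.strip() for piece in p.itertext())
        lines.append(" ".join(piece for piece in pieces if piece))
    return lines


for line in ttml_lines(SAMPLE):
    print(line)
```

Entities such as `&gt;` are decoded by the parser, so no separate unescaping step is needed. If the `>>>` markers are unwanted, they can be stripped from each returned line afterwards.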

Bill Huang

Is the input always TTML? If so, ttconv can split TTML/IMSC documents into a series of Intermediate Synchronic Documents (ISDs), each one corresponding to a period of time during which the contents of the TTML/IMSC document are static. Text can then be extracted from each ISD.

import ttconv.imsc.reader
import ttconv.isd
import xml.etree.ElementTree as et

tt_doc = """<?xml version="1.0" encoding="UTF-8"?>
  <tt xml:lang="fr" xmlns="http://www.w3.org/ns/ttml">
  <body>
    <div>
      <p begin="1s" end="2s">Hello</p>
      <p begin="3s" end="4s">Bonjour</p>
    </div>
  </body>
  </tt>"""

m = ttconv.imsc.reader.to_model(et.ElementTree(et.fromstring(tt_doc)))

st = ttconv.isd.ISD.significant_times(m)

for t in st:
  isd = ttconv.isd.ISD.from_model(m, t)
  
  # walk through all Text elements in `isd` to extract text

ttconv also supports conversion from TTML/IMSC to SRT, which is a simple text-based format.

tt.py convert -i <input .ttml file> -o <output .srt file> --otype SRT --itype TTML
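
Once an SRT file exists, getting one line of text per cue is a small parsing job. Below is a rough sketch using only the standard library. `srt_to_lines`, `TIMING`, and `SAMPLE` are hypothetical names, and the sample SRT text is hand-written for illustration, not actual ttconv output. It assumes the usual SRT layout (an index line, a timing line, then the text lines, with cues separated by blank lines) and that a `<br/>` in the source ends up as a line break inside the cue.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:02,000".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)

# Hand-written sample for illustration only (assumed SRT layout).
SAMPLE = """1
00:00:00,714 --> 00:00:00,857
HISTORY TV" THIS
WEEKEND

2
00:00:00,857 --> 00:00:03,000
HISTORY TV" THIS
WEEKEND ON
"""


def srt_to_lines(srt_text):
    """Return one string per cue, joining a cue's text lines with spaces."""
    lines = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        rows = [row.strip() for row in block.splitlines()]
        for i, row in enumerate(rows):
            if TIMING.match(row):
                # Everything after the timing line is cue text.
                text = " ".join(r for r in rows[i + 1:] if r)
                if text:
                    lines.append(text)
                break
    return lines


for line in srt_to_lines(SAMPLE):
    print(line)
```

Blocks without a timing line are skipped, so a stray header or trailing junk does not end up in the output.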