Strip Doctype from HTML using Beautifulsoup4?

Question

I'm new to Python and BeautifulSoup, so bear with me...

I'm trying to figure out how to remove the Doctype from an HTML file using BeautifulSoup4, but I can't work out exactly how to achieve this.

def saveToText(self):
    filename = os.path.join(self.parent.ReportPath, str(self.parent.CharName.text()) + "_report.txt")
    filename, filters = QFileDialog.getSaveFileName(self, "Save Report", filename, "Text (*.txt);;All Files (*.*)")

    if filename is not None and str(filename) != '':

        try:
            if re.compile(r'\.txt$').search(str(filename)) is None:  # raw string avoids an invalid-escape warning
                filename = str(filename)
                filename += '.txt'

            soup = BeautifulSoup(self.reportHtml, "lxml")

            try:  # THROWS AttributeError IF NOT FOUND ..
                soup.find('font').extract()
            except AttributeError:
                pass

            try:  # THROWS AttributeError IF NOT FOUND ..
                soup.find('head').extract()

            except AttributeError:
                pass

            soup.html.unwrap()
            soup.body.unwrap()

            for b in soup.find_all('b'):
                b.unwrap()

            for table in soup.find_all('table'):
                table.unwrap()

            for td in soup.find_all('td'):
                td.unwrap()

            for br in soup.find_all('br'):
                br.replace_with('\n')

            for center in soup.find_all('center'):
                center.insert_after('\n')

            for dl in soup.find_all('dl'):
                dl.insert_after('\n')

            for dt in soup.find_all('dt'):
                dt.insert_after('\n')

            for hr in soup.find_all('hr'):
                hr.replace_with(('-' * 80) + '\n')

            for tr in soup.find_all('tr'):
                tr.insert_before('  ')
                tr.insert_after('\n')

            print(soup)

        except IOError:
            QMessageBox.critical(None, 'Error!', 'Error writing to file: ' + filename, 'OK')

I tried using:

from bs4 import Doctype

if isinstance(e, Doctype):
    e.extract()

but that complains that 'e' is an unresolved reference. I've searched through the documentation and Google, but I haven't found anything that works.

On a side note, is there a way to reduce this code?
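One possible way to shorten the tag handling is to group the tag names by what happens to them and loop over each group. This is only a sketch: the function name `html_to_text`, the group names, and the sample string are made up for illustration, and it uses the built-in `html.parser` instead of `lxml` so it runs without extra dependencies. It leaves out the Doctype removal and the file dialog entirely.

```python
from bs4 import BeautifulSoup

# Hypothetical groupings of the tag names used in the code above.
REMOVE_FIRST = ["font", "head"]                    # drop the first match, children included
UNWRAP = ["html", "body", "b", "table", "td"]      # drop the tag, keep its children
NEWLINE_AFTER = ["center", "dl", "dt"]             # add a line break after the tag


def html_to_text(html: str) -> str:
    """Apply the same tag clean-up as above and return the remaining markup."""
    soup = BeautifulSoup(html, "html.parser")

    # find() returns None when nothing matches, so no try/except is needed.
    for name in REMOVE_FIRST:
        tag = soup.find(name)
        if tag is not None:
            tag.extract()

    # find_all() accepts a list of tag names and returns a plain result list,
    # so modifying the tree inside the loop is safe.
    for tag in soup.find_all(UNWRAP):
        tag.unwrap()

    for br in soup.find_all("br"):
        br.replace_with("\n")

    for tag in soup.find_all(NEWLINE_AFTER):
        tag.insert_after("\n")

    for hr in soup.find_all("hr"):
        hr.replace_with("-" * 80 + "\n")

    for tr in soup.find_all("tr"):
        tr.insert_before("  ")
        tr.insert_after("\n")

    return str(soup)


sample = (
    "<html><head><title>t</title></head><body><b>Name</b><br/><hr/>"
    "<table><tr><td>HP</td></tr></table></body></html>"
)
text = html_to_text(sample)
print(text)
```

As in the code above, tags that are not unwrapped or replaced (for example `tr`) are still present in the returned string.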

@SamChats I didn't, which I'm sure is why that is happening, but the example I was working from didn't either. I'm not really sure what 'e' is supposed to be defined as. The documentation for BeautifulSoup is pretty decent, but it really wasn't enough information for me to go on. — artomason, Nov 08 '17 at 05:42
@SamChats tried popping soup in there; the error is gone, but the Doctype remains in my output. — artomason, Nov 08 '17 at 05:45
Then the error is in the way you're trying to remove it, in the upper part of your code. By the way, why are you replacing `tr`s? — Sam Chats, Nov 08 '17 at 05:59
@SamChats the HTML this is breaking down organizes its data by table rows, so I'm adding an indent before the data and a line break after. I have found quite a few examples of how to pull the Doctype, but nothing that really removes it. The closest was https://stackoverflow.com/questions/33207503/how-do-i-remove-an-xml-declaration-using-beautifulsoup4 — artomason, Nov 08 '17 at 06:03

score 3 · Accepted Answer · answered Nov 08 '17 at 06:49

This seemed to correct the problem perfectly.

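from bs4 import BeautifulSoup, Doctype

# 'soup' is the BeautifulSoup object built earlier from the report HTML.
# Iterate over a copy, because extract() removes items from soup.contents.
for item in list(soup.contents):
    if isinstance(item, Doctype):
        item.extract()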
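
For anyone who wants to try this in isolation, here is a minimal self-contained sketch of the same idea. The helper name `strip_doctype` and the sample markup are made up for illustration, and it assumes the built-in `html.parser` rather than `lxml`.

```python
from bs4 import BeautifulSoup, Doctype


def strip_doctype(html: str) -> str:
    """Return the markup with any top-level Doctype node removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Copy the list first: extract() removes nodes from soup.contents.
    for item in list(soup.contents):
        if isinstance(item, Doctype):
            item.extract()
    return str(soup)


sample = "<!DOCTYPE html><html><body><p>Report</p></body></html>"
result = strip_doctype(sample)
print(result)
```

The Doctype is an ordinary node among the top-level contents of the soup, which is why an `isinstance` check against `Doctype` is enough to find it.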

artomason


This worked great! Thanks for posting. – Thom Ives Aug 25 '23 at 00:44
