3

I have an XHTML file that is structured like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>

I'm using BeautifulSoup and I want to remove the XML declaration from the document, so what I have looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>

I can't find a way to get at the XML declaration to remove it. It doesn't appear to be a Doctype, Declaration, Tag, or NavigableString as far as I can tell. Is there a way I can find this to extract it?

As a working example, I can remove the Doctype with code like this (assuming the document text is the variable "html"):

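from bs4 import BeautifulSoup, Doctype
soup = BeautifulSoup(html)
# Copy the list first, because extract() mutates soup.contents during iteration
[item.extract() for item in list(soup.contents) if isinstance(item, Doctype)]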
Jason Champion

2 Answers

3

You could use the following approach:

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

for e in soup:
    if isinstance(e, bs4.element.ProcessingInstruction):
        e.extract()
        break

print(soup)

For your sample, this would give you the updated HTML as:

<!DOCTYPE html>

<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html></html></html>
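
If the document might contain more than one processing instruction, or one that is not at the top level, the same idea can be wrapped in a small helper. This is only a sketch: the function name `remove_processing_instructions` and the sample string are made up for illustration, and it assumes the `html.parser` parser as above (other parsers may represent the XML declaration differently).

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction


def remove_processing_instructions(soup):
    """Detach every processing-instruction node and return how many were removed."""
    # Collect the nodes first: extract() modifies the tree being traversed.
    targets = [node for node in soup.descendants
               if isinstance(node, ProcessingInstruction)]
    for node in targets:
        node.extract()
    return len(targets)


# Hypothetical sample input, modelled on the document in the question.
html = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html>\n'
    '<html lang="en"><head></head><body></body></html>'
)

soup = BeautifulSoup(html, 'html.parser')
removed = remove_processing_instructions(soup)
print(removed)  # number of processing instructions that were detached
print(soup)     # still a BeautifulSoup object, not a string
```

Because the helper works on the parsed tree, `soup` stays a `BeautifulSoup` object and can still be searched or modified afterwards.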
Martin Evans
  • I'm getting `name 'bs4' is not defined`. Does your code still work? Both of my solutions below produce strings, I want to keep the `bs` type, so I'd be very interested in knowing how to make your code work. :-) – PatrickT May 02 '20 at 08:45
  • Yes, this code still works fine. You will though need to install [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup) before you can use it, otherwise you will see your error message. – Martin Evans May 02 '20 at 08:51
  • You're right, your code works as is, thanks. I wasn't getting the desired result because I was using the `xml` parser rather than the `html.parser`. With the xml parser, a header is introduced (the `'bs4'` error is something stupid I must have done and irrelevant) – PatrickT May 02 '20 at 10:06
  • `NameError: name 'bs4' is not defined` is a result of forgetting to import the entire bs4 module: I typically just do `from bs4 import BeautifulSoup`. But here `import bs4` is needed. Funny how this error has come to bite me again one month later. But now I know why. :-) – PatrickT Jun 11 '20 at 12:53
0

Here is what worked for me in some very simple cases:

from bs4 import BeautifulSoup
s = "<a value='label'/>"
s = BeautifulSoup(s, 'xml')
print(s)
## <?xml version="1.0" encoding="utf-8"?>
## <a value="label"/>
  1. with bs syntax:

    s.decode_contents()
    ## '<a value="label"/>'
    
  2. with string.split:

    str(s).split("\n")[-1]
    ## '<a value="label"/>'
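
The `split("\n")` variant only works when everything after the declaration sits on a single line. For multi-line output, a string-level alternative is to strip just a leading XML declaration with a regular expression. This is a rough sketch, not part of BeautifulSoup: the helper name `strip_xml_declaration` and the pattern are assumptions made for illustration, and it only looks at the very start of the text.

```python
import re

# An XML declaration at the very start of the text, plus surrounding whitespace.
# "<?xml" must be followed by whitespace, so "<?xml-stylesheet ...?>" is left alone.
_XML_DECLARATION = re.compile(r'\A\s*<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return text without a leading XML declaration, if one is present."""
    return _XML_DECLARATION.sub('', text, count=1)


# Hypothetical multi-line document, where split("\n")[-1] would keep only "</a>".
doc = '<?xml version="1.0" encoding="utf-8"?>\n<a value="label">\n<b/>\n</a>'
print(strip_xml_declaration(doc))
```

In the example above this would be called as `strip_xml_declaration(str(s))`; like the two options listed, it returns a string rather than a BeautifulSoup object.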
    
PatrickT
  • I used a self-closing tag in my example because my main purpose in using the `xml` parser instead of, say, `html.parser` or `html5lib`, was to be able to use self-closing tags. – PatrickT May 02 '20 at 08:48
  • `decode_contents()` may not remove the XML declaration if it is a string inside the htm... – PatrickT May 07 '20 at 21:20