How Do I Remove An XML Declaration Using BeautifulSoup4

Question

I have an XHTML file that is structured like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>

I'm using BeautifulSoup and I want to remove the XML declaration from the document, so what I have looks like this:

<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>

I can't find a way to get at the XML declaration to remove it. It doesn't appear to be a Doctype, Declaration, Tag, or NavigableString as far as I can tell. Is there a way I can find this to extract it?

As a working example, I can remove the Doctype with code like this (assuming the document text is the variable "html"):

from bs4 import BeautifulSoup, Doctype
soup = BeautifulSoup(html, 'html.parser')
[item.extract() for item in soup.contents if isinstance(item, Doctype)]

Martin Evans · Accepted Answer · 2020-06-11T13:03:52.687

3

You could use the following approach:

import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

for e in soup:
    if isinstance(e, bs4.element.ProcessingInstruction):
        e.extract()
        break

print(soup)

For your sample, this would give you the updated HTML as:

<!DOCTYPE html>

<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html></html></html>
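
If more than one such node might be present, the same idea can be wrapped in a small helper. This is only a sketch: `drop_processing_instructions` is a hypothetical name (not part of BeautifulSoup), it assumes the `html.parser` backend used above, and it only looks at top-level nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction


def drop_processing_instructions(markup):
    """Parse markup with html.parser and remove top-level <?...?> nodes.

    Hypothetical helper for illustration, not a BeautifulSoup API.
    """
    soup = BeautifulSoup(markup, "html.parser")
    # Collect the matches first so extract() does not mutate
    # soup.contents while it is being iterated.
    targets = [n for n in soup.contents if isinstance(n, ProcessingInstruction)]
    for node in targets:
        node.extract()
    return soup


sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<!DOCTYPE html>\n"
    '<html lang="en"><head></head><body></body></html>'
)
cleaned = drop_processing_instructions(sample)
print(cleaned)
```

The return value is still a `BeautifulSoup` object, so it can be searched or modified further before being written out.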

edited Jun 11 '20 at 13:03

answered Oct 19 '15 at 06:25

I'm getting `name 'bs4' is not defined`. Does your code still work? Both of my solutions below produce strings, I want to keep the `bs` type, so I'd be very interested in knowing how to make your code work. :-) – PatrickT May 02 '20 at 08:45
Yes, this code still works fine. You will need to install [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup) before you can use it, though; otherwise you will see your error message. – Martin Evans May 02 '20 at 08:51
You're right, your code works as is, thanks. I wasn't getting the desired result because I was using the `xml` parser rather than the `html.parser`. With the xml parser, a header is introduced (the `'bs4'` error is something stupid I must have done and irrelevant) – PatrickT May 02 '20 at 10:06
`NameError: name 'bs4' is not defined` is a result of forgetting to import the entire bs4 module: I typically just do `from bs4 import BeautifulSoup`. But here `import bs4` is needed. Funny how this error has come to bite me again one month later. But now I know why. :-) – PatrickT Jun 11 '20 at 12:53

PatrickT · Answer 2 · 2020-05-07T22:10:51.217

0

Here is what worked for me in some very simple cases:

from bs4 import BeautifulSoup
s = "<a value='label'/>"
s = BeautifulSoup(s, 'xml')
print(s)
## <?xml version="1.0" encoding="utf-8"?>
## <a value="label"/>

with bs syntax:

s.decode_contents()
## '<a value="label"/>'

with string.split:

str(s).split("\n")[-1]
## '<a value="label"/>'
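
To keep a bs object rather than a string, one option is to work from the root element. The sketch below assumes the `xml` parser (which needs `lxml` installed) and the same one-element sample. As far as I can tell, with this parser the declaration is not a node in the tree at all; it is only prepended when the whole document is converted to text, which is why there is nothing to `extract()`.

```python
from bs4 import BeautifulSoup
from bs4.element import ProcessingInstruction

soup = BeautifulSoup("<a value='label'/>", "xml")  # requires lxml

# With the xml parser there is no declaration node to find in the tree.
declaration_nodes = [
    n for n in soup.contents if isinstance(n, ProcessingInstruction)
]

# The header only shows up when the whole document is turned into text.
full_text = str(soup)

# Serializing from the root element skips that header, and `root`
# is still a bs4 Tag that can be searched and modified.
root = soup.find(True)
root_text = str(root)
print(root_text)
```

Here `root` stays a `Tag`, so the string conversion can be postponed until the document is actually written out.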

edited May 07 '20 at 22:10

answered May 02 '20 at 08:41

I used a self-closing tag in my example because my main purpose in using the `xml` parser instead of, say, `html.parser` or `html5lib`, was to be able to use self-closing tags. – PatrickT May 02 '20 at 08:48
`decode_contents()` may not remove the XML declaration if it is a string inside the htm... – PatrickT May 07 '20 at 21:20
