BeautifulSoup (bs4) XML parser removes html entities

Question

Look at this example:

# xml parser
import bs4
print(bs4.BeautifulSoup('<price>&pound;4</price>', 'xml'))

# prints:
<?xml version="1.0" encoding="utf-8"?>
<price>4</price>

# html (lxml) parser
print(bs4.BeautifulSoup('<span>&pound;4</span>', 'lxml'))

# prints:
<html><body><span>£4</span></body></html>

Notice the £ sign. Why does the XML parser remove it? What should I do to keep it in the output? I need XML parsing because the document contains some paired tags that the lxml HTML parser parses incorrectly (e.g. <link>).

Do you have to use the xml parser? – Padraic Cunningham Apr 14 '16 at 19:24

score 0 · Answer 1 · edited May 23 '17 at 12:31

`&pound;` is not a standard XML entity - use for example the numeric character reference `&#163;` instead. `&pound;` is an HTML entity, and HTML entities can't be used in XML without declaring (or embedding) them in a DTD.

Edit: See for example "How do I define HTML entity references inside a valid XML document?"
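To illustrate the difference between the two kinds of reference, here is a minimal sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for a strict XML parser; that choice is an assumption made for the demo, not what BeautifulSoup does internally. A strict parser rejects the undeclared `&pound;` outright, whereas BeautifulSoup's `'xml'` mode, which relies on lxml and, as far as I know, runs it in a recovering mode, appears to drop it silently.

```python
import xml.etree.ElementTree as ET

# Numeric character reference: valid in any XML document, no DTD needed.
numeric = ET.fromstring('<price>&#163;4</price>')
print(numeric.text)

# Named HTML entity: undefined in plain XML, so a strict parser rejects it.
try:
    ET.fromstring('<price>&pound;4</price>')
    named_ok = True
except ET.ParseError as exc:
    named_ok = False
    print('rejected:', exc)
```

The numeric form should survive `bs4.BeautifulSoup(..., 'xml')` in the same way, since it does not depend on any entity declaration.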

answered Apr 13 '16 at 12:10

Trondster

Hi, thanks for the reply. You are probably right, but `BS` is still not parsing it correctly: `bs4.BeautifulSoup(' ]>£4', 'xml')` prints ` 4` – uiii Apr 14 '16 at 08:46
Could you use `&#163;` in the input instead, or massage the input HTML in some other way? – Trondster Apr 15 '16 at 10:09
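One possible way to massage the input, sketched under a few assumptions: the helper name `html_entities_to_numeric` is hypothetical, the entity table comes from the standard library's `html.entities.name2codepoint`, and a plain regex is used, so it does not know about CDATA sections or comments. It rewrites named HTML entities as numeric character references and leaves the five entities that XML predefines alone.

```python
import re
from html.entities import name2codepoint

# The five entities XML predefines; these must be left untouched.
XML_PREDEFINED = {'amp', 'lt', 'gt', 'quot', 'apos'}

# Matches named references such as &pound; but not numeric ones like &#163;.
_ENTITY_RE = re.compile(r'&([A-Za-z][A-Za-z0-9]*);')


def html_entities_to_numeric(markup):
    """Replace named HTML entities with numeric character references.

    Hypothetical helper for illustration only.
    """
    def _replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML-native or unknown names alone
        return '&#{};'.format(name2codepoint[name])

    return _ENTITY_RE.sub(_replace, markup)


fixed = html_entities_to_numeric('<price>&pound;4 &amp; &nbsp;up</price>')
print(fixed)
```

The resulting string could then be passed to `bs4.BeautifulSoup(fixed, 'xml')` in place of the original markup.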
