Look at this example:

# xml parser
bs4.BeautifulSoup('<price>&pound;4</price>', 'xml')

# prints:
<?xml version="1.0" encoding="utf-8"?>
<price>4</price>

# html (lxml) parser
bs4.BeautifulSoup('<span>&pound;4</span>', 'lxml')

# prints:
<html><body><span>£4</span></body></html>

Notice the £ sign. Why does the XML parser remove it, and what should I do to keep it in the output? I need XML parsing because the document contains some paired tags that the lxml HTML parser parses incorrectly (e.g. <link>).

uiii

1 Answer

&pound; is not a standard XML entity; use, for example, &#163; instead. XML predefines only five entities (&amp;, &lt;, &gt;, &apos; and &quot;). &pound; is an HTML entity and can't be used in XML without declaring it in a DTD (or embedding the declarations).

Edit: See for example How do I define HTML entity references inside a valid XML document?
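
If the input can't be changed at the source, one option is to rewrite the named HTML entities into numeric character references before handing the markup to the XML parser. The following is a minimal sketch of that idea using only the standard library; the helper name `html_entities_to_numeric` and the regular expression are illustrative assumptions, not part of BeautifulSoup or lxml.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser already understands; leave them untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Matches named references such as &pound; (numeric ones like &#163; do not match).
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_numeric(markup):
    """Hypothetical helper: turn named HTML entities into numeric references."""

    def _replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep XML entities and unknown names as-is
        return "&#{};".format(name2codepoint[name])

    return _ENTITY_RE.sub(_replace, markup)


fixed = html_entities_to_numeric('<price>&pound;4</price>')
print(fixed)  # <price>&#163;4</price>
```

The resulting string can then be passed to `bs4.BeautifulSoup(fixed, 'xml')`; since numeric character references are valid in any XML document, the £ sign should survive parsing. Note that this simple regex also rewrites entity-like text inside CDATA sections, so it may need refinement for documents that use them.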

Trondster
  • Hi, thanks for the reply. You are probably right, but `BS` is still not parsing it correctly: `bs4.BeautifulSoup(' ]>£4', 'xml')` prints ` 4` – uiii Apr 14 '16 at 08:46
  • Could you use `£` in the input instead, or massage the input HTML in some other way? – Trondster Apr 15 '16 at 10:09