Python: Parsing SGML

Question

I'm trying to parse some SGML like the following in Python:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>

Here, I'm just looking for everything inside the <BODY> tags (i.e. ["Sample One", "Sample Two"]).

I've tried using BeautifulSoup, but it doesn't like the <!DOCTYPE> in the first line and also expects everything to be wrapped in a root tag like <everything></everything>. While I can manually make these changes before passing the text to BeautifulSoup, that feels a bit too hacky.

I'm pretty new to SGML, and also not married to BeautifulSoup, so I'm open to any suggestions.

(For those curious: my specific use case is the reuters21578 dataset.)

What parser are you using with BeautifulSoup? – Anand S Kumar Jul 29 '15 at 08:03
I'm using "lxml-xml", as recommended in one of the docs. – scip Jul 29 '15 at 08:04

Anand S Kumar · Answer 1 · 2015-07-29T08:49:43.960

You can try 'html.parser' as the parser instead of 'lxml-xml'. lxml-xml expects well-formed XML, which this input is not: it has several top-level <TEXT> elements rather than a single root element.

Example:

>>> from bs4 import BeautifulSoup
>>> s = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
... <TEXT>
...     <TITLE>One</TITLE>
...     <BODY>Sample One</BODY>
... </TEXT>
... <TEXT>
...     <TITLE>Two</TITLE>
...     <BODY>Sample Two</BODY>
... </TEXT>"""
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.find_all('body')
[<body>Sample One</body>, <body>Sample Two</body>]
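
The question asks for the text inside the tags rather than the tag objects. A minimal sketch of that last step, assuming the same sample input; `SAMPLE` and `extract_bodies` are hypothetical names used only for illustration:

```python
from bs4 import BeautifulSoup

# Same sample as in the question.
SAMPLE = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>"""


def extract_bodies(sgml):
    """Return the text of every <BODY> element (hypothetical helper)."""
    soup = BeautifulSoup(sgml, "html.parser")
    # html.parser lowercases tag names, so search for 'body', not 'BODY'.
    return [tag.get_text() for tag in soup.find_all("body")]


bodies = extract_bodies(SAMPLE)
print(bodies)  # ['Sample One', 'Sample Two']
```

Note the lowercase 'body' in the search: as the output above shows, html.parser normalizes tag names to lowercase.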

edited Jul 29 '15 at 08:49

answered Jul 29 '15 at 08:06


Thanks! Is the only way to make this valid XML to strip out the first line and wrap everything in a dummy root element? I'm interested in using lxml primarily for performance reasons (also, is there no standard SGML parser?). – scip Jul 29 '15 at 08:11
I don't think BeautifulSoup has a built-in SGML parser. `'lxml'` may also work for you; that is lxml's HTML parser (not the XML version). More on the supported parsers: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – Anand S Kumar Jul 29 '15 at 08:21
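
For reference, the "strip the first line and wrap in a dummy root" idea from the comment above can be sketched with the standard library's `xml.etree.ElementTree`. This is only an illustration under assumptions: the name `extract_bodies_xml`, the `<root>` wrapper, and the simple DOCTYPE regex are made up for this example, and it assumes that everything after the DOCTYPE is otherwise well-formed XML. Real SGML files may omit end tags or contain character references that XML rejects, in which case `ET.fromstring` raises `ParseError`.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>"""


def extract_bodies_xml(sgml):
    """Hypothetical helper: wrap the fragment in a dummy root and parse as XML."""
    # Drop the DOCTYPE declaration (assumes it has no internal subset).
    without_doctype = re.sub(r"<!DOCTYPE[^>]*>", "", sgml, count=1)
    # XML requires exactly one top-level element, so add one.
    wrapped = "<root>" + without_doctype + "</root>"
    root = ET.fromstring(wrapped)
    # XML tag names are case-sensitive, so match 'BODY' exactly.
    return [element.text for element in root.iter("BODY")]


bodies_xml = extract_bodies_xml(SAMPLE)
print(bodies_xml)  # ['Sample One', 'Sample Two']
```

Unlike the html.parser route, this fails loudly on anything that is not well-formed XML, so it may be less forgiving on real data.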
