
I'm trying to parse some SGML like the following in Python:

<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>

Here, I'm just looking for everything inside the <BODY> tags (i.e. ["Sample One", "Sample Two"]).

I've tried using BeautifulSoup, but it doesn't like the <!DOCTYPE> on the first line, and it also expects everything to be wrapped in a single root tag like <everything></everything>. While I could make these changes manually before passing the text to BeautifulSoup, that feels a bit too hacky.

I'm pretty new to SGML, and also not married to BeautifulSoup, so I'm open to any suggestions.

(For those curious: my specific use case is the Reuters-21578 dataset.)

scip

1 Answer


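You can try using 'html.parser' as the parser instead of lxml-xml. lxml-xml expects the text to be well-formed XML, which is not the case here (there is no single root element). Note that 'html.parser' lowercases tag names, so search for 'body' rather than 'BODY'.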

Example:

>>> from bs4 import BeautifulSoup
>>> s = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
... <TEXT>
...     <TITLE>One</TITLE>
...     <BODY>Sample One</BODY>
... </TEXT>
... <TEXT>
...     <TITLE>Two</TITLE>
...     <BODY>Sample Two</BODY>
... </TEXT>"""
>>> soup = BeautifulSoup(s, 'html.parser')
>>> soup.find_all('body')
[<body>Sample One</body>, <body>Sample Two</body>]
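
The call above returns tag objects, while the question asks for the strings themselves. A minimal sketch of that last step, assuming bs4 is installed and reusing the sample from the question (the variable name `bodies` is only illustrative):

```python
from bs4 import BeautifulSoup

# Sample input copied from the question.
s = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
    <TITLE>One</TITLE>
    <BODY>Sample One</BODY>
</TEXT>
<TEXT>
    <TITLE>Two</TITLE>
    <BODY>Sample Two</BODY>
</TEXT>"""

# 'html.parser' tolerates the DOCTYPE line and the missing root element.
soup = BeautifulSoup(s, "html.parser")

# Tag names are lowercased by this parser, hence 'body' and not 'BODY'.
bodies = [tag.get_text() for tag in soup.find_all("body")]
print(bodies)
```

If the real data has whitespace around the body text, `tag.get_text(strip=True)` may be a better fit.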
Anand S Kumar
  • Thanks! Is the only way to make this valid XML to strip out the first line and wrap everything in a dummy root element? I'm interested in using lxml primarily for performance reasons. (Also, is there no standard SGML parser?) – scip Jul 29 '15 at 08:11
  • I don't think BeautifulSoup has a built-in SGML parser. `'lxml'` may also work for you; that is lxml's HTML parser (not the XML version). More on the supported parsers: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser – Anand S Kumar Jul 29 '15 at 08:21