I'm trying to parse some SGML like the following in Python:
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
<TITLE>One</TITLE>
<BODY>Sample One</BODY>
</TEXT>
<TEXT>
<TITLE>Two</TITLE>
<BODY>Sample Two</BODY>
</TEXT>
Here, I'm just looking for everything inside the <BODY> tags (i.e. ["Sample One", "Sample Two"]).
I've tried using BeautifulSoup, but it doesn't like the <!DOCTYPE> on the first line and also expects everything to be wrapped in a root tag like <everything></everything>. While I can manually make these changes before passing the text to BeautifulSoup, it feels a bit too hacky.
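For reference, the manual workaround I have in mind looks roughly like the sketch below. To keep it self-contained it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup. The names `SAMPLE` and `extract_bodies` are made up for illustration, and it assumes the input is otherwise well-formed XML, which real SGML may well not be.

```python
import re
import xml.etree.ElementTree as ET

# The sample from above, inlined so the sketch runs on its own.
SAMPLE = """<!DOCTYPE lewis SYSTEM "lewis.dtd">
<TEXT>
<TITLE>One</TITLE>
<BODY>Sample One</BODY>
</TEXT>
<TEXT>
<TITLE>Two</TITLE>
<BODY>Sample Two</BODY>
</TEXT>
"""


def extract_bodies(sgml):
    """Return the text of every <BODY> element (hypothetical helper)."""
    # Hack 1: drop the DOCTYPE declaration the parser chokes on.
    without_doctype = re.sub(r"<!DOCTYPE[^>]*>", "", sgml)
    # Hack 2: wrap everything in a single artificial root element.
    wrapped = "<everything>" + without_doctype + "</everything>"
    # Assumption: what remains is well-formed XML.
    root = ET.fromstring(wrapped)
    return [body.text for body in root.iter("BODY")]


print(extract_bodies(SAMPLE))  # ['Sample One', 'Sample Two']
```

This does give me the list I want for the sample, but it relies on string surgery before parsing, which is the part I'd like to avoid.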
I'm pretty new to SGML, and also not married to BeautifulSoup, so I'm open to any suggestions.
(For those curious: my specific use case is the Reuters-21578 dataset.)