I have around 600 XML documents that need to be parsed for further processing. However, they are not well-formed XML (and therefore not valid) because some closing tags are missing. The structure they should have is:
<article xmlns:xlink="http://www.w3.org/1999/xlink">
<bdy>
.....
.....
.....
</bdy>
</article>
A single XML document contains hundreds of such <article>...</article> blocks. The problem is that some of these blocks are missing either the closing </bdy> tag or the closing </article> tag, which makes the files impossible to parse with Python modules such as lxml, xml.dom, and xml.etree.ElementTree.
Also, since there are about 600 such files, fixing them by hand seems infeasible.
Any suggestions on how to handle them properly in some other way?
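One direction I can imagine is a pre-processing pass that re-inserts the missing closing tags before the parser ever sees the file. The sketch below is only a rough illustration. It assumes that <article> blocks are never nested, that each block holds at most one <bdy>, and that the tag names do not appear inside comments or CDATA sections. The helper name close_missing_tags is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Matches only the four structural tags described above.
# (?:\s[^>]*)? allows attributes on <article> without matching
# other element names that merely start with "article".
TAG_RE = re.compile(r"<article(?:\s[^>]*)?>|</article>|<bdy>|</bdy>")


def close_missing_tags(text):
    """Insert missing </bdy> and </article> tags (hypothetical helper).

    Assumes articles are not nested: a new <article> start tag, or the
    end of the text, implies that the previous article should be closed.
    """
    out = []
    pos = 0
    in_article = False
    in_bdy = False
    for match in TAG_RE.finditer(text):
        tag = match.group()
        out.append(text[pos:match.start()])  # copy text before this tag
        pos = match.end()
        if tag.startswith("<article"):
            # Close whatever the previous article left open.
            if in_bdy:
                out.append("</bdy>\n")
            if in_article:
                out.append("</article>\n")
            in_article, in_bdy = True, False
        elif tag == "<bdy>":
            in_bdy = True
        elif tag == "</bdy>":
            if not in_bdy:
                continue  # drop a closer that has no matching opener
            in_bdy = False
        else:  # "</article>"
            if in_bdy:
                out.append("</bdy>\n")
                in_bdy = False
            if not in_article:
                continue  # drop a closer that has no matching opener
            in_article = False
        out.append(tag)
    out.append(text[pos:])  # copy the remaining tail
    if in_bdy:
        out.append("</bdy>\n")
    if in_article:
        out.append("</article>\n")
    return "".join(out)
```

Since a file holds many <article> blocks and no single root element, the repaired text would probably still need a wrapper before parsing, for example ET.fromstring("<root>" + close_missing_tags(text) + "</root>"). lxml's XMLParser also accepts recover=True, but I am not sure it would put the missing closing tags where the structure above expects them.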
Thanks
"article.dtd" file can be downloaded as follows-