Repairing XML documents - Python 3

Question

I have around 600 XML documents that need to be parsed for further processing, but they are not well-formed because some closing tags are missing. The structure they should have is:

<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <bdy>
   .....
   .....
   .....
  </bdy>
</article>

A single XML document contains hundreds of such <article>...</article> blocks. The problem is that some of these blocks are missing either the closing </bdy> or the closing </article> tag, which makes them impossible to parse with Python modules such as 'lxml', 'xml.dom', 'xml.etree.ElementTree', etc.

Also, since there are about 600 such files, fixing them manually is infeasible.
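To illustrate the failure, here is a minimal sketch using xml.etree.ElementTree on a single article block. The helper name is_well_formed and the two sample strings are made up for illustration; a real file with many <article> blocks and no single root element, or with undeclared entity references, would be rejected by this check as well.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample blocks: the second one lacks the closing </bdy> tag.
good = '<article xmlns:xlink="http://www.w3.org/1999/xlink"><bdy>text</bdy></article>'
bad = '<article xmlns:xlink="http://www.w3.org/1999/xlink"><bdy>text</article>'

print(is_well_formed(good))
print(is_well_formed(bad))
```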

Any suggestions on how else to handle them properly?

Thanks

The "article.dtd" file can be downloaded here:

article.dtd

score 2 · Answer 1 · answered Dec 05 '18 at 22:31


You can make use of SGML tag inference to generate the missing end-element tags. Write a DTD file doc.dtd with the following content:

<!ELEMENT doc O O (article+)>
<!ELEMENT article - O (bdy)>
<!ELEMENT bdy - O (#PCDATA)>

This tells SGML that the end-element tags for article and bdy, as well as both the start- and end-element tags for doc (an artificial container element used as the document element), can be omitted, as indicated by the O tag omission indicators in each declaration.

Then insert the line

<!DOCTYPE doc SYSTEM "doc.dtd">

at the beginning of the file(s) to be parsed.
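With around 600 files, this step can be scripted. The following is only a sketch under assumptions: the function names, the "*.xml" pattern, and UTF-8 encoding are hypothetical choices, and it assumes the files do not already start with an XML declaration or another DOCTYPE. It writes copies to a separate directory rather than changing the originals.

```python
from pathlib import Path

# The DOCTYPE line from the answer; doc.dtd is expected next to the output files.
DOCTYPE_LINE = '<!DOCTYPE doc SYSTEM "doc.dtd">'


def with_doctype(text: str, doctype: str = DOCTYPE_LINE) -> str:
    """Return text with the DOCTYPE line prepended, unless one is already there."""
    if text.lstrip().startswith("<!DOCTYPE"):
        return text
    return doctype + "\n" + text


def add_doctype_to_files(src_dir: str, dst_dir: str, pattern: str = "*.xml") -> int:
    """Copy every matching file from src_dir to dst_dir with the DOCTYPE added.

    Assumes UTF-8 input (an assumption, adjust to the real encoding).
    Returns the number of files written.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(src.glob(pattern)):
        text = path.read_text(encoding="utf-8")
        (dst / path.name).write_text(with_doctype(text), encoding="utf-8")
        count += 1
    return count
```

A call such as add_doctype_to_files("broken", "with_doctype") would then leave the original files untouched and produce prefixed copies (both directory names are placeholders).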

Then install e.g. OpenSP and invoke the osx program on the file(s) to produce well-formed XML.
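Since the question is about Python 3, the tool can be driven from a script with subprocess. This is a rough sketch, not a tested pipeline: the function names and the injectable run parameter are hypothetical, and it assumes the program is on PATH and writes its result to standard output, as in a shell call of the form osx input.xml > output.xml. The program name is a parameter so that another OpenSP tool can be substituted.

```python
import subprocess
from pathlib import Path


def sgml_command(src, program="osx"):
    """Build the argument list for the converter (program name is configurable)."""
    return [program, str(src)]


def repair_file(src, dst, program="osx", run=subprocess.run):
    """Run the converter on src and write whatever it prints to dst.

    Returns the tool's stderr text so the caller can inspect its messages.
    The run parameter exists only so the call can be replaced in tests.
    """
    result = run(sgml_command(src, program), capture_output=True, text=True)
    Path(dst).write_text(result.stdout, encoding="utf-8")
    return result.stderr
```

For example, repair_file("file334.xml", "repaired_file334.xml") would correspond to running osx on one file and redirecting its output; any diagnostics come back as the return value instead of being lost.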

See also "Querying Non-XML compliant structured data" for more details.


imhotap


There is a file "article.dtd" containing entity declarations for special characters, for example <!ENTITY nbsp " ">, <!ENTITY iexcl "¡">, etc. Should I add the 3 lines of "doc.dtd" to the "article.dtd" file and then call the osx program on this XML file? – Arun Dec 06 '18 at 07:24
@Arun yes, and if there are additional element and attribute declarations in `article.dtd`, you should add them as well or adapt them accordingly, though it's difficult to tell without further info – imhotap Dec 06 '18 at 08:11
I added the "article.dtd" file to the question. When I run the command 'osx file334.xml > repaired_file334.xml' it gives me many errors, such as "osx:doc533.xml:362:184: entity was defined here osx:doc533.xml:644:18:E: reference to entity "nbsp" for which no system identifier could be generated" – Arun Dec 06 '18 at 09:13
@Arun The error you're getting is because the .dtd file isn't being used so check your DOCTYPE declaration in line 1. I have just made up a minimal test file containing an article with omitted article and bdy end-elements and it's working fine when following the instructions outlined in the SO answer I linked. BUT you should use `osgmlnorm` rather than `osx` and you must change the line `OMITTAG NO` into `OMITTAG YES` in the SGML declaration in order to make SGML not complain about missing end-element tags. – imhotap Dec 06 '18 at 10:49
