I am parsing USPTO patents from 2001 in SGML format. At the top of each file, an external DTD is referenced. Unfortunately, this DTD seems to be invalid. A validity check confirms that:

Line 361
Error: A '(' character or an element type is required within declaration of element type "ADR".
<!ELEMENT ADR  - - (OMC?,STR*,CITY?,CNTY?,STATE?,CTRY?,PCODE?,EAD*,TEL*,FAX* ...

However, I do not need to validate the SGML files to be processed. I just need the SGML parser to be aware of the entities. Currently, I am using Python with the LXML library. I call the XMLParser as follows:

parser = etree.XMLParser(target=SimpleXMLHandler(), resolve_entities=False, load_dtd=dtd, dtd_validation=False, recover=True)  

Still, I immediately get the error that the external DTD is invalid at line 361. How can I avoid this? I did not write the DTD, so I would rather not repair it.

Regards!

labrassbandito
  • Take a look at my answer to another question about differences between XML and SGML. http://stackoverflow.com/questions/4231135/strategy-for-parsing-lots-and-lots-of-not-so-well-formed-sgml-xml-documents/4231758#4231758 – Daniel Haley Jul 04 '11 at 19:30
  • XML DTDs are not the same as SGML DTDs, and you're using an XML parser, which can't cope with the freedoms SGML provides, primarily because SGML allows things like optional end tags (<p> in HTML for example) whereas all XML tags must be properly closed. –  Jul 05 '11 at 02:49

1 Answer

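As Chrono Kitsune already noted, the problem is XML versus SGML: the DTD is not a valid XML DTD because it is an SGML DTD. The `- -` after the element name on line 361 is SGML tag-minimization syntax, which XML DTDs do not allow, so an XML parser rejects the declaration.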

I'd suggest converting the SGML documents to XML first, for example using sx (the SGML-to-XML converter from James Clark's SP, shipped as osx in OpenSP).
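A minimal sketch of that conversion step driven from Python, assuming the OpenSP `osx` binary (or the older `sx`) is installed and on the PATH. The helper names `build_sx_command` and `sgml_to_xml` and the file name in the usage note are made up for illustration. Whether the USPTO files convert cleanly, and whether they need an SGML catalog to locate the DTD, is something to check against your own data.

```python
import shutil
import subprocess


def build_sx_command(sgml_path, executable="osx", catalog=None):
    """Build the argument list for an sx/osx run (hypothetical helper)."""
    # -xlower asks the converter to prefer lower-case names in its output.
    cmd = [executable, "-xlower"]
    if catalog is not None:
        # -c names an SGML catalog that maps identifiers to the DTD file.
        cmd += ["-c", catalog]
    cmd.append(sgml_path)
    return cmd


def sgml_to_xml(sgml_path, catalog=None):
    """Return the XML that the converter writes to stdout, as bytes."""
    executable = shutil.which("osx") or shutil.which("sx")
    if executable is None:
        raise RuntimeError("neither osx nor sx was found on PATH")
    # No check=True here: the converter may report problems on stderr and
    # exit non-zero while still having written XML to stdout.
    result = subprocess.run(
        build_sx_command(sgml_path, executable, catalog),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.stdout
```

The returned bytes can then be handed to lxml, for example `etree.fromstring(sgml_to_xml("patent.sgm"), parser)`, with the parser configured without DTD loading, since the SGML DTD is no longer needed once the document is XML.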

Steven