I am using Spark XML (spark-xml) to parse a large document that contains a few user-defined entities. Here is a simplified snippet from the file:

<JMdict>
    <entry>
        <ent_seq>1000000</ent_seq>
        <r_ele>
            <reb>ヽ</reb>
        </r_ele>
        <sense>
            <pos>&unc;</pos>
            <gloss g_type="expl">repetition mark in katakana</gloss>
        </sense>
        <sense>
            <gloss xml:lang="dut">hitotsuten 一つ点: teken dat herhaling van het voorafgaande katakana-schriftteken aangeeft</gloss>
        </sense>
    </entry>
</JMdict>

The entities are correctly defined in the inline DTD of the XML document, for example:

<!ENTITY unc "unclassified">

However, parsing fails in the schema detection phase, and the detected schema contains only a corrupt-record column:

root
 |-- _corrupt_record: string (nullable = true)

The reason seems to be the user-defined entities: when I escape them (for example, &amp;unc;), the schema is detected correctly again:

root
 |-- ent_seq: string (nullable = true)
 |-- r_ele: struct (nullable = true)
 |    |-- reb: string (nullable = true)
 |-- sense: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- gloss: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _g_type: string (nullable = true)
 |    |    |    |-- _lang: string (nullable = true)
 |    |    |-- pos: string (nullable = true)

How can I address this?

fedmest
  • Sadly, there are many (so-called) XML parsers that do not support use of parsed entity references. – Michael Kay Sep 20 '19 at 21:05
  • @MichaelKay is there any way to access the underlying spark-xml parser so that I can configure it to use my entities, or just change the parser used by spark-xml so that I can choose one that works with them? – fedmest Sep 23 '19 at 14:33
  • No, sorry, I have no idea what parser it uses. I was just pointing out that lack of entity support is not uncommon. I took a quick look at the Spark XML documentation and couldn't find any information, which is itself a bit of a bad sign. – Michael Kay Sep 23 '19 at 21:10
  • Thanks @MichaelKay I also researched the docs and had a quick look at source, but without much joy – fedmest Sep 24 '19 at 18:12

1 Answer

Yes, spark-xml is not going to do things like read ENTITY declarations. The reason is that you really can't throw a regular XML parser at huge amounts of XML; if you can, there is no real need for Spark or spark-xml.

What spark-xml does is 'parse' the XML only enough to find the few subsets of it that you are interested in, and then pass each of those on to a full-fledged XML parser (StAX). So, within your row tag, the XML should be parsed correctly. However, the ENTITY declarations sit in the DTD at the root of the document, outside any row, so StAX never sees them.
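
The effect can be reproduced outside Spark. The sketch below uses Python's standard-library `xml.etree.ElementTree` purely as a stand-in for StAX (an assumption for illustration; it is not the parser spark-xml uses). A row fragment parsed on its own has no DTD attached, so `&unc;` is undefined, while the same fragment with the inline DTD in front of it resolves. The helper name `try_parse` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Row fragment as a parser sees it once it has been cut out of the document:
# the DOCTYPE with the ENTITY declarations is no longer attached.
fragment = "<entry><pos>&unc;</pos></entry>"

# The same fragment with an inline DTD still in front of it.
with_dtd = '<!DOCTYPE entry [<!ENTITY unc "unclassified">]>' + fragment


def try_parse(xml_text):
    """Return the text of <pos>, or None if the parser rejects the input."""
    try:
        return ET.fromstring(xml_text).findtext("pos")
    except ET.ParseError:
        return None


fragment_result = try_parse(fragment)   # entity is undefined here
full_doc_result = try_parse(with_dtd)   # entity is declared here
```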

Indeed, the use case here isn't even one big document but many, which could each carry different declarations.
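
Since the question notes that the file parses once the entity references are gone, one possible workaround is to expand them in a preprocessing step before handing the file to spark-xml. The following is a minimal sketch, not part of spark-xml: it assumes simple internal entities of the form `<!ENTITY name "value">` with plain-text values (no nested references, no parameter entities), and the names `collect_entities` and `expand_entities` are hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Simple internal general entities only: <!ENTITY name "value"> or 'value'.
# Parameter entities (<!ENTITY % ...>) deliberately do not match.
ENTITY_DECL = re.compile(
    r'<!ENTITY\s+([\w.:-]+)\s+(?:"([^"]*)"|\'([^\']*)\')\s*>'
)
# Named references such as &unc; (numeric references like &#38; do not match).
ENTITY_REF = re.compile(r"&([\w.:-]+);")


def collect_entities(xml_text):
    """Map entity names to their replacement text, as declared in the DTD."""
    entities = {}
    for name, double_quoted, single_quoted in ENTITY_DECL.findall(xml_text):
        entities[name] = double_quoted or single_quoted
    return entities


def expand_entities(xml_text, entities):
    """Replace each known &name; with its XML-escaped value."""
    def substitute(match):
        name = match.group(1)
        if name in entities:
            return escape(entities[name])
        # Leave built-ins (&amp;, &lt;, ...) and unknown names untouched.
        return match.group(0)

    return ENTITY_REF.sub(substitute, xml_text)


sample = """<!DOCTYPE JMdict [
<!ENTITY unc "unclassified">
]>
<JMdict><entry><pos>&unc;</pos><gloss>a &amp; b</gloss></entry></JMdict>"""

declared = collect_entities(sample)
expanded = expand_entities(sample, declared)
```

The expanded text can then be written back out and read with spark-xml as before. For a file too large to hold in memory, the same two functions could be applied in two passes: collect the declarations from the DTD first, then expand the references line by line.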

Sean Owen