I am using spark-xml to parse a large XML document that contains a few user-defined entities. Here is a simplified snippet from the file:
<JMdict>
<entry>
<ent_seq>1000000</ent_seq>
<r_ele>
<reb>ヽ</reb>
</r_ele>
<sense>
<pos>&unc;</pos>
<gloss g_type="expl">repetition mark in katakana</gloss>
</sense>
<sense>
<gloss xml:lang="dut">hitotsuten 一つ点: teken dat herhaling van het voorafgaande katakana-schriftteken aangeeft</gloss>
</sense>
</entry>
</JMdict>
The entities are correctly defined in the document's inline DTD, for example:
<!ENTITY unc "unclassified">
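To confirm the document itself is well-formed, here is a minimal reproduction (a hypothetical, cut-down snippet, not the real file) showing that Python's standard library parser expands such internally defined entities without complaint:

```python
import xml.etree.ElementTree as ET

# A tiny document with an inline DTD entity, mirroring the JMdict structure.
doc = """<!DOCTYPE JMdict [
  <!ENTITY unc "unclassified">
]>
<JMdict><entry><sense><pos>&unc;</pos></sense></entry></JMdict>"""

root = ET.fromstring(doc)
# The entity reference is resolved to its declared value.
print(root.find("./entry/sense/pos").text)  # unclassified
```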
However, parsing fails during schema inference: every record is treated as corrupt.
root
|-- _corrupt_record: string (nullable = true)
The cause seems to be the user-defined entities: when I escape the ampersand (writing &amp;unc; instead of &unc;), the schema is inferred correctly again:
root
|-- ent_seq: string (nullable = true)
|-- r_ele: struct (nullable = true)
| |-- reb: string (nullable = true)
|-- sense: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- gloss: struct (nullable = true)
| | | |-- _VALUE: string (nullable = true)
| | | |-- _g_type: string (nullable = true)
| | | |-- _lang: string (nullable = true)
| | |-- pos: string (nullable = true)
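For completeness, the escaping workaround can be reproduced with a small preprocessing step before handing the file to Spark. This is only a sketch: the entity-name pattern and the set of predefined XML entities to leave untouched are my own choices, not part of spark-xml:

```python
import re

# The five entities predefined by XML itself, which must not be re-escaped.
PREDEFINED = frozenset({"amp", "lt", "gt", "quot", "apos"})

def escape_custom_entities(text: str) -> str:
    """Turn &name; into &amp;name; for every non-predefined entity,
    so the parser no longer needs to resolve it against the DTD."""
    def repl(match: re.Match) -> str:
        name = match.group(1)
        if name in PREDEFINED:
            return match.group(0)  # leave &amp; etc. alone
        return "&amp;" + name + ";"
    return re.sub(r"&([A-Za-z][\w.-]*);", repl, text)

print(escape_custom_entities("<pos>&unc;</pos>"))  # <pos>&amp;unc;</pos>
print(escape_custom_entities("a &amp; b"))         # a &amp; b
```

This makes the file parseable, but the entity names then appear literally in the data (e.g. `&unc;` instead of `unclassified`), which is not what I want.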
How can I make spark-xml resolve these entities against the inline DTD, without pre-escaping the whole file?