
I'm attempting to parse a large Japanese to English dictionary written in XML. A typical entry looks like this:

<entry>
<ent_seq>1486440</ent_seq>
<k_ele>
<keb>美術</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf02</ke_pri>
</k_ele>
<r_ele>
<reb>びじゅつ</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf02</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<pos>&adj-no;</pos>
<gloss>art</gloss>
<gloss>fine arts</gloss>
</sense>
<sense>
<gloss xml:lang="dut">kunst</gloss>
<gloss xml:lang="dut">schone kunsten</gloss>
</sense>
<sense>
<gloss xml:lang="fre">art</gloss>
<gloss xml:lang="fre">beaux-arts</gloss>
</sense>
<sense>
<gloss xml:lang="ger">Kunst</gloss>
<gloss xml:lang="ger">die schönen Künste</gloss>
<gloss xml:lang="ger">bildende Kunst</gloss>
</sense>
<sense>
<gloss xml:lang="ger">Produktionsdesign</gloss>
<gloss xml:lang="ger">Szenographie</gloss>
</sense>
<sense>
<gloss xml:lang="hun">művészet</gloss>
<gloss xml:lang="hun">művészeti</gloss>
<gloss xml:lang="hun">művészi</gloss>
<gloss xml:lang="hun">rajzóra</gloss>
<gloss xml:lang="hun">szépművészet</gloss>
</sense>
<sense>
<gloss xml:lang="rus">изящные искусства; искусство</gloss>
<gloss xml:lang="rus">{~{的}} художественный, артистический</gloss>
</sense>
<sense>
<gloss xml:lang="slv">umetnost</gloss>
<gloss xml:lang="slv">likovna umetnost</gloss>
</sense>
<sense>
<gloss xml:lang="spa">bellas artes</gloss>
</sense>
</entry>

I've written a deserialiser based on code provided by djv in this answer, and it does indeed deserialise the entire dictionary into a series of class objects. Here is the code I've got so far:

ReadOnly jmdictpath As String = "JMdict"

<XmlRoot>
Public Class JMdict
    <XmlElement("entry")>
    Public Property entrylist As List(Of entry)
End Class

<Serializable()>
Public Class entry
    Public Property ent_seq As Integer
    Public Property k_ele As k_ele
    Public Property r_ele As r_ele
    <XmlElement("sense")>
    Public Property senselist As List(Of sense)
End Class

<Serializable()>
Public Class k_ele
    Public Property keb As String
    Public Property ke_pri As List(Of String)
    Public Property ke_inf As List(Of String)
End Class

<Serializable()>
Public Class r_ele
    Public Property reb As String
    Public Property re_pri As List(Of String)
    Public Property re_inf As List(Of String)
End Class

<Serializable()>
Public Class sense
    <XmlElement("pos")>
    Public Property pos As List(Of String)
    <XmlElement("gloss")>
    Public Property gloss As List(Of gloss)
End Class

<Serializable()>
Public Class gloss
    <XmlAttribute("xml:lang")>
    Public Property lang As String
    <XmlAttribute("g_type")>
    Public Property g_type As String
    <XmlText>
    Public Property Text As String
    Public Overrides Function ToString() As String
        Return Text
    End Function
End Class

Dim dict As JMdict

Sub Deserialise()
    Dim serialiser As New XmlSerializer(GetType(JMdict))
    Using sr As New StreamReader(jmdictpath)
        dict = CType(serialiser.Deserialize(sr), JMdict)
    End Using
End Sub

When I run the code, however, I get the following error:

System.InvalidOperationException: 'There is an error in XML document (415, 7).'

XmlException: Unexpected node type EntityReference. ReadElementString method can only be called on elements with simple or empty content. Line 415, position 7.

I've checked the XML, and line 415 is this line:

 <pos>&unc;</pos>

So the deserialiser is having problems reading the <pos> tag, and I tried a few things to get around it.

First, I tried removing the <XmlElement> attribute from pos in the sense class. This got rid of the error, but the deserialiser then didn't read any pos data for any of the entries.

Second, I checked on StackOverflow and found this related question where OP had the same problem. The accepted answer in this question suggested splitting the data into further classes, so I tried that too, and created a new pos class:

<Serializable()>
Public Class sense
    <XmlElement("pos")>
    Public Property pos As List(Of pos)
    <XmlElement("gloss")>
    Public Property gloss As List(Of gloss)
End Class

<Serializable()>
Public Class pos
    <XmlText>
    Public Property Text As String
    Public Overrides Function ToString() As String
        Return Text
    End Function
End Class

And once again, while this caused no errors, the pos element was blank in every entry. Each pos tag only contains one value - although there can be more than one pos tag per sense tag - so I didn't think it should need its own class object. In any case, this answer didn't solve my problem, hence why I'm asking this question.

I am completely new to XML deserialisation, and don't really understand what I'm doing in-depth - I'm trying to figure out the mechanics of it based on this helpful answer, but I'm obviously doing something wrong here. Any advice would be appreciated.

Lou
  • I guess `&unc;` is supposed to be some kind of escaped character, but I can't figure out what it is. Do you know? – djv Mar 19 '20 at 18:00
  • According to the doctype definition, it just means "unclassified": `<!ENTITY unc "unclassified">` . All of the values for the `pos` tag have an `&` before and `;` after – Lou Mar 19 '20 at 18:07
  • You evidently have the Document Type Definition file available; is it properly referenced in the XML's `DOCTYPE` tag? i.e.: ` `? If so, this just requires a simple fix: create an `XmlReader` with the proper settings, including an `XmlUrlResolver`. – TnTinMn Mar 19 '20 at 22:01
  • Oh no, the doctype definition is in the same file as all the other XML. It's all in one single file. – Lou Mar 19 '20 at 22:48
  • "the doctype definition is in the same file as all the other XML" - same principle, but one less step as external. I'll post an example soon. – TnTinMn Mar 19 '20 at 22:58
  • That would be great :). As I said in the OP I'm completely new to XML parsing and don't really know what I'm doing! So any input is helpful. – Lou Mar 19 '20 at 22:58

1 Answer

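You just need to deserialise through an XmlReader created with properly configured XmlReaderSettings, rather than passing a StreamReader straight to the XmlSerializer. The only setting you need to change is the DtdProcessing property, which must be set to DtdProcessing.Parse so that the reader processes the DTD and expands entities such as &unc; into their declared text.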

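' Requires: Imports System.IO, System.Xml, System.Xml.Serialization
' (Application.StartupPath comes from System.Windows.Forms.)

' Allow the reader to process the DTD so that entity references are expanded.
Dim settings As XmlReaderSettings = New XmlReaderSettings()
settings.DtdProcessing = DtdProcessing.Parse

Dim xmlPath As String = Path.Combine(Application.StartupPath, "yourfilename.xml")

Dim ser As New XmlSerializer(GetType(JMdict))

Dim JMdictInstance As JMdict
' Deserialise through the configured XmlReader instead of a plain StreamReader.
Using rdr As XmlReader = XmlReader.Create(xmlPath, settings)
    JMdictInstance = CType(ser.Deserialize(rdr), JMdict)
End Using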
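
To see the effect in isolation, here is a minimal console sketch that deserialises a tiny in-memory document instead of the real dictionary. The `MiniDict` class, the `EntityDemo` module and the XML string are hypothetical stand-ins made up for illustration; only the `unc` entity declaration is taken from the comments on the question.

Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Serialization

' Hypothetical cut-down model: just a root element holding <pos> elements.
<XmlRoot("JMdict")>
Public Class MiniDict
    <XmlElement("pos")>
    Public Property pos As List(Of String)
End Class

Module EntityDemo
    Sub Main()
        ' Made-up miniature document with the entity declared in an internal DTD subset.
        Dim xml As String =
            "<!DOCTYPE JMdict [<!ENTITY unc ""unclassified"">]>" &
            "<JMdict><pos>&unc;</pos></JMdict>"

        ' Without DtdProcessing.Parse the reader rejects the DOCTYPE declaration.
        Dim settings As New XmlReaderSettings With {.DtdProcessing = DtdProcessing.Parse}
        Dim ser As New XmlSerializer(GetType(MiniDict))

        Using sr As New StringReader(xml)
            Using rdr As XmlReader = XmlReader.Create(sr, settings)
                Dim d As MiniDict = CType(ser.Deserialize(rdr), MiniDict)
                ' The serialiser receives the entity's replacement text as ordinary text.
                Console.WriteLine(d.pos(0))
            End Using
        End Using
    End Sub
End Module

Because the reader expands the entity before the serialiser sees it, pos can stay a plain List(Of String); no separate pos class is needed.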
TnTinMn
  • Thanks! I've just tried this, and it does work, although for some reason I needed to change it to `XML.XMLReader` even though I've already got `System.XML` and `System.XML.Serialization` declared. It now reads the `pos` correctly! – Lou Mar 19 '20 at 23:19
  • @Lou, I'm glad to hear that. FYI: if you ever need to access an external DTD, just set `settings.XmlResolver = New XmlUrlResolver With {.Credentials = CredentialCache.DefaultCredentials}` to allow it to access the DTD declared in the DOCTYPE tag. Also, it was nice to see a question with most of the needed info on this site for once. – TnTinMn Mar 19 '20 at 23:29
  • Ha, no problem. I'm looking forward to playing around with the deserialiser; I built a rusty string-based parser without knowing anything about deserialisation, and I didn't realise that if you do it properly you can see all the information that was defined in the DTD, not just the tags! – Lou Mar 20 '20 at 00:03