
I'm writing a VB.NET application to parse a large XML file which is a Japanese dictionary. I'm completely new to XML parsing and don't really know what I'm doing. The whole dictionary fits between two XML tags <jmdict> and </jmdict>. The next level is the <entry>, which contains all information for the 1 million entries, including the form, pronunciation, meaning of the word and so on.

A typical entry might look like this:

<entry>
<ent_seq>1486440</ent_seq>
<k_ele>
<keb>美術</keb>
<ke_pri>ichi1</ke_pri>
<ke_pri>news1</ke_pri>
<ke_pri>nf02</ke_pri>
</k_ele>
<r_ele>
<reb>びじゅつ</reb>
<re_pri>ichi1</re_pri>
<re_pri>news1</re_pri>
<re_pri>nf02</re_pri>
</r_ele>
<sense>
<pos>&n;</pos>
<pos>&adj-no;</pos>
<gloss>art</gloss>
<gloss>fine arts</gloss>
</sense>
<sense>
<gloss xml:lang="dut">kunst</gloss>
<gloss xml:lang="dut">schone kunsten</gloss>
</sense>
<sense>
<gloss xml:lang="fre">art</gloss>
<gloss xml:lang="fre">beaux-arts</gloss>
</sense>
<sense>
<gloss xml:lang="ger">Kunst</gloss>
<gloss xml:lang="ger">die schönen Künste</gloss>
<gloss xml:lang="ger">bildende Kunst</gloss>
</sense>
<sense>
<gloss xml:lang="ger">Produktionsdesign</gloss>
<gloss xml:lang="ger">Szenographie</gloss>
</sense>
<sense>
<gloss xml:lang="hun">művészet</gloss>
<gloss xml:lang="hun">művészeti</gloss>
<gloss xml:lang="hun">művészi</gloss>
<gloss xml:lang="hun">rajzóra</gloss>
<gloss xml:lang="hun">szépművészet</gloss>
</sense>
<sense>
<gloss xml:lang="rus">изящные искусства; искусство</gloss>
<gloss xml:lang="rus">{~{的}} художественный, артистический</gloss>
</sense>
<sense>
<gloss xml:lang="slv">umetnost</gloss>
<gloss xml:lang="slv">likovna umetnost</gloss>
</sense>
<sense>
<gloss xml:lang="spa">bellas artes</gloss>
</sense>
</entry>

I have a class, Entry, which stores all of the information contained in an entry like the one above. I know what all the tags mean and have no trouble interpreting the data semantically; I'm just not sure what tools I need to actually parse all of this information.

For example, how should I extract the contents of the <ent_seq> tag at the beginning? And is the method used to extract information from an XML tag the same even if it's contained within a parent tag, as with the <keb> and <ke_pri> tags, which are contained within the <k_ele> tags? Or should I use a different method?

I know this reads like homework help - I'm not asking for someone to provide the complete solution and build the parser. I just don't know where to start and what tools to use. I'd really appreciate some guidance on what methods I need to start parsing the XML file, and then I'll work on building the solution myself once I know what I'm doing.

-

Edit

So I've come across this code from this website which uses XMLReader to go through one node at a time:

Dim readXML As XmlReader = XmlReader.Create(New StringReader(xmlNode))
While readXML.Read()
    Select Case readXML.NodeType
        Case XmlNodeType.Element
            ListBox1.Items.Add("<" + readXML.Name & ">")
            Exit Select
        Case XmlNodeType.Text
            ListBox1.Items.Add(readXML.Value)
            Exit Select
        Case XmlNodeType.EndElement
            ListBox1.Items.Add("")
            Exit Select
    End Select
End While

But I get the error on the first line

'XmlNode' is a class type and cannot be used as an expression

I'm not exactly sure what to do about this error - any ideas?

Lou
  • Where is the data going to be stored after being read - e.g. is this to transfer it to a database? It might be that [VB.Net Xml Desearialization into a Class](https://stackoverflow.com/q/45168499/1115360) has the information you need. If you're going to get it to create the classes for you, I suggest using a sample with just 3 or 4 `<entry>` elements so that it can tell which items need to be plural. – Andrew Morton Feb 13 '20 at 11:06
  • So I already wrote a separate program to parse the XML and put it into a Database - for that I just basically used text substitution methods, no actual XML methods. That method takes about 30 seconds to complete, hence why I want to write a faster method. The program I'm writing down is just a Winform to view the dictionary, so once the data is read it's not going to go anywhere else. – Lou Feb 13 '20 at 11:17
  • Thanks for the suggestion - I tried the method listed in the answer to paste the XML as classes, but Visual Studio said that the XML isn't valid. It read this line: `&n;`, and said that there's an invalid entity 'n'. Weird. In any case, I already have a class structure and don't need a new one - the main thing I need to know is how to parse XML properly so that I can store it in the class objects. – Lou Feb 13 '20 at 11:21
  • The answers to that question also show how to deserialize the XML, given a compatible class. – Andrew Morton Feb 13 '20 at 11:27
  • Ah okay, so I didn't realise that what I'm trying to do is called deserialisation, that's helpful to know. I've also learned that I can store the whole of the XML file into an `XDocument` type using `XDocument.Load` - I'm not sure what I can do to deserialise this yet though. I'll see if I can figure anything out from the linked answer. – Lou Feb 13 '20 at 11:36
  • So from more research it looks like my two main options for deserialisation are using XMLReader and XDocument, with the former being faster. So I'll have a go at implementing the XMLReader and come back if I get stuck :) – Lou Feb 13 '20 at 11:43
  • I've found some code which uses the XMLReader option to go through each node one at a time, but it returns an error on the first line. I've edited it into my post. If you have any ideas how I can make it work, @AndrewMorton, I'd be grateful – Lou Feb 13 '20 at 11:53
  • If you follow the link to the "Full Source" on the page you referred to, you'll see that `xmlNode` is a variable which was populated earlier. I would try to avoid using names of class types as names of variables if I were you. – Andrew Morton Feb 13 '20 at 12:30
  • `deserialisation are using XMLReader and XDocument` is not accurate. The namespace [System.Xml.Serialization](https://learn.microsoft.com/en-us/dotnet/api/system.xml.serialization?view=netframework-4.8) defines the serialization / deserialization classes, namely the [XmlSerializer class](https://learn.microsoft.com/en-us/dotnet/api/system.xml.serialization.xmlserializer?view=netframework-4.8). Check it out. Note, this is what's used in the question linked by @AndrewMorton – djv Feb 13 '20 at 15:04
  • You can paste special as xml class. But you should do a couple of things to the xml above prior to that. The `&` escaping `n;` is invalid without some additional schema, I guess. Visual Studio won't recognize it. You can change `&n;` to `&amp;n;`. Also you should add a root element such as the `<jmdict>` you mentioned. And to let the tool know there are multiple `<entry>` elements you should have at least two of them. – djv Feb 13 '20 at 16:06

1 Answer


You can use these classes to deserialize your xml quickly

Imports System.IO
Imports System.Xml.Serialization
<XmlRoot>
Public Class jmdict
    <XmlElement("entry")>
    Public Property entries As List(Of entry)
End Class
Public Class entry
    Public Property ent_seq As Integer
    Public Property k_ele As k_ele
    Public Property r_ele As r_ele
    <XmlElement("sense")>
    Public Property senses As List(Of sense)
End Class
Public Class sense
    <XmlElement("pos")>
    Public Property posses As List(Of String)
    <XmlElement("gloss")>
    Public Property glosses As List(Of gloss)
End Class
Public Class k_ele
    Public Property keb As String
    <XmlElement("ke_pri")>
    Public Property ke_pris As List(Of String)
End Class
Public Class r_ele
    Public Property reb As String
    <XmlElement("re_pri")>
    Public Property re_pris As List(Of String)
End Class
Public Class gloss
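    ' xml:lang belongs to the reserved XML namespace, so use the local name plus that namespace
    <XmlAttribute("lang", Namespace:="http://www.w3.org/XML/1998/namespace")>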
    Public Property lang As String
    <XmlText>
    Public Property Text As String
    Public Overrides Function ToString() As String
        Return Text
    End Function
End Class

The code to deserialize is

Dim serializer As New XmlSerializer(GetType(jmdict))
Dim d As jmdict
Using sr As New StreamReader("filename.xml")
    d = CType(serializer.Deserialize(sr), jmdict)
End Using

Now you can iterate over each entry, and the entries' senses, and the senses' glosses

For Each e In d.entries
    Console.WriteLine($"seq: {e.ent_seq}")
    For Each s In e.senses
        For Each g In s.glosses
            Console.WriteLine($"Text: {g.Text}, Lang: {g.lang}")
        Next
    Next
Next
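
One caveat for the deserialization step: the `<pos>` values such as `&n;` are entity references, which only resolve if the document declares them. The sketch below assumes the real file starts with a DOCTYPE that declares those entities (the excerpt in the question does not show one) and that the file is trusted. `MiniDict`, `MiniEntry`, and the inline sample XML are hypothetical stand-ins that keep the example self-contained; the same `XmlReaderSettings` could be used with the `jmdict` class above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Xml
Imports System.Xml.Serialization

' Hypothetical, trimmed-down stand-ins for the jmdict/entry classes.
<XmlRoot("jmdict")>
Public Class MiniDict
    <XmlElement("entry")>
    Public Property Entries As List(Of MiniEntry)
End Class

Public Class MiniEntry
    <XmlElement("ent_seq")>
    Public Property EntSeq As Integer
    <XmlElement("pos")>
    Public Property Pos As List(Of String)
End Class

Module DtdDemo
    Sub Main()
        ' Made-up sample: the DOCTYPE declares the entity "n" with illustrative text.
        Dim xml As String =
            "<!DOCTYPE jmdict [<!ENTITY n ""noun (common)"">]>" &
            "<jmdict><entry><ent_seq>1486440</ent_seq><pos>&n;</pos></entry></jmdict>"

        ' Allow the reader to process the DTD so declared entities are expanded.
        ' MaxCharactersFromEntities = 0 means "no limit"; only do this for trusted files.
        Dim settings As New XmlReaderSettings With {
            .DtdProcessing = DtdProcessing.Parse,
            .MaxCharactersFromEntities = 0
        }

        Dim serializer As New XmlSerializer(GetType(MiniDict))
        Dim d As MiniDict
        Using reader As XmlReader = XmlReader.Create(New StringReader(xml), settings)
            d = CType(serializer.Deserialize(reader), MiniDict)
        End Using

        ' The reader expands the entity, so Pos holds the declared text, not "&n;".
        Console.WriteLine($"seq: {d.Entries(0).EntSeq}, pos: {d.Entries(0).Pos(0)}")
    End Sub
End Module
```

For the real file, `XmlReader.Create` also accepts a file path in place of the `StringReader`.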

The reasons your code takes so long are

  1. You are parsing the XML as a string
  2. You are inserting lines into a ListBox as you parse them

What do you want to put in the ListBox? If you have deserialized as I show, you can databind a specific list from the data, or a queried result of multiple lists.
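
As a rough illustration of binding "a queried result of multiple lists", the sketch below flattens entries and their glosses into display strings with LINQ. `DemoEntry` and `DemoGloss` are hypothetical simplified stand-ins for the classes above, the data is built in memory instead of deserialized, and treating a missing `xml:lang` as English is an assumption based on the sample entry in the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

' Hypothetical simplified stand-ins for the entry/gloss classes.
Public Class DemoGloss
    Public Property Lang As String
    Public Property Text As String
End Class

Public Class DemoEntry
    Public Property EntSeq As Integer
    Public Property Glosses As New List(Of DemoGloss)
End Class

Module BindingDemo
    Sub Main()
        ' Sample data taken from the entry shown in the question.
        Dim entries As New List(Of DemoEntry) From {
            New DemoEntry With {
                .EntSeq = 1486440,
                .Glosses = New List(Of DemoGloss) From {
                    New DemoGloss With {.Text = "art"},
                    New DemoGloss With {.Text = "fine arts"},
                    New DemoGloss With {.Lang = "ger", .Text = "Kunst"}
                }
            }
        }

        ' Flatten entries and glosses into one list of strings,
        ' keeping only glosses that carry no language attribute.
        Dim rows As List(Of String) =
            entries.
                SelectMany(Function(e) e.Glosses.
                    Where(Function(g) String.IsNullOrEmpty(g.Lang)).
                    Select(Function(g) $"{e.EntSeq}: {g.Text}")).
                ToList()

        ' In the WinForms app, the equivalent step would be: ListBox1.DataSource = rows
        For Each row As String In rows
            Console.WriteLine(row)
        Next
    End Sub
End Module
```

Assigning the finished list to `DataSource` once avoids adding items to the ListBox one at a time during parsing.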

djv
  • Thanks for the help and apologies for the delayed response - I've finally had a chance to test the code and it does deserialise the whole dictionary in a fraction of the time it took before. I'm getting lots of problems with implementing the deserialiser, but they're all separate questions, so I'll accept this for now and link back to it. – Lou Mar 19 '20 at 17:09