Parsing big XML file with SAX parser, the class is becoming bloated and unreadable - How to fix this?

Question

This is purely a code readability related question, the performance of the class is not an issue.

Here is how I am building this XMLHandler :

For each element that is relevant to the application, I have a boolean in'ElementName' which I set to true or false depending on my location during the parsing : Problem, I now have 10+ boolean declaration at the beginning of my class and it is getting bigger and bigger.

In my startElement and in my endElement method, I have hundreds of line of

if (qName = "elementName") {
   ...
} else if (qName = "anotherElementName") {
   ...
}

with different parsing rules in them (if I am in this position in the xml file, do this, otherwise, do this etc...)

Coding new parsing rules and debugging is becoming increasingly painfull.

What are the best practices for coding a sax parser, and what can I do to make my code more readable ?

Just out of curiosity, is there a reason why you chose to go the SAX route, rather than DOM parsers. Is the file size > a few 100 KB? I only wonder b/c for small enough files, any performance gain is almost negligible. Same goes for xml to java binding solutions like some of the comments have mentioned — Java Drinker, Aug 26 '10 at 13:05
The XML is between 30k (2000KB) and 100k (5000KB) lines long. God bless XML -_- — SAKIROGLU Koray, Aug 27 '10 at 08:17

Adrian Pronk · Answer 1 · 2010-08-30T20:51:41.010

What do you use the boolean variables for? To keep track of nesting?

I recently implemented this by using an enum for every element. The code is at work but this is a rough approximation of it off the top of my head:

enum Element {
   // special markers:
   ROOT,
   DONT_CARE,

   // Element               tag                  parents
   RootElement(             "root"               ROOT),
   AnElement(               "anelement"),     // DONT_CARE
   AnotherElement(          "anotherelement"),// DONT_CARE
   AChild(                  "child",             AnElement),
   AnotherChild(            "child",             AnotherElement);

   Element() {...}
   Element(String tag, Element ... parents) {...}
}

class MySaxParser extends DefaultHandler {
    Map<Pair<Element, String>, Element> elementMap = buildElementMap();
    LinkedList<Element> nestingStack = new LinkedList<Element>();

    public void startElement(String namespaceURI, String sName, String qName, Attributes attrs) {
        Element parent = nestingStack.isEmpty() ? ROOT : nestingStack.lastElement();
        Element element = elementMap.get(pair(parent, sName));
        if (element == null)
            element = elementMap.get(DONT_CARE, sName);
        if (element == null)
            throw new IllegalStateException("I did not expect <" + sName + "> in this context");

        nestingStack.addLast(element);

        switch (element) {
        case RootElement: ... // Probably don't need cases for many elements at start unless we have attributes
        case AnElement: ...
        case AnotherElement: ...
        case AChild: ...
        case AnotherChild: ...
        default: // Most cases here. Generally nothing to do on startElement
        }
    }
    public void endElement(String namespaceURI, String sName, String qName) {
        // Similar to startElement() but most switch cases do something with the data.
        Element element = nestingStack.removeLast();
        if (!element.tag.equals(sName)) throw IllegalStateException();
        switch (element) {
           ...
        }
    }

    // Construct the structure map from the parent information.
    private Map<Pair<Element, String>, Element> buildElementMap() {
        Map<Pair<Element, String>, Element> result = new LinkedHashMap<Pair<Element, String>, Element>();
        for (Element element: Element.values()) {
            if (element.tag == null) continue;
            if (element.parents.length == 0)
                result.put(pair(DONT_CARE, element.tag), element);
            else for (Element parent: element.parents) {
                result.put(pair(parent, element.tag), element);
            }
        }
        return result;
    }
    // Convenience method to avoid the need for using "new Pair()" with verbose Type parameters 
    private <A,B> Pair<A,B> pair(A a, B b) {
        return new Pair<A, B>(a, b);
    }
    // A simple Pair class, just for completeness.  Better to use an existing implementation.
    private static class Pair<A,B> {
        final A a;
        final B b;
        Pair(A a, B b){ this.a = a; this.b = b;}
        public boolean equals(Object o) {...};
        public int hashCode() {...};
    }
}

Edit:
The position within the XML structure is tracked by a stack of elements. When startElement is called, the appropriate Element enum can be determined by using 1) the parent element from the tracking stack and 2) the element tag passed as the sName parameter as the key to a Map generated from the parent information defined as part of the Element enum. The Pair class is simply a holder for the 2-part key.

This approach allows the same element-tag that appears repeatedly in different parts of the XML structure with different semantics to be represented by different Element enums. For example:

<root>
  <anelement>
    <child>Data pertaining to child of anelement</child>
  </anelement>      
  <anotherelement>
    <child>Data pertaining to child of anotherelement</child>
  </anotherelement>
</root>

Using this technique, we don't need to use flags to track the context so that we know which <child> element is being processed. The context is declared as part of the Element enum definition and reduces confusion by eliminating assorted state variables.

I understand using a linked map for tracking my position in the XML file, but what is the purpose of the Pair ? — SAKIROGLU Koray, Aug 30 '10 at 08:30
@SAKIROGLU Koray: The Pair is simply a holder for a two-value Map key. I've edited my answer to clarify that. — Adrian Pronk, Aug 30 '10 at 20:33

score 0 · Answer 2 · answered Aug 26 '10 at 10:00

0

It depends on the XML structure. If the actions for different cases are easy or (more or less) "independent", you could try to use a map:

interface Command {
   public void assemble(Attributes attr, MyStructure myStructure);
}
...

Map<String, Command> commands= new HashMap<String, Command>();
...
if(commands.contains(qName)) {
   commands.get(qname).assemble(attr, myStructur);
} else {
   //unknown qName
}

answered Aug 26 '10 at 10:00

Landei

54,104
13
100
195

So if I follow this right, I have to implement a different assemble method for each qName ? This would indeed reduce my list of if/else statement to 6 lines, but I don't really know how you would go to have different actions for each command, any example ? – SAKIROGLU Koray Aug 26 '10 at 12:32
It is hard to say if this style is actually better than the if-else-cascade. It really all depends on what you need to do in every case. If you build one big object or have to switch some setting, the Map style is probably better. If you have complicated interactions between the cases or need a lot of intermediate data structures, it gets probably too messy. – Landei Aug 26 '10 at 12:47

score 0 · Answer 3 · answered Aug 26 '10 at 10:08

0

I would fallback to JAXB or something equivalent and let the framework do the work.

answered Aug 26 '10 at 10:08

Claude Vedovini

2,421
21
19

Can you be more specific ? I don't really know much about JAXB, Is it as efficient or more efficient than SAX ? How would this solve my problem ? If I understand right what I read about it, I would only have to create a schema for the mapping and then let JAXB do all the parsing ? – SAKIROGLU Koray Aug 27 '10 at 12:05
use JAXB to create a Java DOM from the schema then use it to parse documents into DOM instances. Internally it uses SAX to parse so it should be about the same performance wise. However at the end of the parsing you'll get an in-memory object tree so depending of what you were actually doing before you might end up with more memory needed. And this might be a problem because the bigger the document the more memory you'll need. – Claude Vedovini Aug 31 '10 at 20:49

Parsing big XML file with SAX parser, the class is becoming bloated and unreadable - How to fix this?

3 Answers3