
I'm having trouble parsing some Chinese characters that are written as numeric character references (HTML-style Unicode escapes) inside XML files.

I'm using Java ME with javax.xml.parsers.SAXParser.

One such character is 词:

<test>&#35789;</test>


Info about it: http://www.isthisthingon.org/unicode/index.php?page=08&subpage=B&glyph=08BCD

But strangely 后

<test>&#21518;</test>

is working fine.

Directly embedding <test>词</test> also works.

My test midlet has the following source code:

import java.io.InputStream;
import javax.microedition.midlet.MIDlet;
import javax.microedition.midlet.MIDletStateChangeException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.helpers.DefaultHandler;


public class jaxp extends MIDlet {

    public jaxp() {
    }

    protected void destroyApp(boolean unconditional)
            throws MIDletStateChangeException {
    }

    protected void pauseApp() {
    }

    protected void startApp() throws MIDletStateChangeException {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            DefaultHandler handler = new DefaultHandler() {};
            String fileName = "test.xml";
            InputStream is = jaxp.class.getResourceAsStream("/" + fileName);
            saxParser.parse(is, handler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

It's dying with:

org.xml.sax.SAXParseException: 
at org.xml.sax.helpers.DefaultHandler.fatalError(+1)
at com.sun.ukit.jaxp.Parser.panic(+18)
at com.sun.ukit.jaxp.Parser.ent(+586)
at com.sun.ukit.jaxp.Parser.elm(+434)
at com.sun.ukit.jaxp.Parser.parse(+199)
at com.sun.ukit.jaxp.Parser.parse(+47)
at com.sun.ukit.jaxp.Parser.parse(+31)
at jaxp.startApp(+83)
at javax.microedition.midlet.MIDletProxy.startApp(+7)
at com.nokia.mid.impl.isa.ui.MIDletManager.callStartApp(+4)
at com.nokia.mid.impl.isa.ui.MIDletManager.activateMIDlet(+10)
at com.nokia.mid.impl.isa.ui.MIDletManager.run(+15)

I'd appreciate any ideas.

Carl
  • I'm not an ME programmer, but on regular Java, a SAXParseException gives some information about the cause, which might be useful here. – Ed Staub Jul 26 '11 at 00:35

1 Answer

I am obviously late with this answer. Nevertheless, for the record...

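I wrote this parser a good few years ago. In the JSR 172 version of the parser, the method ent used Short.parseShort to convert a numeric entity value to a char. If the value is above 32767, Short.parseShort throws a NumberFormatException. That exception is caught in ent and leads to a call to the method panic, which is what the stack trace shows. It also explains why &#35789; (词, 35789) fails while &#21518; (后, 21518) works: only the first value exceeds 32767.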
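The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the parser's actual code: the class CharRefDemo and its methods are hypothetical names made up for illustration. It only shows that Short.parseShort rejects 35789 while accepting 21518, and that parsing into an int and casting to char yields the expected character.

```java
public class CharRefDemo {

    // Hypothetical helper mimicking the approach described in the answer:
    // Short.parseShort throws NumberFormatException for values above 32767.
    static char decodeWithShort(String digits) {
        return (char) Short.parseShort(digits);
    }

    // Hypothetical alternative: parse into an int, check the range, then cast.
    static char decodeWithInt(String digits) {
        int value = Integer.parseInt(digits);
        if (value < 0 || value > 0xFFFF) {
            throw new IllegalArgumentException("Outside the 16-bit char range: " + value);
        }
        return (char) value;
    }

    // Returns true when the short-based decoding fails for the given digits.
    static boolean shortParseFails(String digits) {
        try {
            decodeWithShort(digits);
            return false;
        } catch (NumberFormatException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(shortParseFails("21518"));           // false (后, below 32767)
        System.out.println(shortParseFails("35789"));           // true  (词, above 32767)
        System.out.println(decodeWithInt("35789") == '\u8BCD'); // true
    }
}
```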

A more recent version of the parser was used in JSR 280. That version should handle values above 32767 correctly.
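If switching to the newer parser is not an option, one possible workaround (an assumption, not something the answer proposes) builds on the question's observation that the literal character parses fine: replace the affected decimal references with literal characters before handing the text to the parser. The sketch below uses hypothetical names (NumericRefExpander, expandHighDecimalRefs) and is written for Java SE; on Java ME, StringBuilder would need to be swapped for StringBuffer. It deliberately leaves values up to 32767 alone, so references such as &#60; or &#38; stay escaped, and it does not special-case CDATA sections or comments.

```java
public class NumericRefExpander {

    // Replaces decimal references above 32767 (e.g. &#35789;) with the literal
    // character. Everything else, including hex references, is copied verbatim.
    static String expandHighDecimalRefs(String xml) {
        StringBuilder out = new StringBuilder(xml.length());
        int i = 0;
        while (i < xml.length()) {
            int start = xml.indexOf("&#", i);
            if (start < 0) {
                out.append(xml.substring(i)); // no more references
                break;
            }
            out.append(xml.substring(i, start));
            int end = xml.indexOf(';', start + 2);
            String digits = (end < 0) ? "" : xml.substring(start + 2, end);
            int value = parseDecimal(digits);
            boolean surrogate = value >= 0xD800 && value <= 0xDFFF;
            if (value > Short.MAX_VALUE && value <= 0xFFFD && !surrogate) {
                out.append((char) value);     // expand to the literal character
                i = end + 1;
            } else {
                out.append("&#");             // keep the reference untouched
                i = start + 2;
            }
        }
        return out.toString();
    }

    // Parses 1 to 5 decimal digits; returns -1 for anything else.
    static int parseDecimal(String s) {
        if (s.length() == 0 || s.length() > 5) {
            return -1;
        }
        int value = 0;
        for (int k = 0; k < s.length(); k++) {
            char c = s.charAt(k);
            if (c < '0' || c > '9') {
                return -1;
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }
}
```

The expanded text would then have to be passed to the parser in an encoding that matches the XML declaration, for example as UTF-8 bytes wrapped in a ByteArrayInputStream.").equals("<test>\u8BCD</test>");
assert NumericRefExpander.expandHighDecimalRefs("<test>&#21518;</test>").equals("<test>&#21518;</test>");
assert NumericRefExpander.expandHighDecimalRefs("a &#60; b &#x8BCD; c").equals("a &#60; b &#x8BCD; c");
assert NumericRefExpander.expandHighDecimalRefs("no refs").equals("no refs");
</test>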

Misha